An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility based on DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Hi there,
I am working on building a new dataset in Spanish (a polysyllabic language). I have gone through MakeDiffSinger but I still have some gaps. I would be grateful if you could sanity-check my understanding and share any thoughts you might have.
Questions for clarification:
_phseq: Are these sequences of phonemes or syllables?
I am currently using phonemes and their timestamps as provided by MFA, with a pre-trained Spanish model available from MFA. Would you recommend training a new one on my specific data?
_notedur: Should the MIDI notes be estimated over phonemes, syllables, or words?
For now, I have estimated one note for each phoneme and assumed ph_dur == note_dur.
_phnum: Is this the number of phonemes in each word or in each syllable?
For now, I have assumed it is the number of phonemes in each word.
_noteseq: Do you think SOME would suffice for a first pass at this? My guess is yes.
_isslur: How would you define a slur in this context? I have not found many resources on this topic.
For now, I have assumed no slurs at all.
SPs and APs: Would you recommend annotating these manually, or would the enhance script be OK for a first pass?
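To make the current assumptions concrete, here is a minimal sketch of how one utterance would be laid out under them: phoneme-level sequence, phoneme count per word, one note per phoneme with the same duration, and no slurs. The words, phonemes, and durations are made up, the `build_row` helper is hypothetical, and the dictionary keys are only shorthand for the fields listed above, not necessarily the exact column names the tools expect.

```python
# Hypothetical toy alignment: (word, [(phoneme, duration_in_seconds), ...]).
# All values are invented for illustration; real ones would come from the aligner.
alignment = [
    ("casa", [("k", 0.08), ("a", 0.22), ("s", 0.10), ("a", 0.30)]),
    ("azul", [("a", 0.18), ("s", 0.09), ("u", 0.25), ("l", 0.12)]),
]


def build_row(alignment):
    """Flatten a word-level alignment under the assumptions stated above."""
    ph_seq, ph_dur, ph_num = [], [], []
    for _word, phones in alignment:
        # Assumption: the phoneme count is taken per word, not per syllable.
        ph_num.append(len(phones))
        for phoneme, duration in phones:
            ph_seq.append(phoneme)
            ph_dur.append(duration)
    return {
        "ph_seq": ph_seq,
        "ph_dur": ph_dur,
        "ph_num": ph_num,
        # Assumption: one note per phoneme, so note durations copy phoneme durations.
        "note_dur": list(ph_dur),
        # Assumption: no slurs anywhere.
        "note_slur": [0] * len(ph_seq),
    }


row = build_row(alignment)
for key, value in row.items():
    print(key, value)
```

The note pitches are deliberately left out here, since they would come from a separate estimation step.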
Thanks!