openvpi / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability and flexibility based on DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
Apache License 2.0

Clarifications on annotation #211

Open MikeMpapa opened 1 month ago

MikeMpapa commented 1 month ago

Hi there, I am working on building a new dataset in Spanish (a polysyllabic language). I have gone through MakeDiffSinger but I still have some gaps. I would be grateful if you could sanity-check my understanding and share any thoughts you might have.

Questions for clarification:

  1. _phseq: Are these sequences of phonemes or of syllables? I am currently using phonemes and their timestamps as provided by MFA, with a pre-trained Spanish model available from MFA. Would you recommend training a new one on my specific data?

  2. _notedur: Should the MIDI notes be estimated over phonemes, syllables, or words? For now I estimate one note per phoneme and assume ph_dur == note_dur.

  3. _phnum: Is this the number of phonemes in each word or in each syllable? For now I assume it is the number of phonemes in each word.

  4. _noteseq: Do you think SOME would suffice for a first shot at this? I would guess yes.

  5. _isslur: How would you define a slur in this context? I have not found many resources on this topic. For now I assume there are no slurs at all.

  6. SPs and APs: Would you recommend annotating these manually, or might the enhance script be OK for a first shot?
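
To make the assumptions in questions 1, 2, 3 and 5 concrete, here is a minimal sketch of how the fields are currently derived. The alignment data, the `build_fields` helper, and the dictionary keys are hypothetical and only mirror the field names asked about above; this is not MakeDiffSinger code, and whether these assumptions are right is exactly what the questions are about.

```python
# Hypothetical MFA-style alignment for the phrase "la casa":
# one list per word, each entry is (phoneme, start_sec, end_sec).
words = [
    [("l", 0.00, 0.06), ("a", 0.06, 0.25)],
    [("k", 0.25, 0.33), ("a", 0.33, 0.55), ("s", 0.55, 0.67), ("a", 0.67, 1.00)],
]


def build_fields(words):
    """Derive annotation fields under the assumptions stated in the questions."""
    ph_seq, ph_dur, ph_num = [], [], []
    for word in words:
        # Assumption (question 3): ph_num counts phonemes per word.
        ph_num.append(len(word))
        for phoneme, start, end in word:
            # Assumption (question 1): the sequence holds phonemes, not syllables.
            ph_seq.append(phoneme)
            ph_dur.append(round(end - start, 3))
    # Assumption (question 2): one note per phoneme, so note_dur equals ph_dur.
    note_dur = list(ph_dur)
    # Assumption (question 5): no slurs anywhere.
    is_slur = [0] * len(note_dur)
    return {
        "ph_seq": ph_seq,
        "ph_dur": ph_dur,
        "ph_num": ph_num,
        "note_dur": note_dur,
        "is_slur": is_slur,
    }


fields = build_fields(words)
for name, values in fields.items():
    print(name, values)
```

With these assumptions, the entries of `ph_num` always sum to the length of `ph_seq`, and `note_dur` has one entry per phoneme.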

Thanks!