openvpi / SOME

SOME: Singing-Oriented MIDI Extractor.
MIT License

transcription #5

Closed: dutchsing009 closed this issue 3 months ago

dutchsing009 commented 10 months ago

Can you please give an example of a transcriptions.csv file with name, ph_seq, ph_dur and ph_num in it?

I want to see a reference file.

yqzhishen commented 10 months ago

If you have ever made DiffSinger datasets, you should be familiar with transcriptions.csv. If you haven't done that before and want to learn the details, see https://github.com/openvpi/MakeDiffSinger. There is also a link to this SOME repository in https://github.com/openvpi/MakeDiffSinger/tree/main/variance-temp-solution, and you will understand everything once you reach that step.
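
For illustration, a minimal transcriptions.csv with those columns might be written like this; the item name, phonemes, durations and grouping below are made up, and the exact grouping convention for ph_num should be taken from the MakeDiffSinger documentation rather than from this sketch.

```python
# Illustrative sketch only: write a tiny transcriptions.csv with made-up values,
# assuming the column names discussed above (name, ph_seq, ph_dur, ph_num).
import csv

rows = [
    {
        "name": "item_0001",                              # recording name, no extension
        "ph_seq": "SP g e SP m ang SP",                   # space-separated phoneme sequence
        "ph_dur": "0.12 0.08 0.35 0.06 0.10 0.42 0.15",   # one duration (seconds) per phoneme
        "ph_num": "1 2 1 2 1",                            # phonemes per group; sums to the phoneme count
    },
]

with open("transcriptions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "ph_seq", "ph_dur", "ph_num"])
    writer.writeheader()
    writer.writerows(rows)
```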

dutchsing009 commented 10 months ago

Does this variance-temp-solution link work for English or French datasets?

Ok, thanks. So if I understand this correctly: if I have ph_seq, ph_dur and ph_num, I can use SOME to get the MIDI sequence and MIDI duration sequence? If yes, I have two questions:

1. How can I obtain those three: ph_seq, ph_dur and ph_num? I saw two tools, but I'm not sure whether they produce all three: https://github.com/wolfgitpr/LyricFA and https://github.com/Anjiurine/fast-phasr-next. Is there any other tool that will automatically generate the phoneme sequence, phoneme duration sequence and phoneme num for me?

2. How accurate will the generated MIDI sequence and MIDI duration sequence be? Like 100%? (I'm asking because if it isn't 100%, I think the model will hallucinate during SVS inference.)

yqzhishen commented 10 months ago
1. ph_seq and ph_dur should already be available once you have finished making your DiffSinger acoustic dataset; many tools and pipelines can produce them. But as far as I know, ph_num can only be obtained by the method described in the MakeDiffSinger repository, and unfortunately there is no proper method of automatic ph_num inference for polysyllabic languages like English and French yet. However, I already have an idea for this, as described in https://github.com/openvpi/MakeDiffSinger/issues/11. If you have suggestions, you can comment on that issue.
2. The pretrained model of SOME is trained on pure Chinese datasets. Though SOME itself is language-independent, it may not produce results as good as on its "native" language. But we do benefit from it in reducing the time cost of manual MIDI labeling, because of its ability to recognize slur notes and generate cent-level MIDI values.
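
As a quick sanity check on the relationships implied above (ph_dur has one value per phoneme, and ph_num must sum to the number of phonemes), a small sketch along these lines can be used; the column names follow the transcriptions.csv example earlier in the thread.

```python
# Sketch: verify that ph_seq, ph_dur and ph_num are mutually consistent for
# each row of a transcriptions.csv (column names as assumed above).
import csv

with open("transcriptions.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        ph_seq = row["ph_seq"].split()
        ph_dur = [float(x) for x in row["ph_dur"].split()]
        ph_num = [int(x) for x in row["ph_num"].split()]

        assert len(ph_dur) == len(ph_seq), f"{row['name']}: ph_dur length mismatch"
        assert sum(ph_num) == len(ph_seq), f"{row['name']}: ph_num does not sum to the phoneme count"
```
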
dutchsing009 commented 9 months ago

Does this help? https://github.com/colstone/ENG_dur_num

yqzhishen commented 9 months ago

Yes, this can help to some degree. But I doubt whether simply specifying all the vowels is enough and proper for polysyllabic languages. A more detailed discussion was raised here: https://github.com/openvpi/MakeDiffSinger/discussions/12
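
For context, the vowel-based heuristic under discussion roughly means: walk the phoneme sequence and close a group at every vowel, so each group holds one vowel plus the consonants leading into it, while SP/AP stand alone. The sketch below is an assumed illustration of that idea, not the actual ENG_dur_num implementation, and the vowel list is only a sample; it shows how a two-syllable word ends up split into two groups, which is the behaviour being questioned for polysyllabic languages.

```python
# Rough sketch of a vowel-based ph_num heuristic (an assumption for illustration,
# not the actual ENG_dur_num logic): each vowel closes the current group, and
# SP/AP breaths stand alone as their own groups.
VOWELS = {"aa", "ae", "ah", "ao", "eh", "er", "ih", "iy", "ow", "uh", "uw"}  # illustrative subset
BREATHS = {"SP", "AP"}

def vowel_based_ph_num(ph_seq):
    groups, current = [], 0
    for ph in ph_seq:
        if ph in BREATHS:
            if current:
                groups.append(current)   # flush any dangling consonants
                current = 0
            groups.append(1)             # SP/AP stand alone
        elif ph in VOWELS:
            groups.append(current + 1)   # the vowel closes the current group
            current = 0
        else:
            current += 1                 # consonant: keep accumulating
    if current:
        groups.append(current)           # trailing consonants left over
    return groups

# "happy" -> hh ae p iy: two vowels, so the single word becomes two groups.
print(vowel_based_ph_num(["SP", "hh", "ae", "p", "iy", "SP"]))  # [1, 2, 2, 1]
```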