blueyred closed this 11 months ago
A word can contain multiple notes (all except the first one are slurs), but a note cannot cover more than one word.
If note_slur is all 0, there are no slurs, so the number of notes should equal the number of words.
If note_slur is all 1, the input is invalid, because a slur note cannot be the first note of a word.
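To make the two rules above concrete, here is a minimal sketch of a pre-flight check you could run before inference. The function name find_note_slur_problems is hypothetical and not part of the project; it assumes ph_num and note_slur have already been parsed into lists of integers, and it only encodes the constraints described in this reply.

```python
from typing import List


def find_note_slur_problems(ph_num: List[int], note_slur: List[int]) -> List[str]:
    """Return a list of problems; an empty list means the inputs look consistent.

    Hypothetical helper, not part of the project's code.
    """
    if not note_slur:
        return ["note_slur is empty"]

    problems = []
    if any(flag not in (0, 1) for flag in note_slur):
        problems.append("note_slur must contain only 0 and 1")

    # A slur continues the word of the previous note, so the first note cannot be one.
    if note_slur[0] == 1:
        problems.append("the first note cannot be a slur")

    # Every non-slur note starts a new word, and a note cannot span several words,
    # so the number of non-slur notes should equal the number of words.
    n_words = len(ph_num)
    n_word_starts = sum(1 for flag in note_slur if flag == 0)
    if n_word_starts != n_words:
        problems.append(f"{n_word_starts} non-slur notes but {n_words} words in ph_num")

    return problems
```

For example, find_note_slur_problems([2, 3], [0, 1, 0]) returns an empty list (two words, the first carrying one slur), while an all-ones note_slur is reported as invalid.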
Also, is there a way to run phoneme timing predictions that aren’t based on note lengths and note slurs? Could we pass in the total sum of the note lengths and let the model make a timing prediction for the full phrase across the full length of notes?
This sounds more like what TTS systems do. Or is this called freestyle in singing?
SVS systems are based on music scores, and music scores should contain notes with fixed lengths. But if you want to try out what you say, why not consider the whole sentence as one single word? Phonemes can be divided into words freely via ph_num.
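As a rough sketch of the single-word idea, assuming ph_num and note_slur are plain integer lists: all phonemes are merged into one word, and every note after the first is marked as a slur so that all notes belong to that word. The helper collapse_to_single_word is hypothetical, and the real input files may store these fields in another form (for example as strings), so adapt as needed. Whether the resulting timing is musically useful is untested here.

```python
from typing import Dict, List


def collapse_to_single_word(ph_num: List[int], n_notes: int) -> Dict[str, List[int]]:
    """Treat the whole phrase as one word (hypothetical helper).

    ph_num:  phoneme counts per word in the original phrase
    n_notes: number of notes in the phrase
    """
    if n_notes < 1:
        raise ValueError("need at least one note")
    return {
        # One word that holds every phoneme of the phrase.
        "ph_num": [sum(ph_num)],
        # The first note starts the word; all later notes are slurs within it.
        "note_slur": [0] + [1] * (n_notes - 1),
    }
```

For instance, collapse_to_single_word([2, 3, 1], 4) gives ph_num [6] and note_slur [0, 1, 1, 1], which still satisfies the rule that the first note of a word is never a slur.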
I definitely had the wrong idea about how to use note_slur, thanks for clearing it up, it makes a lot more sense now. Ah, that's a great solution for freestyle! I will give it a go, thanks.
Hey there! I’m having a little issue with phoneme duration predictions and note slurs.
When I run variance inference, if “note_slur” is all zeros or all ones, the process crashes. It’s like hitting a wrong note! 🎵🎶
It doesn’t always happen, only when ph_num and note_slur have certain sizes (when note_slur is longer than ph_num, I think).
Here’s an example that crashes for me:
The error message is:
In this file - inference/ds_variance.py
Maybe I’m missing something? Like the number of notes should match the length of ph_num? 🤔
Also, is there a way to run phoneme timing predictions that aren’t based on note lengths and note slurs? Could we pass in the total sum of the note lengths and let the model make a timing prediction for the full phrase across the full length of notes?
Thanks for all your hard work on this!