Phoneme duration predictions / note slur issue

blueyred commented 12 months ago

Hey there! I’m having a little issue with phoneme duration predictions and note slurs.
When I run variance inference, if “note_slur” is all zeros or all ones, the process crashes. It’s like hitting a wrong note! 🎵🎶

It doesn’t always happen, only when ph_num and note_slur have certain sizes (when note_slur has more entries than ph_num, I think).
Here’s an example that crashes for me:

[
  {
    "offset": 0.32,
    "text": "remember when you were young",
    "ph_seq": "AP r ih m eh m b er w eh n y uw w er y ah ng AP",
    "ph_dur": "",
    "ph_num": "3 2 3 2 3 2 2 2",
    "note_seq": "C#4 D4 C#4 C4 D4 C4 D4 C4 D4 C4",
    "note_dur": "1.75 1.3 1.66 1.15 1.82 1.75 1.82 1.75 1.82 1.75",
    "note_slur": "0 0 0 0 0 0 0 0 0 0",
    "f0_seq": "",
    "f0_timestep": ""
  }
]

The error message is:

File "/home/david/Vocal/DiffRefactor/inference/ds_variance.py", in preprocess_input
    word_dur = note_dur.new_zeros(1, T_w + 1).scatter_add(
RuntimeError: index 9 is out of bounds for dimension 1 with size 9

The failing call is in inference/ds_variance.py.

Maybe I’m missing something? Like the number of notes should match the length of ph_num? 🤔

Also, is there a way to run phoneme timing predictions that aren’t based on note lengths and note slurs? Could we pass in the total sum of the note lengths and let the model make a timing prediction for the full phrase across the full length of notes?

Thanks for all your hard work on this!

yqzhishen commented 12 months ago

A word can contain multiple notes (all except the first one are slurs), but a note cannot cover more than one word.

If note_slur is all 0, there are no slurs, so the number of notes must equal the number of words (the number of entries in ph_num).
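To see how the example above ends up with "index 9 is out of bounds", here is a rough pure-Python sketch. It is not the actual DiffSinger code. It assumes that every non-slur note opens a new word, that a note's word index is the running count of non-slur notes, and that note durations are summed into T_w + 1 slots, where T_w is the number of entries in ph_num. The function name `word_durations` is made up for illustration.

```python
def word_durations(ph_num, note_dur, note_slur):
    """Sum note durations per word. Slot 0 is a placeholder; words are 1-based."""
    slots = [0.0] * (len(ph_num) + 1)  # T_w + 1 slots (assumed layout)
    word_idx = 0
    for dur, slur in zip(note_dur, note_slur):
        if not slur:
            word_idx += 1  # a non-slur note starts the next word
        if word_idx >= len(slots):
            raise IndexError(
                f"index {word_idx} is out of bounds for size {len(slots)}"
            )
        slots[word_idx] += dur
    return slots[1:]


# Values from the crashing example: 8 words, 10 notes, no slurs.
ph_num = [3, 2, 3, 2, 3, 2, 2, 2]
note_dur = [1.75, 1.3, 1.66, 1.15, 1.82, 1.75, 1.82, 1.75, 1.82, 1.75]
note_slur = [0] * 10

try:
    word_durations(ph_num, note_dur, note_slur)
    message = "ok"
except IndexError as err:
    message = str(err)
print(message)
```

With 8 words there are 9 slots, so the ninth non-slur note asks for word index 9 and the accumulation fails, which matches the size reported in the traceback.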

note_slur cannot be all 1, because a slur note cannot be the first note of a word.
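These rules can be checked before running inference. The following is a minimal sketch, assuming the fields are the space-separated strings shown in the example above. `parse_ints` and `check_note_slur` are hypothetical helpers, not part of DiffSinger.

```python
def parse_ints(field):
    """Turn a space-separated field such as "3 2 3" into a list of ints."""
    return [int(x) for x in field.split()]


def check_note_slur(ph_num, note_slur):
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    if note_slur and note_slur[0] == 1:
        problems.append("first note is a slur, but a word cannot start with a slur")
    num_words = len(ph_num)
    # Each word has exactly one non-slur note: its first one.
    num_word_starts = sum(1 for s in note_slur if s == 0)
    if num_word_starts != num_words:
        problems.append(
            f"{num_word_starts} non-slur notes but {num_words} words in ph_num"
        )
    return problems


ph_num = parse_ints("3 2 3 2 3 2 2 2")
all_zeros = parse_ints("0 0 0 0 0 0 0 0 0 0")
print(check_note_slur(ph_num, all_zeros))
```

For the all-zeros example this reports 10 non-slur notes against 8 words. An all-ones note_slur fails both checks.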

yqzhishen commented 12 months ago

> Also, is there a way to run phoneme timing predictions that aren’t based on note lengths and note slurs? Could we pass in the total sum of the note lengths and let the model make a timing prediction for the full phrase across the full length of notes?

This sounds more like what TTS systems do. Or is this called freestyle in singing?

SVS systems are based on music scores, and music scores should contain notes with fixed lengths. But if you want to try what you describe, why not treat the whole sentence as one single word? Phonemes can be divided into words freely via ph_num.
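A rough sketch of the single-word idea, assuming a segment is a dict with the same string fields as the example in the original post. `as_single_word` is a hypothetical helper. It sets ph_num to the total phoneme count and marks every note after the first as a slur, so all notes belong to the one word. Whether the model predicts sensible timings for such input is not confirmed in this thread.

```python
def as_single_word(segment):
    """Return a copy of a segment with all phonemes grouped into one word."""
    num_phonemes = len(segment["ph_seq"].split())
    num_notes = len(segment["note_seq"].split())
    out = dict(segment)
    out["ph_num"] = str(num_phonemes)
    # The first note starts the word; every following note is a slur within it.
    out["note_slur"] = " ".join(["0"] + ["1"] * (num_notes - 1))
    return out


segment = {
    "ph_seq": "AP r ih m eh m b er w eh n y uw w er y ah ng AP",
    "ph_num": "3 2 3 2 3 2 2 2",
    "note_seq": "C#4 D4 C#4 C4 D4 C4 D4 C4 D4 C4",
    "note_dur": "1.75 1.3 1.66 1.15 1.82 1.75 1.82 1.75 1.82 1.75",
    "note_slur": "0 0 0 0 0 0 0 0 0 0",
}
single = as_single_word(segment)
print(single["ph_num"])
print(single["note_slur"])
```

An alternative under the same rules would be to replace the ten notes with a single note whose duration is the sum of note_dur, but that discards the pitch sequence.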

blueyred commented 12 months ago

I definitely had the wrong idea about how to use note_slur. Thanks for clearing it up, it makes a lot more sense now. And that's a great solution for freestyle! I will give it a go, thanks.

openvpi / DiffSinger

Phoneme duration predictions / note slur issue #112