lifeiteng / vall-e

PyTorch implementation of VALL-E(Zero-Shot Text-To-Speech), Reproduced Demo https://lifeiteng.github.io/valle/index.html
Apache License 2.0

Audio missing at the beginning during inference #186

Closed tangjjbetsy closed 3 months ago

tangjjbetsy commented 3 months ago

Hi, I've adapted the model for music midi synthesis but have encountered some issues during inference. The training prefix mode is 1. The inputs are:

midi_prompt (similar to text_prompt, 3s) 
midi (text, for synthesising) 
audio_prompt (3s)
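To make the setup concrete, here is a minimal sketch of how the three inputs above would be assembled for a VALL-E-style AR inference pass. The helper name and token representations are hypothetical, not from this repo; the point is only that the semantic prompt and target are concatenated on the text side, while the short acoustic prompt seeds the decoder, which is then expected to continue past it:

```python
# Hedged sketch (hypothetical helper and token formats): assembling the
# three inputs described above for a prefix-style AR inference pass.

def build_inference_inputs(midi_prompt, midi, audio_prompt_codes):
    """Concatenate the semantic prompt with the target on the text side;
    the acoustic prompt codes seed the AR decoder as a prefix."""
    text_side = midi_prompt + midi            # full semantic sequence
    prompt_len = len(audio_prompt_codes)      # frames the decoder is seeded with
    return text_side, audio_prompt_codes, prompt_len

# Toy example with placeholder tokens:
text_side, seed, prompt_len = build_inference_inputs(
    ["<p1>", "<p2>"], ["<n1>", "<n2>", "<n3>"], [101, 102, 103])
print(prompt_len)  # 3
```

Under this arrangement the decoder should generate audio for the whole target midi, continuing seamlessly from the acoustic prompt rather than skipping the prompt's duration.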

The synthesised audio always has 3s missing at the beginning. If I input a longer audio_prompt, the same amount is missing at the beginning, but the generated length is shorter.

For example, if I input a 3s audio prompt with the midi, the AR decoder will output:

VALL-E EOS [225 -> 975]

The resulting audio starts from the 3s mark of the midi instead of the beginning. If I input a longer audio prompt, the 975 stays the same and the 225 may change to 500 or 600, while the synthesised audio still starts from the 3s mark of the midi but ends earlier.
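The logged numbers line up with the prompt length if we assume the repo's default EnCodec tokenizer at 24 kHz, which emits 75 codec frames per second (an assumption worth double-checking against your config). A quick sanity check:

```python
# Hedged sketch: relating the AR decoder's "[225 -> 975]" log to seconds,
# ASSUMING EnCodec at 24 kHz, i.e. 75 codec frames per second.
FRAMES_PER_SECOND = 75

def seconds_to_frames(seconds: float) -> int:
    return round(seconds * FRAMES_PER_SECOND)

def frames_to_seconds(frames: int) -> float:
    return frames / FRAMES_PER_SECOND

prompt_frames = seconds_to_frames(3.0)  # 225 -> matches the logged start index
print(prompt_frames)                     # 225
print(frames_to_seconds(975 - 225))      # 10.0 s of newly generated audio
```

So "225" is exactly the 3s acoustic prompt in codec frames, which is consistent with the generation starting where the prompt ends rather than at the beginning of the target midi.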

I am wondering whether this issue is caused by the prompts I prepared, or whether there is a problem in the inference code?

lastapple commented 1 month ago

Did you ever find out what the problem was?

tangjjbetsy commented 1 month ago

Yes, the problem was caused by my tokenisation and segmentation of the midi files. A time-dependent feature, bar, was used to represent the position of each note, which made the model learn to stop generating as the bar value increased.
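For readers hitting a similar issue: one way to remove that time-dependent cue is to re-index bar numbers so every training segment starts at bar 0. The token format below (bar, position, pitch triples) and the helper name are hypothetical illustrations, not the actual tokenizer from this thread:

```python
# Hedged sketch of the kind of fix described above, with a HYPOTHETICAL
# token format: each note is a (bar, position, pitch) triple. With absolute
# bar numbers, a prompt covering bars 4-5 teaches the model that generation
# begins at bar 6 and should end soon after; re-basing removes that cue.

def rebase_bars(notes):
    """Re-index bar numbers so the segment starts at bar 0,
    keeping only relative (time-invariant) positions."""
    if not notes:
        return []
    first_bar = notes[0][0]
    return [(bar - first_bar, pos, pitch) for bar, pos, pitch in notes]

segment = [(4, 0, 60), (4, 8, 64), (5, 0, 67)]
print(rebase_bars(segment))  # [(0, 0, 60), (0, 8, 64), (1, 0, 67)]
```

With relative bar indices, segments cut from different points of a piece look statistically alike, so the model can no longer correlate "large bar value" with "stop".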