lifeiteng / vall-e

PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech), Reproduced Demo https://lifeiteng.github.io/valle/index.html
Apache License 2.0
1.99k stars 320 forks

When training, how to cut audio prompt? #143

Closed sjoon2455 closed 1 year ago

sjoon2455 commented 1 year ago

Hi, I have a question about the training stage.

As I understand it, VALL-E cuts about 3 seconds from a given audio clip (if possible) and uses that segment as the audio prompt (with its aligned phonemes).

What if a given audio is less than 3 seconds long?

christallire commented 1 year ago

PAD

sjoon2455 commented 1 year ago

@christallire Thanks, do you know where in the code that happens?

keshawnhsieh commented 1 year ago

For the AR stage, no audio prompt is needed. The speech part is treated as an autoregressive task; see the paper's description of this design:

In the AR model, we do not explicitly extract an audio clip as the prompt in training. The training process is pure causal language model training. In this way, any prefix sequence c_{<t,1} is treated as a prompt for the latter part of the sequence c_{≥t,1}.
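To make the "any prefix is a prompt" idea concrete, here is a minimal sketch of how causal language model training pairs every prefix of the first-codebook tokens with the next token as its target. The function name `next_token_pairs` and the token values are made up for illustration; the sketch leaves out the phoneme conditioning and any BOS/EOS handling, so it is not the repository's actual training code.

```python
from typing import List, Tuple


def next_token_pairs(codes: List[int]) -> List[Tuple[List[int], int]]:
    """Hypothetical helper: pair each prefix codes[:t] with the target codes[t]."""
    # Every prefix acts as the "prompt" for the token that follows it,
    # so no separate prompt clip has to be cut out of the utterance.
    return [(codes[:t], codes[t]) for t in range(1, len(codes))]


# Made-up first-codebook token ids for one utterance.
example_codes = [12, 7, 7, 31, 5]
pairs = next_token_pairs(example_codes)
```

With a sequence of 5 tokens this gives 4 (prefix, target) pairs, which is why a short utterance needs no special prompt handling at this stage.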

For the NAR stage, if you use the default setup with prefix mode == 1, the audio prompt is selected from two candidates. One is an audio clip whose length is sampled from the range of 1/4 to 1/2 of the original audio length. The other is a 3-second audio clip taken from the start. The shorter one is used. You can refer to the code here: https://github.com/lifeiteng/vall-e/blob/e38ac9d816412317222b7f0f59198b9124377da2/valle/models/valle.py#L348-L350
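The selection rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code at the link: `pick_prefix_len` is a hypothetical name, lengths are counted in codec frames, and the rate of 75 frames per second (24 kHz audio with a 320-sample hop) is assumed. The linked code may differ in details, for example in how lengths are computed across a batch.

```python
import random

FRAMES_PER_SECOND = 75  # assumption: 24 kHz audio with a 320-sample hop
MAX_PREFIX_FRAMES = 3 * FRAMES_PER_SECOND  # 225 frames, about 3 seconds


def pick_prefix_len(num_frames: int, rng: random.Random) -> int:
    """Hypothetical helper: number of leading frames used as the audio prompt."""
    low = num_frames // 4                  # 1/4 of the utterance
    high = max(low + 1, num_frames // 2)   # 1/2 of the utterance (exclusive bound)
    sampled = rng.randrange(low, high)     # candidate 1: random length in [low, high)
    return min(sampled, MAX_PREFIX_FRAMES)  # candidate 2: 3-second cap; keep the shorter


rng = random.Random(0)
long_prefix = pick_prefix_len(1500, rng)   # 20-second utterance
short_prefix = pick_prefix_len(150, rng)   # 2-second utterance
```

Under these assumptions, an utterance shorter than 3 seconds does not need padding for the prompt: the sampled length is at most half the utterance, so it always stays below the 3-second cap and is used directly.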

sjoon2455 commented 1 year ago

@keshawnhsieh Thank you so much, I really appreciate your explanation.