Hello. Are pretrained models released to generate text to speech?

FurkanGozukara commented 1 year ago

Could you inform me about this? Thank you.

phamkhactu commented 1 year ago

@FurkanGozukara

Do you have any information?

TakashiShibuyaSony commented 1 year ago

Thank you for your interest in our work, and sorry for my late reply. We're thinking of releasing a pretrained checkpoint, but we're now conducting additional experiments. For example, we're investigating whether we can get better results by training a model for more than 1M steps. We will make a decision after we finish them (maybe in one or two months). Sorry for keeping you waiting.

FurkanGozukara commented 1 year ago

Thank you. Without a pretrained model this repo is unfortunately useless :/ If you decide to release a model, please also make a simple Gradio interface.

TakashiShibuyaSony commented 1 year ago

We've just released pretrained models. We'd be glad if this would be useful. Thanks!

FurkanGozukara commented 1 year ago

Thank you so much, amazing!

Can we fine-tune it on a voice so that it generates that new voice? How many steps would it take, and how hard would it be?

TakashiShibuyaSony commented 1 year ago

@FurkanGozukara We've never fine-tuned a pretrained BigVSAN model, so I have no idea how many steps are necessary. The following is just my guess.

BigVGAN, on which our BigVSAN is built, was proposed as a universal vocoder. This means it can generate voices of unseen speakers quite well without fine-tuning. In fact, our evaluation shows that BigVSAN can also generate voices of unseen speakers very well without fine-tuning. You can check that on our demo page, where every voice provided is from an unseen speaker. But, of course, fine-tuning will enable a model to generate higher-fidelity speech of a specific speaker. How many steps are required depends on the size of your dataset, but 0.1M steps may be sufficient because a pretrained model can already generate voices of unseen speakers quite well.

Best regards.
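
Since the step count depends on dataset size, it can help to translate a step budget into passes over your own data. The following is a rough sketch, not something from the BigVSAN repository: the function name, the batch size of 32, and the clip count are all hypothetical, and it assumes one training segment per clip per epoch with the incomplete final batch dropped.

```python
def steps_to_epochs(total_steps: int, num_utterances: int, batch_size: int) -> float:
    """Approximate number of passes over the dataset for a given step budget.

    Assumptions (for illustration only): one training segment is drawn per
    utterance per epoch, and an incomplete final batch is dropped.
    """
    if batch_size <= 0 or num_utterances < batch_size:
        raise ValueError("need at least one full batch of utterances")
    steps_per_epoch = num_utterances // batch_size
    return total_steps / steps_per_epoch


if __name__ == "__main__":
    # Hypothetical numbers: 3,200 clips, batch size 32, 100k fine-tuning steps.
    print(steps_to_epochs(100_000, 3_200, 32))  # 1000.0
```

With a small single-speaker dataset, 100k steps means many passes over the same clips, so listening to intermediate checkpoints is a reasonable way to decide when to stop.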

FurkanGozukara commented 1 year ago

OK, 100k steps sounds great.

Is there any tutorial or documentation about how to fine-tune and prepare training data?

My aim is to obtain ElevenLabs quality with fine-tuning.

TakashiShibuyaSony commented 1 year ago

1. Put the pretrained model files g_10000000 and do_10000000 at exp/bigvsan.
2. Create a directory for your data, e.g. YourData.
3. Make filename lists imitating LibriTTS/train-full.txt and LibriTTS/val-full.txt. If you don't have a validation split, making only a training file list and copying it is OK.
4. Place the filename lists at your created directory YourData.

5. Modify the following lines https://github.com/sony/bigvsan/blob/ea179e833d6cdb9ef599e2868df9a18ce0b27443/train.py#L399-L400 as follows:

parser.add_argument('--list_input_unseen_wavs_dir', default=[])
parser.add_argument('--list_input_unseen_validation_file', default=[])

6. Run the following script (almost the same as https://github.com/sony/bigvsan#training):

python train.py \
--config configs/bigvsan_24khz_100band.json \
--input_wavs_dir YourData \
--input_training_file YourData/train-full.txt \
--input_validation_file YourData/val-full.txt \
--checkpoint_path exp/bigvsan

If you don't have a validation split, adding the --debug True option makes training faster.

You will get a fine-tuned generator g_10100000 (and do_10100000) after 100k steps.

Please note that this implementation is being provided basically for research purposes. There should be room for improvement in the implementation from the perspective of practitioners. Anyway, thank you for your interest in our work!
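
The filename-list steps above could be scripted roughly as follows. This is only a sketch under assumptions: it assumes each line of the list is a path relative to the data directory with the .wav extension removed, which should be checked against LibriTTS/train-full.txt in the repository before use. The function name build_filelists and its parameters are hypothetical and not part of BigVSAN.

```python
import random
from pathlib import Path


def build_filelists(data_dir, val_fraction=0.05, strip_extension=True, seed=0):
    """Write train-full.txt and val-full.txt into data_dir.

    ASSUMPTION: one entry per line, relative to data_dir, without the
    ".wav" suffix. Set strip_extension=False if the repo's own lists
    keep the suffix.
    """
    root = Path(data_dir)
    entries = []
    for wav in sorted(root.rglob("*.wav")):
        rel = wav.relative_to(root)
        if strip_extension:
            rel = rel.with_suffix("")
        entries.append(rel.as_posix())

    # Deterministic shuffle so the split is reproducible.
    random.Random(seed).shuffle(entries)
    n_val = int(len(entries) * val_fraction)
    val, train = entries[:n_val], entries[n_val:]
    if not val:
        # No validation split: reuse the training list, as suggested above.
        val = list(train)

    (root / "train-full.txt").write_text("\n".join(train) + "\n", encoding="utf-8")
    (root / "val-full.txt").write_text("\n".join(val) + "\n", encoding="utf-8")
    return train, val
```

Calling build_filelists("YourData") would then produce the two list files expected by the --input_training_file and --input_validation_file options shown in the command above.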

FurkanGozukara commented 1 year ago

Thank you so much. Here are a few questions.

The YourData folder will contain wav files. Is there any required format for them? For example, should they be between 5 and 15 seconds long? Anything else?

I assume the list imitating LibriTTS/train-full.txt will contain the names of the files inside the YourData folder and nothing else,

just like

a (1).wav a (2).wav a (3).wav a (4).wav

Do we need transcriptions of the wav files, or just the speech files with no text file? Thank you so much for your answers.

By the way, I just noticed something. Your model can't generate voice from text, right?
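
On the wav format question, a quick check of the clips before training may save time. The sketch below uses only Python's standard wave module to list each file's sample rate, channel count, and duration. The 24000 Hz default is an assumption inferred from the config name bigvsan_24khz_100band.json, and inspect_wavs is a hypothetical helper, not part of the repository.

```python
import wave
from pathlib import Path


def inspect_wavs(data_dir, expected_rate=24000):
    """Return (name, sample_rate, channels, seconds, rate_matches) per file.

    ASSUMPTION: expected_rate=24000 is inferred from the config file name.
    The standard wave module reads PCM WAV only; other encodings raise
    wave.Error.
    """
    report = []
    for path in sorted(Path(data_dir).rglob("*.wav")):
        with wave.open(str(path), "rb") as wf:
            rate = wf.getframerate()
            channels = wf.getnchannels()
            seconds = wf.getnframes() / rate
        report.append((path.name, rate, channels, round(seconds, 2), rate == expected_rate))
    return report
```

Files whose last field is False would need resampling before they match the assumed rate.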

ex3ndr commented 8 months ago

Sorry, but where are the pretrained weights?

TakashiShibuyaSony commented 8 months ago

@ex3ndr Thank you for your interest!

The details are here, https://github.com/sony/bigvsan?tab=readme-ov-file#pretrained-models. You can download a pretrained model here, https://zenodo.org/records/10037439.
