Closed FurkanGozukara closed 1 year ago
@FurkanGozukara
Do you have any information?
Thank you for having an interest in our work! And, sorry for my late reply. We're thinking of releasing a pretrained checkpoint, but we're now conducting additional experiments. For example, we're investigating whether we can get better results by training a model for more than 1M steps. We will make a decision after we finish them (maybe in one or two months). Sorry for keeping you waiting.
Thank you. Without a pretrained model this repo is unfortunately useless :/ If you do decide to release a model, please also make a simple Gradio interface.
We've just released pretrained models. We'd be glad if they are useful to you. Thanks!
Thank you so much, amazing!
Can we fine-tune it on a new voice to generate that voice? How many steps would it take, and how hard would it be?
@FurkanGozukara We've never fine-tuned a pretrained BigVSAN model, so I have no idea how many steps are necessary. The following is just my guess.
BigVGAN, which our BigVSAN is built on, was proposed as a universal vocoder. This means it can generate voices of unseen speakers quite well without fine-tuning. Actually, our evaluation shows BigVSAN can also generate voices of unseen speakers very well without fine-tuning. You can check that on our demo page, where every voice provided is from an unseen speaker. But, of course, fine-tuning will enable a model to generate higher-fidelity speech for a specific speaker. How many steps are required depends on the size of your dataset, but 0.1M may be sufficient because a pretrained model can already generate voices of unseen speakers quite well.
Best regards.
OK, 100k steps sounds great.
Is there any tutorial or documentation on how to fine-tune and prepare training data?
My aim is to obtain ElevenLabs quality with fine-tuning.
1. Put the pretrained model files `g_10000000` and `do_10000000` at `exp/bigvsan`.
2. Create a directory for your data, e.g. `YourData`.
3. Make filename lists imitating `LibriTTS/train-full.txt` and `LibriTTS/val-full.txt` (see the sketch after this list for one possible way to generate them). If you don't have a validation split, making only a training file list and copying it is OK.
4. Place the filename lists in the directory you created, `YourData`.
5. Modify the following lines https://github.com/sony/bigvsan/blob/ea179e833d6cdb9ef599e2868df9a18ce0b27443/train.py#L399-L400 as follows:
   ```python
   parser.add_argument('--list_input_unseen_wavs_dir', default=[])
   parser.add_argument('--list_input_unseen_validation_file', default=[])
   ```
6. Run the following script (almost the same as https://github.com/sony/bigvsan#training):
   ```bash
   python train.py \
     --config configs/bigvsan_24khz_100band.json \
     --input_wavs_dir YourData \
     --input_training_file YourData/train-full.txt \
     --input_validation_file YourData/val-full.txt \
     --checkpoint_path exp/bigvsan
   ```
   If you don't have a validation split, adding the `--debug True` option makes training faster.

You will get a fine-tuned generator `g_10100000` (and `do_10100000`) after 100k steps.
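A minimal sketch of one way to generate the filename lists for step 3. It assumes each line is simply a wav filename relative to `--input_wavs_dir`; please check `LibriTTS/train-full.txt` in the repo for the exact format `train.py` expects and adjust accordingly:

```python
import os

wav_dir = 'YourData'  # data directory from step 2
names = sorted(f for f in os.listdir(wav_dir) if f.endswith('.wav'))

# One filename per line; verify against LibriTTS/train-full.txt before training.
with open(os.path.join(wav_dir, 'train-full.txt'), 'w') as f:
    f.write('\n'.join(names) + '\n')

# No validation split: copy the training list, as noted in step 3.
with open(os.path.join(wav_dir, 'val-full.txt'), 'w') as f:
    f.write('\n'.join(names) + '\n')
```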
Please note that this implementation is provided primarily for research purposes; there should be room for improvement from a practitioner's perspective. Anyway, thank you for your interest in our work!
> 3. `LibriTTS/train-full.txt`
Thank you so much. Here are a few questions:
In the `YourData` folder there will be wav files. Is there any required format for them? For example, should they be between 5 and 15 seconds, etc.? Anything else?
I assume `LibriTTS/train-full.txt` will have the names of the files inside the `YourData` folder and nothing else,
just like:
```
a (1).wav
a (2).wav
a (3).wav
a (4).wav
```
Do we need transcriptions of the wav files, or just the speech files with no text file? Thank you so much for the answers.
By the way, I just noticed something. Your model can't generate voice from text, right?
Sorry, but where are the pretrained weights?
@ex3ndr Thank you for your interest!
The details are here: https://github.com/sony/bigvsan?tab=readme-ov-file#pretrained-models. You can download a pretrained model here: https://zenodo.org/records/10037439.
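If you then want to follow the fine-tuning steps above, a minimal sketch of placing the downloaded checkpoint files at `exp/bigvsan` (the source path is hypothetical; adjust it to wherever you saved the Zenodo files):

```python
import os
import shutil

src_dir = os.path.expanduser('~/Downloads')  # hypothetical download location
dst_dir = 'exp/bigvsan'                      # checkpoint path used by the fine-tuning recipe above
os.makedirs(dst_dir, exist_ok=True)

# Copy both checkpoint files so train.py resumes from the 1M-step weights.
for name in ('g_10000000', 'do_10000000'):
    shutil.copy(os.path.join(src_dir, name), os.path.join(dst_dir, name))
```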
Could you inform me about this? Thank you.