gitmylo / bark-voice-cloning-HuBERT-quantizer

The code for the bark voice cloning model. Training and inference.
MIT License

Support for Hindi language #13

Open abhiprojectz opened 1 year ago

abhiprojectz commented 1 year ago

@gitmylo Hello, I am currently trying to train the quantizer on a Hindi dataset.

I need to know how much time it would take to train on a P100 GPU, and also when I should stop the training,

given that I have a dataset of approximately 7000 wavs and semantic files.

I also need to clarify whether the HuBERT base model will work well for the Hindi language.

gitmylo commented 1 year ago

With a dataset of 3000 files, 20 minutes on my RTX 3060 had good results; I then trained it for an hour or so more. You can interrupt training at any point and check your latest model to see how well it performs.

abhiprojectz commented 1 year ago

@gitmylo Thanks, I have trained for 15 epochs and am planning to do 24.

However, it took around 3 hours on a P100 GPU for just 15 epochs, and I reduced the files to around 6000.

Any suggestions to improve the cloning results and speed up training?

gitmylo commented 1 year ago

Look at it like this: if you have 3000 files and train for 24 epochs, for example, it will still be worse than 6000 files for 12 epochs. An epoch means the model has gone through every file, so having more training data makes each epoch take longer.
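The tradeoff described here can be checked with quick arithmetic (numbers taken from the thread, purely illustrative):

```python
# Both schedules show the model the same total number of examples, but
# the larger dataset covers twice as many unique files per pass, which
# tends to generalize better.

small_dataset = 3000 * 24   # 3000 files, 24 epochs
large_dataset = 6000 * 12   # 6000 files, 12 epochs

print(small_dataset)  # 72000 examples seen, only 3000 unique
print(large_dataset)  # 72000 examples seen, 6000 unique
```

Same compute budget either way; the difference is data variety, not total steps.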

Also, if and when you decide to upload your model and/or training data to Hugging Face, please send the URLs here so I can add them to the README.

abhiprojectz commented 1 year ago

@gitmylo I think the HuBERT base model doesn't support the Hindi language, because my generated audio doesn't speak what is prompted in the text; instead it produces random words and noises.

For context, I tried two models:

Model_A, trained for 23 epochs on 3700 files (after the ready stage)

Model_B, trained for 16 epochs on 7783 files

Both yield poor results. Any suggestions, please? I have already spent a lot of time on this.

(image attached)

gitmylo commented 1 year ago

do the wavs used in training sound normal though?

abhiprojectz commented 1 year ago

@gitmylo Yes, I just checked multiple wavs in the prepared folder (though some files are pure noise) and they sound fine.

Can you suggest, from your experience, what I should do?

If you can help, I may train for other languages too.

gitmylo commented 1 year ago

Maybe there's a Hindi HuBERT model somewhere; you could try loading it.

abhiprojectz commented 1 year ago

@gitmylo I could not find any; I searched a lot. It would be nice if you could provide a link to one.

P.S. A point to note: resources such as guides and pretrained models for the Hindi language are very rare.

An update: Model_A crossed 32 epochs, with losses as:

(image: training loss values)

abhiprojectz commented 1 year ago

@gitmylo I assume the problem is that the HuBERT base model doesn't support Hindi. I checked the generated semantic_prompt by converting it to waveform (semantic_to_waveform), and it speaks random words mixed with English words, even though the cloned speaker speaks entirely in Hindi.

P.S. I cloned 5-6 speakers (clear voices), with the same poor results.

The conclusion is that after training for 35 epochs, the semantic vectors are not formed properly, or not in the desired language.

Thanks anyway. I will upload everything, the training data and both models, but they are of no use.

abhiprojectz commented 1 year ago

Good news: I found a way of extracting semantic vectors from wav2vec models without the main hubert_base model.

gitmylo commented 1 year ago

Great. As long as they're at the same rate, with the same number of features, it should work.

JonathanFly commented 1 year ago

Is it distilhubert? There are different versions around: https://huggingface.co/models?search=distilhubert

I noticed it's on RVC too https://github.com/ddPn08/rvc-webui/pull/11

abhiprojectz commented 1 year ago

@gitmylo Hey, I have one doubt: why haven't you used the hubert_base_ls960_L9_km500.bin quantizer? And what was the reason for training for the English language?

gitmylo commented 1 year ago

I haven't used that quantizer because it is not compatible with bark. It uses completely different values to represent the semantic features.

I trained on English because English is the most widely spoken language in the world, and it's supported by bark.

abhiprojectz commented 1 year ago

@gitmylo Thanks, just one last question.

Is it necessary to pass an input of size 768 to the tokenizer? I mean, can we pass an input of size 1024 (or similar) to a custom tokenizer (a new one that accepts an input size of 1024), and after tokenization, will the resulting semantic tokens be compatible with bark or not?

My case is that I am training a new tokenizer model with input size 1024, and I just need to confirm with you whether the output will be bark-compatible.

Extra info: my thinking behind this is that I found a well-trained wav2vec2 model from which I managed to extract semantic vectors, but the output is of size 1024, so I am planning to train a new tokenizer. Should I proceed or not?

gitmylo commented 1 year ago

HuBERT wav2vec outputs have 768 features; that's why I picked that number. If you want to use a different number, pass input_size=1024 in the constructor.

The default input shape is (B, 768), where B is the batch size, and the output shape is (B, 1). With input_size=1024, the input shape is (B, 1024) and the output shape is (B, 1).

Example: on line 161 of customtokenizer.py, in auto_train, change model_training = CustomTokenizer(version=1).to('cuda') to model_training = CustomTokenizer(version=1, input_size=1024).to('cuda').

Make sure the wav2vec model extracts features at the same rate as HuBERT too, or you'll get problems.
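A minimal, framework-free sketch of the shape contract described above. The real CustomTokenizer in hubert/customtokenizer.py is a torch module; this stand-in (all names hypothetical) only illustrates the configurable input_size idea:

```python
# Pure-Python stand-in for the tokenizer's shape bookkeeping; treat this
# as an illustration only, not the repo's actual model.

class TokenizerShapes:
    def __init__(self, input_size=768):
        # 768 matches HuBERT/wav2vec feature width; pass 1024 for a
        # wav2vec2 model that emits 1024-dim frames.
        self.input_size = input_size

    def io_shapes(self, batch_size):
        # input: (B, input_size) feature frames; output: (B, 1) token ids
        return (batch_size, self.input_size), (batch_size, 1)

print(TokenizerShapes().io_shapes(8))                 # ((8, 768), (8, 1))
print(TokenizerShapes(input_size=1024).io_shapes(8))  # ((8, 1024), (8, 1))
```

The output side is unchanged either way, which is why the resulting tokens can stay bark-compatible as long as the feature rate matches.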

abhiprojectz commented 1 year ago

Thanks. Can you please shed light on the rate? I mean, what is the required rate?

> Make sure the Wav2Vec extracts features at the same rate as HuBERT too

For example, this indicwav2vec-hindi is trained on fairseq.

gitmylo commented 1 year ago

About 50×768 features per second, or 50×1024 in your case. If it's slightly different, that's fine.
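That rate can be sanity-checked before training. A sketch (the helper name and tolerance are my own; `num_frames`/`num_channels` stand for whatever your extractor's output array reports):

```python
def check_feature_rate(num_frames, num_channels, clip_seconds,
                       expected_fps=50, expected_channels=768,
                       tolerance=0.1):
    """True if the extractor produces roughly expected_fps frames per
    second with the expected feature width."""
    fps = num_frames / clip_seconds
    rate_ok = abs(fps - expected_fps) / expected_fps <= tolerance
    return rate_ok and num_channels == expected_channels

# A 10-second clip yielding 499 frames of 768 features is close enough:
print(check_feature_rate(499, 768, 10.0))                           # True
# A 1024-wide wav2vec2 at the same rate passes once the width is changed:
print(check_feature_rate(499, 1024, 10.0, expected_channels=1024))  # True
# Half the rate would need resampling or a different model:
print(check_feature_rate(250, 768, 10.0))                           # False
```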

xiabo2011 commented 1 year ago

Was hubert_base_ls960.pt pretrained only on English?

gitmylo commented 1 year ago

> Was hubert_base_ls960.pt pretrained only on English?

It seems to work with more than just English, though not with every single language.

abhiprojectz commented 1 year ago

@gitmylo According to the HuBERT training specs, it seems it was trained on the librispeech_asr dataset, which is a monolingual (English-only) dataset.

Additionally, it is labelled only for English.

Could you confirm: do the quantizer or the semantic features returned by the HuBERT model have anything to do with the language?

https://huggingface.co/facebook/hubert-base-ls960

gitmylo commented 1 year ago

> @gitmylo According to the HuBERT training specs, it seems it was trained on the librispeech_asr dataset, which is a monolingual (English-only) dataset.
>
> Additionally, it is labelled only for English.
>
> Could you confirm: do the quantizer or the semantic features returned by the HuBERT model have anything to do with the language?
>
> https://huggingface.co/facebook/hubert-base-ls960

They do have something to do with the language, but that won't stop you from creating a good quantizer for a non-English language: the model is still able to recognize the patterns, as it's mostly human speech sounds anyway. Your quantizer shouldn't be restricted to English just because the base model was trained on it.

Subarasheese commented 1 year ago

@abhiprojectz I had success training the Portuguese language yesterday; before that I was getting less than ideal results (the model hallucinated a lot more and the voice clones were bad all around).

I used Hubert.

What I did was:

1. Lowered the learning rate (in my case to 0.0005): https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer/blob/master/hubert/customtokenizer.py#L56
2. Redid the dataset. I threw in the Bible, as religious texts apparently tend to produce cadenced speech more often and use more formal language (better for tokens), and produced over 4000 files (4249 to be exact; in your case, for Hindi, you should probably use more).
3. Trained 25 epochs (though I selected the 24th-epoch model as the best one).
4. Tested each model to check which ones produce good audio and accurate cloned voices.

Try lowering the learning rate as much as you bearably can and let it train for several epochs until you find a sweet spot or notice a change in audio generation.

Since you mentioned you were getting "random words and noises", I suggest you select a learning rate of 0.0001 or below so as not to "damage" the model.
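The intuition behind this learning-rate advice can be shown with a toy example (this is ordinary gradient descent on f(x) = x², not the repo's training loop):

```python
# Toy gradient descent on f(x) = x**2: each step is x -= lr * f'(x),
# with f'(x) = 2x. A too-high learning rate overshoots the minimum on
# every step and "damages" the state; a low one decays steadily to 0.

def final_distance(lr, steps=50, x=1.0):
    for _ in range(steps):
        x -= lr * 2 * x
    return abs(x)

too_high = final_distance(1.1)   # |x| grows by 1.2x per step: diverges
low = final_distance(0.05)       # |x| shrinks by 0.9x per step: converges
print(low < 1e-2 < too_high)     # True
```

Real loss surfaces are far messier, but the same failure mode (oscillating past the optimum) is a common cause of noisy, garbled generations.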

acul3 commented 1 year ago

@Subarasheese Thanks for the insight.

Just to clarify: you used HuBERT base, right? (https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt)

I was trying to train HuBERT from scratch on a multilingual (English + Indonesian) dataset after reading this thread,

but now I will try using HuBERT base first.

Subarasheese commented 1 year ago

> @Subarasheese Thanks for the insight.
>
> Just to clarify: you used HuBERT base, right? (https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt)
>
> I was trying to train HuBERT from scratch on a multilingual (English + Indonesian) dataset after reading this thread,
>
> but now I will try using HuBERT base first.

Yes, I used base HuBERT. I would suggest you just train on top of the base HuBERT model before trying to train from scratch.

sachaarbonel commented 1 year ago

I'm currently finetuning HuBERT on common_voice_11_0 Hindi; let's see how it goes.

Surojit-KB commented 10 months ago

> I'm currently finetuning HuBERT on common_voice_11_0 Hindi; let's see how it goes.

Any update on this?

super-animo commented 7 months ago

Can someone share results on Hindi cloning?