lifeiteng / vall-e

PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech). Reproduced demo: https://lifeiteng.github.io/valle/index.html
Apache License 2.0

Working on wenetspeech, help needed. #141

Closed keshawnhsieh closed 1 year ago

keshawnhsieh commented 1 year ago

I want to experiment with wenetspeech, as it contains 10k hours of Mandarin speech data, and hopefully we'll get much better results than with aishell1. However, I got stuck on some silly problems while preparing wenetspeech with lhotse. I only made minor modifications to the aishell recipe to adapt it to wenetspeech, like replacing the string aishell with wenetspeech.

The first problem I met was that GPU memory easily ran out when running step 2 (bin/tokenizer.py). After switching to a GPU with much larger memory (80 GB), it was solved.

The second problem is caused by Encodec while extracting audio features, as shown below.

Traceback (most recent call last):
  File "/home/keshawnhsieh/vall-e/egs/wenetspeech/bin/tokenizer.py", line 262, in <module>
    main()
  File "/home/keshawnhsieh/vall-e/egs/wenetspeech/bin/tokenizer.py", line 198, in main
    cut_set = cut_set.compute_and_store_features_batch(
  File "/home/keshawnhsieh/miniconda3/envs/vall-e/lib/python3.10/site-packages/lhotse/cut/set.py", line 2294, in compute_and_store_features_batch
    features = extractor.extract_batch(
  File "/home/keshawnhsieh/vall-e/valle/data/tokenizer.py", line 348, in extract_batch
    encoded_frames = self.tokenizer.encode(samples.detach().to(device))
  File "/home/keshawnhsieh/vall-e/valle/data/tokenizer.py", line 239, in encode
    return self.codec.encode(wav.to(self.device))
  File "/home/keshawnhsieh/miniconda3/envs/vall-e/lib/python3.10/site-packages/encodec/model.py", line 144, in encode
    encoded_frames.append(self._encode_frame(frame))
  File "/home/keshawnhsieh/miniconda3/envs/vall-e/lib/python3.10/site-packages/encodec/model.py", line 161, in _encode_frame
    emb = self.encoder(x)
  File "/home/keshawnhsieh/miniconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/keshawnhsieh/miniconda3/envs/vall-e/lib/python3.10/site-packages/encodec/modules/seanet.py", line 144, in forward
    return self.model(x)
  File "/home/keshawnhsieh/miniconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/keshawnhsieh/miniconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/home/keshawnhsieh/miniconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/keshawnhsieh/miniconda3/envs/vall-e/lib/python3.10/site-packages/encodec/modules/seanet.py", line 63, in forward
    return self.shortcut(x) + self.block(x)
  File "/home/keshawnhsieh/miniconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/keshawnhsieh/miniconda3/envs/vall-e/lib/python3.10/site-packages/encodec/modules/conv.py", line 204, in forward
    x = pad1d(x, (padding_total, extra_padding), mode=self.pad_mode)
  File "/home/keshawnhsieh/miniconda3/envs/vall-e/lib/python3.10/site-packages/encodec/modules/conv.py", line 92, in pad1d
    padded = F.pad(x, paddings, mode, value)
RuntimeError: input tensor must fit into 32-bit index math

Has anyone else met this problem when working with wenetspeech?
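For reference, the "32-bit index math" error means the tensor handed to F.pad has more than 2**31 - 1 elements, i.e. the batch of full-length wenetspeech recordings is simply too large. A minimal guard before encoding, assuming samples is the batched waveform tensor from extract_batch (the helper name is hypothetical):

import torch

INT32_MAX = 2**31 - 1  # CUDA padding kernels index with 32-bit integers

def assert_indexable(samples: torch.Tensor) -> None:
    # Hypothetical guard: fail early with a readable message instead of
    # the opaque RuntimeError raised deep inside F.pad.
    if samples.numel() > INT32_MAX:
        raise ValueError(
            f"batch has {samples.numel()} elements (> 2**31 - 1); "
            "cut long recordings into shorter segments first"
        )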

lifeiteng commented 1 year ago

The wav should be float values in the range [-1.0, 1.0).
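In other words, if the audio is decoded as int16 PCM it must be normalized before being passed to the codec. A minimal sketch of that conversion, assuming wav is the raw int16 waveform tensor:

import torch

def to_float_waveform(wav: torch.Tensor) -> torch.Tensor:
    # Normalize int16 PCM to float32 in [-1.0, 1.0), the range Encodec expects.
    if wav.dtype == torch.int16:
        wav = wav.float() / 32768.0
    return wav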

no-Seaweed commented 1 year ago

I have faced this problem. I think you input the wrong format of audio into the tokenizer. Try comparing the min, max, shape, etc. of the input between aishell and wenet; I believe you will find the problem.
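A quick way to do that comparison, as a small debugging sketch (x stands for the waveform tensor just before it enters the tokenizer):

def describe(name: str, x) -> None:
    # Print the statistics worth comparing between the aishell and wenet inputs.
    print(f"{name}: dtype={x.dtype} shape={tuple(x.shape)} "
          f"min={x.min().item():.4f} max={x.max().item():.4f}")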

keshawnhsieh commented 1 year ago

Thanks all. I believe the wenet_speech recipe in lhotse has a problem somewhere. I now convert the wenetspeech opus files to wav first and then feed them into lhotse, and it works.

70557dzqc commented 1 year ago

Thanks all. I believe the wenet_speech recipe in lhotse has a problem somewhere. I now convert the wenetspeech opus files to wav first and then feed them into lhotse, and it works.

How is the performance of your model trained on the wenetspeech dataset?

caffeinetoomuch commented 1 year ago

I have faced this problem. I think you input the wrong format of audio into the tokenizer. Try comparing the min, max, shape, etc. of the input between aishell and wenet; I believe you will find the problem.

I am having the same issue with the librilight large dataset. What should the fix be for this? Should I manually filter the failing ones out?

SaltedSlark commented 1 year ago

@keshawnhsieh Hi, can you share your revised scripts for the wenetspeech dataset (including prepare.sh and tokenizer.py)? Many thanks!

keshawnhsieh commented 1 year ago

How is the performance of your model trained on the wenetspeech dataset?

It's not as good as expected. The model trained on the wenetspeech L set can mimic the timbre when given an audio prompt from the dev/test sets or from audiobooks, which can be regarded as easy cases. However, it performs terribly on my own voice recorded with my cell phone.

keshawnhsieh commented 1 year ago

I am having the same issue with the librilight large dataset. What should the fix be for this? Should I manually filter the failing ones out?

Not sure about librilight, but for wenetspeech the problem lies in the data organization. The wenet_speech recipe in lhotse stores all the opus files in a so-called RecordingSet, and the current tokenizing script in vall-e does the Encodec feature extraction directly on those opus files, which causes GPU memory to blow up. So to fix this problem, there are two solutions:

  1. Convert the wenetspeech opus files to wav at the very beginning and then generate a RecordingSet that stores the processed wavs rather than the original opus (sketched below).
  2. Modify the tokenizer code to support extracting segments from opus on the fly, according to both the RecordingSet and the SupervisionSet.
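
A minimal sketch of solution 1, assuming ffmpeg with opus support is on the PATH and the usual 16 kHz mono target; the paths and helper name are illustrative, not the exact script used here:

import subprocess
from pathlib import Path

def opus_to_wav(opus_path: Path, wav_root: Path) -> Path:
    # Decode one wenetspeech opus recording to 16 kHz mono PCM wav.
    wav_path = wav_root / opus_path.with_suffix(".wav").name
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(opus_path),
         "-ar", "16000", "-ac", "1", str(wav_path)],
        check=True, capture_output=True,
    )
    return wav_path

for opus in Path("wenetspeech/audio").rglob("*.opus"):
    opus_to_wav(opus, Path("wenetspeech/wav"))
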
keshawnhsieh commented 1 year ago

@keshawnhsieh Hi, can you share your revised scripts for the wenetspeech dataset (including prepare.sh and tokenizer.py)? Many thanks!

Actually, I just converted all the opus files in wenetspeech to wav and handled them as we did with the aishell dataset. However, this generates lots of small wav files and is not friendly to HDDs with slow I/O. You can also consider storing them in tar files like wenet does (see the sketch below); it will save a lot of preparation time.
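For the tar idea, something along these lines could work; this is a sketch with the Python standard library, and the shard size and paths are arbitrary:

import tarfile
from pathlib import Path

def pack_shards(wav_root: Path, out_dir: Path, shard_size: int = 1000) -> None:
    # Group the many small wav files into larger tar shards so the HDD
    # does sequential reads instead of millions of small seeks.
    wavs = sorted(wav_root.rglob("*.wav"))
    for i in range(0, len(wavs), shard_size):
        with tarfile.open(out_dir / f"shard-{i // shard_size:06d}.tar", "w") as tar:
            for wav in wavs[i:i + shard_size]:
                tar.add(wav, arcname=wav.name)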

SaltedSlark commented 1 year ago

Actually, I just converted all the opus files in wenetspeech to wav and handled them as we did with the aishell dataset. However, this generates lots of small wav files and is not friendly to HDDs with slow I/O. You can also consider storing them in tar files like wenet does; it will save a lot of preparation time.

Thanks for your reply! One more question: after converting, shall I revise any code in the wenet recipe in lhotse? Will it find my wav files at a specific path (same as the opus path) automatically? Furthermore, if I want to filter out some short audios, is it enough to just remove those wav files? Much love for replying again!

keshawnhsieh commented 1 year ago

Thanks for your reply! One more question: after converting, shall I revise any code in the wenet recipe in lhotse? Will it find my wav files at a specific path (same as the opus path) automatically? Furthermore, if I want to filter out some short audios, is it enough to just remove those wav files? Much love for replying again!

I'd advise you to follow the structure of the aishell dataset when converting the opus files to wav. Then you can directly use the aishell recipe in lhotse rather than the wenet_speech recipe (see the sketch below). You can leave your wechat here if you need more discussion :)
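If the wavs are laid out aishell-style, the preparation step could then be as simple as the following, assuming lhotse's prepare_aishell recipe and illustrative directory names:

from lhotse.recipes import prepare_aishell

# Build RecordingSet/SupervisionSet manifests from the aishell-style layout,
# so the rest of the vall-e aishell recipe can be reused unchanged.
manifests = prepare_aishell(
    corpus_dir="data/wenetspeech_as_aishell",
    output_dir="data/manifests",
)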

SaltedSlark commented 1 year ago

I'd advise you to follow the structure of the aishell dataset when converting the opus files to wav. Then you can directly use the aishell recipe in lhotse rather than the wenet_speech recipe. You can leave your wechat here if you need more discussion :)

My wechat is HONEYTYC, thank you so much again!

kingmpw2015 commented 11 months ago

@keshawnhsieh @SaltedSlark Did you later achieve better results in the model experiments on the WenetSpeech dataset? Did it not show any improvement compared to the Aishell dataset? My wechat is kingmpw, feel free to discuss!

keshawnhsieh commented 11 months ago

@keshawnhsieh @SaltedSlark Did you later achieve better results in the model experiments on the WenetSpeech dataset? Did it not show any improvement compared to the Aishell dataset? My wechat is kingmpw, feel free to discuss!

Only the experiment on the original 10k hours of wenetspeech data has been conducted so far. The result is as summarized before: not good enough. To be more precise, terrible prosody and unstable timbre.

The experiment I hope to run in the next few days is trained on an enhanced version of the 10k hours of wenetspeech data. All audio was enhanced with Voicefixer to remove background noise/music before being used to train vall-e (see the sketch at the end of this comment). Glad to share those further results with you. @kingmpw2015 Added you on wechat :)

Another larger-scale experiment, fed with more than 40k hours of audio data crawled from himalaya audiobooks, is also ongoing. Due to resource limitations, the training should finish in approximately a month.
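
For the enhancement step, a sketch of how the Voicefixer pass might look, based on the voicefixer package's restore API as far as I know; the file paths are illustrative:

from voicefixer import VoiceFixer

vf = VoiceFixer()
# Remove background noise/music from one utterance before it enters training.
vf.restore(
    input="wav/utt0001.wav",       # degraded source audio
    output="enhanced/utt0001.wav",  # denoised result
    cuda=True,                      # run the enhancement model on GPU
    mode=0,                         # the package's default restoration mode
)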

potatoker commented 10 months ago

Hi, have you finished the enhanced 10k-hour wenetspeech training and the 40k-hour himalaya training, and how do the new inference results look? It would be really appreciated if further progress were shared~

decajcd commented 2 months ago

Hi, have you finished the enhanced 10k-hour wenetspeech training and the 40k-hour himalaya training, and how do the new inference results look? It would be really appreciated if further progress were shared~