jishengpeng / WavTokenizer

SOTA discrete acoustic codec models with 40 tokens per second for audio language modeling
MIT License

CER Performance of Reconstructed Audio #34

Open · howitry opened this issue 6 days ago

howitry commented 6 days ago

When using the 40 tokens/s configuration, although the quality of the reconstructed audio is very good, there are often some mispronunciations. Have you measured the CER performance of the reconstructed audio?
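
For reference, a minimal encode/decode round-trip of the kind being evaluated here, following the usage pattern in the repo README, looks roughly like the sketch below. The config and checkpoint paths are placeholders, and the helper names (from_pretrained0802, encode_infer, convert_audio) should be verified against your checkout; this is a sketch, not the canonical evaluation script.

```python
# Sketch of the WavTokenizer reconstruction round-trip (40 tokens/s config).
# Paths are placeholders; verify method names against your checkout of the repo.
import torch
import torchaudio
from encoder.utils import convert_audio
from decoder.pretrained import WavTokenizer

device = torch.device("cpu")
config_path = "./configs/wavtokenizer_config.yaml"  # placeholder
model_path = "./wavtokenizer_checkpoint.ckpt"       # placeholder

wavtokenizer = WavTokenizer.from_pretrained0802(config_path, model_path).to(device)

wav, sr = torchaudio.load("input.wav")
wav = convert_audio(wav, sr, 24000, 1).to(device)   # resample to 24 kHz mono

bandwidth_id = torch.tensor([0])
features, discrete_codes = wavtokenizer.encode_infer(wav, bandwidth_id=bandwidth_id)
audio_out = wavtokenizer.decode(features, bandwidth_id=bandwidth_id)

torchaudio.save("reconstructed.wav", audio_out.cpu(), sample_rate=24000,
                encoding="PCM_S", bits_per_sample=16)
```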

jishengpeng commented 6 days ago

In our experiments, the CER and WER results were satisfactory. Could you provide further details, such as whether you used the WavTokenizer-small or WavTokenizer-medium version, and on which test set the evaluations were conducted? Please note that the WavTokenizer-small version has very limited generalization capability.

howitry commented 6 days ago

I trained WavTokenizer on about 60,000 hours of data with a 1:1 ratio of English to Chinese. I have trained for 3 epochs so far, and when checking the Chinese reconstructions I found some incorrect pronunciations.

jishengpeng commented 6 days ago

Training for only three epochs seems insufficient. Because the data is randomly sampled during training, a full pass through the dataset may not have been completed yet. Extending the training to 12-24 epochs could yield better results.

YoungloLee commented 16 hours ago

After reconstructing our own Korean speech data with the WavTokenizer-medium-speech-75token checkpoint and measuring the CER, we observed a significant drop in performance. Could you share the CER or WER comparison results from your experiments?

In our experiment, we obtained the following results:

  • GroundTruth CER: 4.3978%
  • Reconstructed CER: 11.8222%

jishengpeng commented 9 hours ago

The WavTokenizer-medium-speech model was trained on a very limited amount of Korean data, so this degradation is expected. You may consider testing WER or CER on the English test set (LibriTTS test-clean). Additionally, retraining a version of WavTokenizer on Korean data is likely to yield significantly better performance.
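
For anyone wanting to reproduce such a comparison, one straightforward approach (a sketch, not an official script from this repo) is to transcribe both the ground-truth and the reconstructed audio with the same ASR model and score both against the reference transcript, for example with openai-whisper and jiwer:

```python
# Sketch of a CER comparison between ground-truth and reconstructed audio.
# Assumes openai-whisper and jiwer are installed; any ASR/metric stack works similarly.
import jiwer
import whisper

asr = whisper.load_model("small")  # pick a model suited to the test language

def transcribe(path: str) -> str:
    """Run ASR on one audio file and return the hypothesis text."""
    return asr.transcribe(path)["text"].strip()

reference_text = "reference transcript of the utterance"  # from the test set metadata
gt_hyp = transcribe("ground_truth.wav")
rec_hyp = transcribe("reconstructed.wav")

print("GroundTruth CER:  ", jiwer.cer(reference_text, gt_hyp))
print("Reconstructed CER:", jiwer.cer(reference_text, rec_hyp))
```

In practice the transcripts should be normalized consistently (casing, punctuation, spacing) and the CER aggregated over the full test set rather than per utterance.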