Open howitry opened 6 days ago
When using the 40 tokens/s configuration, although the quality of the reconstructed audio is very good, there are often some mispronunciations. Have you measured the CER performance of the reconstructed audio?
The CER or WER results seem satisfactory from our experiments. Could you kindly provide further experimental details, such as whether you used the WavTokenizer-small or WavTokenizer-medium version? Additionally, on which test set were the evaluations conducted? Please note that the WavTokenizer-small version has very limited generalization capability.
I trained WavTokenizer on about 60,000 hours of data, with a 1:1 ratio of English to Chinese. I have trained for 3 epochs so far, and when checking the Chinese reconstructions, I found some incorrect pronunciations.
Training for only three epochs seems insufficient. Because the data is randomly sampled during training, three epochs do not guarantee a complete pass through the dataset. Extending training to 12-24 epochs could yield better results.
After restoring our own Korean speech data using the WavTokenizer-medium-speech-75token checkpoint and measuring the CER, there was a significant drop in performance. Could you share the CER or WER comparison results you conducted?
In our experiment, we obtained the following results:
- GroundTruth CER: 4.3978%
- Reconstructed CER: 11.8222%
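For reference, CER figures like those above are typically computed as the character-level edit distance between the ASR transcript of the audio and the reference text, divided by the reference length. A minimal sketch (the example strings and function names below are illustrative, not taken from this thread's evaluation pipeline):

```python
# Character error rate (CER) sketch: CER = (S + D + I) / N, where S, D, I are
# substitutions, deletions, insertions, and N is the reference character count.

def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters (space-optimized DP)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance ref[:0] vs hyp[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds old dp[j-1] (diagonal)
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # deletion
                dp[j - 1] + 1,                      # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution / match
            )
            prev = cur
    return dp[n]

def cer(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of reference characters."""
    # Spaces are usually dropped for Chinese/Korean-style CER.
    ref = reference.replace(" ", "")
    hyp = hypothesis.replace(" ", "")
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(f"{cer('今天天气很好', '今天天器很好'):.4f}")  # one substitution over 6 chars
```

In practice, the hypothesis string would come from running an ASR model over the reconstructed audio, so the reported CER also folds in the ASR model's own error rate (which is why the ground-truth CER above is nonzero).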
The WavTokenizer-medium-speech model was trained on a very limited amount of Korean data, so this degradation is expected. You may consider testing WER or CER on the English test set (LibriTTS test-clean). Additionally, retraining WavTokenizer with Korean data is likely to yield significantly better performance.