jishengpeng / WavTokenizer

SOTA discrete acoustic codec models with 40 tokens per second for audio language modeling
MIT License

worse performance of large model compared to small model? #54

Open XiaoshanHsj opened 1 day ago

XiaoshanHsj commented 1 day ago

Thank you for doing such great work and open-sourcing it.

I used the large model (WavTokenizer-large-320-24k-4096) to reconstruct LibriTTS audio. However, the results are worse than those reported in the paper, which used the small model.

My results are:

UTMOS_raw: 19604.11721920967 4.056303997353543
UTMOS_encodec: 19604.11721920967 3.8397375189096272
PESQ: 9956.64894938469 2.060138412866685
F1_score: 4432.935466635334 0.917602042358794 2
STOI: 0.8924008398453133

The paper reports UTMOS_encodec 4.0486, PESQ 2.3730, and STOI 0.9139.

Is this performance degradation expected?

Thanks~

XiaoshanHsj commented 1 day ago

The test set is LibriTTS test-clean, and the number of samples is 4833.

jishengpeng commented 1 day ago

Because the generalization capability of the large model is significantly increased, I also observed a slight performance drop on the LibriTTS test-clean dataset (though the difference is minimal). However, your results may also be influenced by other factors, such as the CUDA version, and it seems that four entries are missing from your test set. Subjective evaluation may also be important. Thank you~
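For readers who want to check the sample count behind each metric, here is a minimal sketch. It assumes that each metric line in the logs prints a running sum followed by the mean; the thread does not state this, so treat it as a guess about the evaluation script's output format. The helper `implied_count` is hypothetical and not part of WavTokenizer.

```python
# Assumption: each pair is (running sum, mean) as printed in the large-model log.
# Values are copied verbatim from the thread.
large_model = {
    "UTMOS_raw": (19604.11721920967, 4.056303997353543),
    "PESQ": (9956.64894938469, 2.060138412866685),
    "F1_score": (4432.935466635334, 0.917602042358794),
}


def implied_count(total, mean):
    """Return the number of samples that would produce this sum/mean pair."""
    return round(total / mean)


counts = {name: implied_count(s, m) for name, (s, m) in large_model.items()}

for name, n in counts.items():
    print(f"{name}: {n} samples")
```

Under that assumption, UTMOS_raw and PESQ are averaged over 4833 files and F1_score over 4831. UTMOS_encodec is left out because its first number is identical to the one printed for UTMOS_raw, so that pair is not self-consistent as a sum and mean.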

XiaoshanHsj commented 1 day ago

Thanks for your reply. I also used the small model to reconstruct the waveforms. The results are:

UTMOS_raw: 19604.11721920967 4.056303997353543
UTMOS_encodec: 19604.11721920967 3.9794073770832084
PESQ: 11974.47469329834 2.477648395054488
F1_score: 4487.17120589376 0.9290209536011925 3
STOI: 0.9199737990446866

jishengpeng commented 1 day ago

OK. It appears that the results vary from metric to metric.
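To make the per-metric variation concrete, the sketch below compares the mean scores posted in this thread against the paper's numbers. The values are rounded to four decimals for readability, and the `deltas` helper is hypothetical. Per the original post, the paper's figures are for the small model.

```python
# Mean scores copied from the thread, rounded to 4 decimals.
paper = {"UTMOS_encodec": 4.0486, "PESQ": 2.3730, "STOI": 0.9139}
small = {"UTMOS_encodec": 3.9794, "PESQ": 2.4776, "STOI": 0.9200}
large = {"UTMOS_encodec": 3.8397, "PESQ": 2.0601, "STOI": 0.8924}


def deltas(run, reference):
    """Per-metric difference run - reference (positive means the run scored higher)."""
    return {name: round(run[name] - reference[name], 4) for name in reference}


small_vs_paper = deltas(small, paper)
large_vs_paper = deltas(large, paper)

for label, diff in (("small", small_vs_paper), ("large", large_vs_paper)):
    summary = ", ".join(f"{name} {value:+.4f}" for name, value in diff.items())
    print(f"{label} vs paper: {summary}")
```

In these numbers the reproduced small model is below the paper on UTMOS_encodec but above it on PESQ and STOI, while the large model is below the paper on all three.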