Open XiaoshanHsj opened 1 day ago
The test set is LibriTTS test-clean, and the number of samples is 4833.
Thank you for doing such great work and open-sourcing it.
I used the large model (WavTokenizer-large-320-24k-4096) to reconstruct LibriTTS audio. However, the results are worse than those reported in the paper, which used the small model.
The results are:
UTMOS_raw 19604.11721920967 4.056303997353543
UTMOS_encodec 19604.11721920967 3.8397375189096272
PESQ: 9956.64894938469 2.060138412866685
F1_score: 4432.935466635334 0.917602042358794 2
STOI: 0.8924008398453133
The paper reports UTMOS_encodec 4.0486, PESQ 2.3730, and STOI 0.9139.
Is this performance degradation expected?
Thanks~
The large model generalizes significantly better, and as a result I observed a slight performance drop on the LibriTTS test-clean dataset (though the difference is minimal). However, your results may also be influenced by other factors, such as the CUDA version, and it seems that four entries are missing from your test set. Subjective evaluation may also be important. Thank you~
Thanks for your reply. I used the small model to reconstruct the waveforms. The results are:
UTMOS_raw 19604.11721920967 4.056303997353543
UTMOS_encodec 19604.11721920967 3.9794073770832084
PESQ: 11974.47469329834 2.477648395054488
F1_score: 4487.17120589376 0.9290209536011925 3
STOI: 0.9199737990446866
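The paired numbers in these logs look like a running total followed by a per-utterance mean. That reading is an assumption, since the thread never states it. Under that assumption, dividing the total by the mean gives the number of utterances each metric was actually scored on, which can be compared with the stated 4833 samples. The helper name `implied_count` is hypothetical and is not part of the evaluation script.

```python
# Assumption: each log entry is (sum over utterances, per-utterance mean).
# Values are copied from the small-model results above.
small_model_log = {
    "UTMOS_raw": (19604.11721920967, 4.056303997353543),
    "PESQ": (11974.47469329834, 2.477648395054488),
    "F1_score": (4487.17120589376, 0.9290209536011925),
}


def implied_count(total: float, mean: float) -> int:
    """Return the number of utterances implied by a (sum, mean) pair."""
    return round(total / mean)


# Map each metric name to the utterance count its sum and mean imply.
counts = {
    name: implied_count(total, mean)
    for name, (total, mean) in small_model_log.items()
}

for name, n in counts.items():
    print(f"{name}: {n} utterances")
```

Under this reading, UTMOS_raw and PESQ cover all 4833 utterances, while F1_score covers 4830. The trailing `3` after the F1 mean may therefore be a count of skipped utterances, though the thread does not confirm this.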
OK, it appears that the results vary somewhat across the different metrics.
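To make that variation concrete, here is a minimal sketch that subtracts the paper's figures from the two runs reported in this thread. The per-utterance means are rounded to four decimals to match the paper's precision, UTMOS here refers to the UTMOS_encodec value, and the names `paper`, `runs`, and `deltas` are illustrative only.

```python
# Figures quoted in this thread, rounded to four decimals.
# UTMOS is the UTMOS_encodec value (score of the reconstructed audio).
paper = {"UTMOS": 4.0486, "PESQ": 2.3730, "STOI": 0.9139}
runs = {
    "large": {"UTMOS": 3.8397, "PESQ": 2.0601, "STOI": 0.8924},
    "small": {"UTMOS": 3.9794, "PESQ": 2.4776, "STOI": 0.9200},
}


def deltas(run: dict) -> dict:
    """Return run minus paper for each metric (positive means better than the paper)."""
    return {metric: round(run[metric] - paper[metric], 4) for metric in paper}


for name, run in runs.items():
    print(name, deltas(run))
```

With these numbers, the large-model run is below the paper on all three metrics, while the small-model run is above the paper on PESQ and STOI and below it on UTMOS.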