Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
This pull request refactors the Amphion Evaluation module to speed up inference and add computation modes that make it compatible with more tasks. It includes:
CUDA support: Previously, all metrics were computed on the CPU, which is too slow for large-scale test sets containing more than 10,000 utterances. Now all metrics except MCD and PESQ, which rely on fully encapsulated external packages, can be computed on the GPU.
Support for computing intelligibility-related metrics with the ground truth transcript: Previously, CER and WER were computed between the ASR transcripts of the ground truth audio and the predicted audio. However, the ground truth transcript itself is available for tasks like TTS. CER and WER can now be computed between the ground truth transcript and the ASR output on the predicted audio.
Support for similarity computation without reference or ground truth: Previously, speaker similarity was computed over paired ground truth and predicted audio. However, such pairs may not be available for tasks like SVC. Speaker similarity can now be computed in a reference-free way, as the average score over all possible audio pairs between the reference folder and the generated folder.
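To illustrate the transcript-based mode, here is a minimal pure-Python sketch of WER and CER computed between a ground truth transcript and an ASR hypothesis. It is not the code in this pull request: the function names `edit_distance` and `error_rate`, the lowercase and whitespace normalization, and the example strings are all assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """WER (unit="word") or CER (unit="char") against a reference transcript.

    Normalization here (lowercasing, whitespace handling) is an assumption.
    """
    reference, hypothesis = reference.lower(), hypothesis.lower()
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: ground truth text vs. ASR output on predicted audio.
ground_truth = "the cat sat on the mat"
asr_output = "the cat sat on a mat"
wer = error_rate(ground_truth, asr_output, unit="word")
cer = error_rate(ground_truth, asr_output, unit="char")
```

In this mode the ASR model runs only on the predicted audio, so its recognition errors on the ground truth audio no longer affect the score.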
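The reference-free similarity mode can be sketched in a similar way, assuming speaker embeddings have already been extracted for every file in the reference and generated folders. The helper names and the toy two-dimensional embeddings below are hypothetical. Cosine similarity is assumed as the scoring function, and the actual module may use a different model or score.

```python
import math
from itertools import product


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_pairwise_similarity(reference_embs, generated_embs):
    """Mean similarity over every (reference, generated) embedding pair."""
    scores = [
        cosine_similarity(ref, gen)
        for ref, gen in product(reference_embs, generated_embs)
    ]
    if not scores:
        raise ValueError("both embedding lists must be non-empty")
    return sum(scores) / len(scores)


# Toy embeddings standing in for real speaker-verification outputs.
reference_embs = [[1.0, 0.0], [0.0, 1.0]]
generated_embs = [[1.0, 0.0]]
score = average_pairwise_similarity(reference_embs, generated_embs)
```

Because every reference file is compared with every generated file, the number of comparisons grows with the product of the two folder sizes.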
✨ Description
🚧 Related Issues
#110
👨‍💻 Changes Proposed
🧑‍🤝‍🧑 Who Can Review?
@lmxue @HeCheng0625 @zhizhengwu
✅ Checklist