mustass / diffusion_models_for_speech

Deep Learning course project repository.
https://kurser.dtu.dk/course/02456

About Evaluation Metrics #14

Open panosapos opened 1 year ago

panosapos commented 1 year ago

How can we objectively evaluate our model?
Some random thoughts below:

  1. Intrusive vs. non-intrusive metrics. In speech generation we generally have two kinds of metrics: intrusive and non-intrusive. Intrusive metrics require a reference signal that is compared with the output, so they can only be used when we have an input/reconstructed-output pair. Non-intrusive metrics do not require such a pair, but they are either not widely used or are themselves the output of some other pretrained, complex model. Maybe we could use an intrusive metric during validation, to keep track of training progress?

  2. L1 and L2 distances between speech signals actually make no sense as metrics: they do not clearly indicate speech quality, let alone speech perception.

  3. Practically every paper uses a different combination of objective metrics, so there is no clear optimal selection here.

panosapos commented 1 year ago

A list of intrusive metrics: useful for conditional synthesis, as they require an input signal paired with a "reconstructed" one.

  1. PESQ. A complex metric that estimates the quality of a speech signal. It was recommended by ITU-T as the standard metric for telecommunication systems (telephony and VoIP); it has since been superseded by POLQA, but it is still widely used in DL (especially in speech enhancement models), as it provides a good measure of speech quality. ++ available as a torchmetric

  2. STOI. Captures speech intelligibility rather than speech quality, i.e. the fraction of words/syllables that a listener would understand. More useful in very noisy environments. ++ also available as a torchmetric. A validation sketch using both metrics is below.
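
To make the "intrusive metric during validation" idea concrete, here is a minimal sketch using the torchmetrics wrappers. The 16 kHz sampling rate, the wide-band mode, and the `validation_metrics` helper are just assumptions for illustration (torchmetrics also needs the `pesq` and `pystoi` backends installed):

```python
import torch
from torchmetrics.audio import (
    PerceptualEvaluationSpeechQuality,
    ShortTimeObjectiveIntelligibility,
)

# Assumed setup: 16 kHz mono waveforms; "wb" = wide-band PESQ.
pesq = PerceptualEvaluationSpeechQuality(fs=16000, mode="wb")
stoi = ShortTimeObjectiveIntelligibility(fs=16000)

def validation_metrics(generated: torch.Tensor, reference: torch.Tensor) -> dict:
    # Both metrics are intrusive: they score the generated waveform
    # against the clean reference, one pair at a time.
    return {
        "pesq": pesq(generated, reference).item(),  # higher is better, MOS-like scale
        "stoi": stoi(generated, reference).item(),  # higher is better, roughly 0..1
    }
```

Logging these two numbers every validation epoch should be enough to see whether the reconstructions are actually improving.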

panosapos commented 1 year ago

A list of non-intrusive metrics: useful for unconditional audio generation.

  1. FAD (Fréchet Audio Distance). Captures the distance between the reference and generated distributions. It is a reference-free metric originally designed to measure how a given audio clip compares to clean, studio-recorded music (not an issue for us: it can be used for speech, since we only care about relative values). It uses the activations of the last layer of VGGish, an audio classifier, computed on mel-spectrograms, as the embedding upon which the metric is calculated. The VGGish embeddings are fitted with multivariate Gaussians, and the FAD itself is the Fréchet distance between the two Gaussians N_r and N_g representing the reference and generated distributions. Available here: https://github.com/google-research/google-research/tree/master/frechet_audio_distance

Important: it requires at least 25 minutes of audio to get a stable FAD score. A minimal sketch of the distance computation itself is below.
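
For intuition, here is a rough sketch of the Fréchet distance between the two fitted Gaussians, assuming we already have VGGish (or any other) embeddings as numpy arrays; the names `ref_emb`/`gen_emb` are hypothetical, and in practice we would just use the Google repo linked above:

```python
import numpy as np
from scipy import linalg

def frechet_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """ref_emb, gen_emb: (n_clips, emb_dim) embeddings of reference/generated audio."""
    # Fit a multivariate Gaussian to each embedding set.
    mu_r, sigma_r = ref_emb.mean(axis=0), np.cov(ref_emb, rowvar=False)
    mu_g, sigma_g = gen_emb.mean(axis=0), np.cov(gen_emb, rowvar=False)

    # FAD = ||mu_r - mu_g||^2 + Tr(sigma_r + sigma_g - 2 (sigma_r sigma_g)^(1/2))
    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```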

  2. NDB (Number of Statistically Different Bins). Captures the diversity of the generated samples. k-means is applied to the audio space (clusters == bins); test samples are assigned to the k clusters/bins using the L2 distance between the samples and the cluster centroids. The final NDB score is obtained by counting the number of statistically different bins and dividing by the number of clusters. Initially meant for GANs, to investigate the issue of mode collapse, but now also used for diffusion models. Available here (not implemented in Torch yet; a rough sketch of the procedure is below): https://github.com/eitanrich/gans-n-gmms
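
Since there is no Torch implementation yet, here is a rough sketch of the NDB procedure with scikit-learn and scipy, just to make the steps concrete; the function name, the choice of k, and the significance level are all assumptions, and the reference implementation is the gans-n-gmms repo above:

```python
import numpy as np
from scipy.stats import norm
from sklearn.cluster import KMeans

def ndb_score(ref: np.ndarray, gen: np.ndarray, k: int = 50, alpha: float = 0.05) -> float:
    """ref, gen: (n_samples, n_features) flattened audio features or embeddings."""
    # 1. Cluster the reference set into k bins with k-means (L2 distance).
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(ref)

    # 2. Assign both sets to the nearest centroid and compute per-bin proportions.
    p_ref = np.bincount(km.predict(ref), minlength=k) / len(ref)
    p_gen = np.bincount(km.predict(gen), minlength=k) / len(gen)

    # 3. Two-proportion z-test per bin: a bin counts as "statistically different"
    #    if the reference and generated proportions differ significantly.
    n_r, n_g = len(ref), len(gen)
    pooled = (p_ref * n_r + p_gen * n_g) / (n_r + n_g)
    se = np.sqrt(pooled * (1.0 - pooled) * (1.0 / n_r + 1.0 / n_g))
    z = np.abs(p_ref - p_gen) / np.maximum(se, 1e-12)
    different = z > norm.ppf(1.0 - alpha / 2.0)

    # 4. NDB = number of statistically different bins / number of bins.
    return float(different.sum() / k)
```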