google / visqol

Perceptual Quality Estimator for speech and audio
Apache License 2.0
683 stars 124 forks

Does visqol use gpu? Best settings for evaluating noise suppression? #80

Open opooladz opened 1 year ago

opooladz commented 1 year ago

Hi thanks for the repo.

Quick question: when I run visqol, I don't see any GPU usage. Should I? Perhaps bazel did not install correctly, or the version of TF being used is not utilizing the GPU. I am running over thousands of files and it's taking quite some time...

Also, I wanted to check what the best settings are for evaluating noise suppression with visqol. I see the two flags --use_speech_mode and --use_unscaled_speech_mos_mapping. If I use them, might visqol ignore some bands of noise that are present in the file (I see it's only sensitive up to 8 kHz)? Should I run visqol in both audio mode and speech mode and average the two (perhaps with a weighted average)?

Thanks for your guidance in advance.

mchinen commented 1 year ago

Hi, thanks for the question! ViSQOL does have a TFLite model, but it runs on CPU and is not the main bottleneck. Even in batch mode, it evaluates the list of files serially. This could be improved.
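Since the files are evaluated serially, one possible workaround is to launch several independent visqol processes from a small driver script. The following is a minimal sketch, not part of ViSQOL itself. It assumes the binary lives at `./bazel-bin/visqol`, that it accepts `--reference_file`, `--degraded_file` and `--use_speech_mode`, and that it prints a line containing `MOS-LQO:` followed by the score. The helper names (`build_command`, `parse_mos`, `run_visqol`, `score_pairs`) are hypothetical.

```python
import re
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Assumption: path to the binary produced by the bazel build; adjust as needed.
VISQOL_BIN = "./bazel-bin/visqol"


def build_command(reference, degraded, speech_mode=True, binary=VISQOL_BIN):
    """Build the argument list for a single reference/degraded pair."""
    cmd = [binary, "--reference_file", reference, "--degraded_file", degraded]
    if speech_mode:
        cmd.append("--use_speech_mode")
    return cmd


def parse_mos(stdout):
    """Extract the score, assuming a 'MOS-LQO: <number>' line in the output."""
    match = re.search(r"MOS-LQO:\s*([0-9]+(?:\.[0-9]+)?)", stdout)
    if match is None:
        raise ValueError("no MOS-LQO value found in output")
    return float(match.group(1))


def run_visqol(cmd):
    """Run one visqol process and return its standard output as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def score_pairs(pairs, speech_mode=True, max_workers=4, runner=run_visqol):
    """Score (reference, degraded) pairs concurrently, preserving input order.

    Threads are sufficient here because each worker only waits on a subprocess.
    """
    def score_one(pair):
        reference, degraded = pair
        return parse_mos(runner(build_command(reference, degraded, speech_mode)))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score_one, pairs))
```

Calling `score_pairs([("clean1.wav", "out1.wav"), ("clean2.wav", "out2.wav")], max_workers=8)` would run up to eight processes at once. The `runner` argument exists only so the parsing logic can be exercised without the binary.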

I don't recommend averaging the two modes, because they are quite different in scale. We don't yet have support for greater than wideband speech, and it's a limitation. For noise suppression, ViSQOL will require the clean reference, which isn't always available. If you're looking for a no-reference model specifically for noise suppression, I'd recommend DNSMOS.

opooladz commented 1 year ago

Hi, thanks for the quick response. I actually have access to the clean speech as well as the noisy speech, so I can use a reference metric. I will look into DNSMOS as well. Right now, I am using PESQ (sample referential), as well as Fréchet Audio Distance (reference-free or dataset referential).

Assume a model $X = S + N$. $S$ is speech and $N$ is noise.

I feed $X$ into a noise suppressor and get $\hat{S}$, so we have $X$ and $\hat{S}$. If we compute ViSQOL( $S,X$ ) under speech settings, might it ignore certain frequencies where noise occurs in $X$ (since it's only sensitive up to 8 kHz)? The same question applies to ViSQOL( $S,\hat{S}$ ).

Right now, I am getting the following results averaged over 10k samples.

Audio Settings: ViSQOL( $S,X$ ) = 3.1, ViSQOL( $S,\hat{S}$ ) = 3.7

Speech Settings: ViSQOL( $S,X$ ) = 1.2, ViSQOL( $S,\hat{S}$ ) = 1.9

I'm wondering what the recommended settings are for using ViSQOL in my task. Perhaps both modes are insightful in different ways. If so, maybe you can help me understand the intuition behind the results.
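Because the two modes are on different scales, one way to read the numbers above is to compare noisy and enhanced scores within each mode rather than averaging across modes. The following is a minimal sketch using only the four averages quoted above. The labels `noisy` (for ViSQOL( $S,X$ )) and `enhanced` (for ViSQOL( $S,\hat{S}$ )) and the function name are hypothetical.

```python
# Averages over 10k samples, as quoted above.
scores = {
    "audio": {"noisy": 3.1, "enhanced": 3.7},
    "speech": {"noisy": 1.2, "enhanced": 1.9},
}


def improvement_by_mode(scores):
    """Return the enhanced-minus-noisy gain for each mode, kept separate."""
    return {
        mode: round(values["enhanced"] - values["noisy"], 2)
        for mode, values in scores.items()
    }


print(improvement_by_mode(scores))  # {'audio': 0.6, 'speech': 0.7}
```

Both modes show the suppressor improving on the noisy input by a similar margin, but the gains are not directly comparable since each is measured on its own scale.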