pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

Batch processing torchaudio-squim #3424

Open bloodraven66 opened 1 year ago

bloodraven66 commented 1 year ago

🚀 The feature

This is regarding the objective and subjective metrics available as part of torchaudio-squim (https://pytorch.org/audio/main/tutorials/squim_tutorial.html#sphx-glr-tutorials-squim-tutorial-py).

Currently, it works only at batch size = 1, i.e., the waveforms are expected to be of shape (1, N). Can we have batch-level processing?

Motivation, pitch

Researchers usually run these metrics on test sets across a range of model configurations. I'm also looking at using the subjective model with multiple non-matching references for a single audio clip. Batch processing would significantly speed this up.

Alternatives

No response

Additional context

No response

nateanl commented 1 year ago

Hi @bloodraven66, thanks for trying the squim model for evaluation. Actually, you can! Just pass a batched tensor to the model and it will generate scores in batch as well.
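
For example, here is a minimal sketch of batched inference, assuming the pipelines bundle used in the squim tutorial (`torchaudio.pipelines.SQUIM_OBJECTIVE`); the batch size and durations are illustrative:

```python
import torch
import torchaudio

# Sketch of batched SQUIM inference, following the pipelines API from the
# squim tutorial; batch size and clip length here are made up.
model = torchaudio.pipelines.SQUIM_OBJECTIVE.get_model()
model.eval()

# A batch of 4 waveforms, each 5 s at 16 kHz, stacked along dim 0.
batch = torch.randn(4, 16000 * 5)

with torch.no_grad():
    stoi, pesq, si_sdr = model(batch)  # each score tensor has shape (4,)
print(stoi.shape, pesq.shape, si_sdr.shape)
```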

mthrok commented 1 year ago

I guess we can update the tutorial so that it contains links to the documentation, using `:py:class:`.

I had a bit of difficulty finding the answer to this, but the following seems to be the relevant documentation.

https://pytorch.org/audio/main/generated/torchaudio.prototype.SquimSubjective.html#forward

jfsantos commented 8 months ago

Batch processing works, but if the sequences have different lengths and you end up padding them to a common length, the predictions change, since masking is not supported. Are sequences of different lengths used during training? If so, is there any masking that could be introduced into the implementation for inference?
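
A small sketch of what I mean, assuming the same pipelines bundle as above (shapes are illustrative):

```python
import torch
import torch.nn.functional as F
import torchaudio

# Illustration of the padding issue: scoring a 3 s clip on its own vs.
# zero-padded to 5 s generally yields different predictions, because the
# model has no mask telling it to ignore the padded samples.
model = torchaudio.pipelines.SQUIM_OBJECTIVE.get_model()
model.eval()

short = torch.randn(1, 16000 * 3)       # 3 s clip at 16 kHz
padded = F.pad(short, (0, 16000 * 2))   # zero-padded to 5 s

with torch.no_grad():
    stoi_short, _, _ = model(short)
    stoi_padded, _, _ = model(padded)
print(stoi_short.item(), stoi_padded.item())  # generally not equal
```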

nateanl commented 8 months ago

During training, all audio samples are truncated to 5 seconds. Masking is difficult to support for the Objective model since it uses DPRNN as its backbone; how to transpose the mask along with the RNN input would need to be considered.
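
To make the difficulty concrete, here is a conceptual sketch (not the model's actual code, and the chunk sizes are made up): DPRNN folds the time axis into overlapping chunks, so any per-sample time mask would have to be folded the same way and carried through both the intra- and inter-chunk RNNs.

```python
import torch

def fold_chunks(x: torch.Tensor, chunk_len: int, hop: int) -> torch.Tensor:
    # (batch, time) -> (batch, n_chunks, chunk_len), as DPRNN-style chunking does
    return x.unfold(dimension=-1, size=chunk_len, step=hop)

wave = torch.randn(2, 16000)
mask = torch.ones(2, 16000)
mask[1, 8000:] = 0.0  # second sample is padding after 0.5 s

# The mask must follow the exact same chunk layout as the input,
# and the RNNs would then need to respect it at every step.
wave_chunks = fold_chunks(wave, chunk_len=250, hop=125)
mask_chunks = fold_chunks(mask, chunk_len=250, hop=125)
print(wave_chunks.shape, mask_chunks.shape)  # identical chunk layouts
```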

DigitalPhoneme commented 7 months ago

Is there a way to fine-tune a squim Subjective model with my own data? What kind of data would I need, and how would I go about fine-tuning (at a high level)? Is there any documentation?

nateanl commented 7 months ago

@fullstackmedusa You need a dataset of paired waveforms and numerical MOS labels (from 1 to 5), plus another clean speech dataset to serve as non-matching references. You can find the details in https://arxiv.org/abs/2206.12285.
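
At a high level, a fine-tuning loop could look like the sketch below, assuming the `SQUIM_SUBJECTIVE` pipelines bundle; the random tensors stand in for real degraded speech, clean non-matching references, and human MOS labels:

```python
import torch
import torchaudio

# Hedged sketch of fine-tuning the Subjective model; hyperparameters and
# data are placeholders, not a recommended recipe.
model = torchaudio.pipelines.SQUIM_SUBJECTIVE.get_model()
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()

# Synthetic stand-in batch: real data would be (degraded waveform,
# clean non-matching reference, MOS label in [1, 5]) triples.
waveform = torch.randn(4, 16000 * 5)
reference = torch.randn(4, 16000 * 5)
mos_label = torch.rand(4) * 4 + 1

for step in range(3):
    pred = model(waveform, reference)  # predicted MOS, shape (4,)
    loss = loss_fn(pred, mos_label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```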