pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

stream reader that supports padded windows with correct overlap #3641

Open tcwalther opened 1 year ago

tcwalther commented 1 year ago

🚀 The feature

I've written an AudioBlockReader that wraps StreamReader to return chunks of audio that are padded left and right with valid data.

Motivation, pitch

Let's say we have an algorithm that works on spectrogram chunks of 10 seconds. We can load the entire audio signal x, apply an STFT to get the spectrogram X, then slice the spectrogram and later invert the entire spectrogram to get back the time-domain audio signal x_hat. However, when the signal is long, you risk running out of memory.

In this case you want to do it in a streaming fashion. However, when reading chunks with StreamReader, consecutive chunks don't overlap. Of course, torch.stft can pad the edges of each chunk for you, but then they're padded by reflection and not with the valid audio from the neighboring chunks.

I wrote a class AudioBlockReader that takes as parameters an audio file name, a chunk size and a pad size. It yields chunks of constant chunk_size when iterating over it, and pads chunks equally with pad_size on both sides. At the start/end of the audio file, it pads with reflect as there is no valid audio to pad with.
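The padding behavior described above can be sketched on an in-memory signal. This is only an illustration, not the actual AudioBlockReader: the names `reflect_index` and `padded_blocks` are hypothetical, samples are plain Python lists to keep the sketch dependency-free, and it assumes a final partial chunk is also filled by reflection so that every block has the same length.

```python
from typing import Iterator, List, Sequence


def reflect_index(i: int, n: int) -> int:
    """Map any integer index onto [0, n) by mirroring at the edges.

    The edge sample itself is not repeated, which matches the usual
    "reflect" padding convention.
    """
    period = 2 * (n - 1)
    i = abs(i) % period
    return i if i < n else period - i


def padded_blocks(
    signal: Sequence[float], chunk_size: int, pad_size: int
) -> Iterator[List[float]]:
    """Yield blocks of length chunk_size + 2 * pad_size (hypothetical helper).

    Interior blocks are padded with the neighboring valid samples; only the
    indices that fall outside the signal are filled by reflection.
    """
    n = len(signal)
    if n < 2:
        raise ValueError("signal needs at least two samples to reflect")
    for start in range(0, n, chunk_size):
        lo = start - pad_size
        hi = start + chunk_size + pad_size
        yield [signal[reflect_index(i, n)] for i in range(lo, hi)]


blocks = list(padded_blocks(list(range(10)), chunk_size=4, pad_size=2))
```

Here the middle block is padded purely with valid neighboring samples, while the first and last blocks fall back to reflection at the file boundaries.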

To be more precise, AudioBlockReader yields a tuple of:

I'd love to make a PR for that if there's a desire to have this in torchaudio directly.

Alternatives

No response

Additional context

No response

mthrok commented 1 year ago

Hi @tcwalther

Thanks for the suggestion. I see the value in such streaming processing. However, wrapping the entire StreamReader feels a bit too much for the end goal. It raises questions like how it would work with video streaming, or with combined audio+video streaming, etc.

What I think would be more applicable is a helper structure that does caching and padding, which can be used to achieve the goal. For example, in https://pytorch.org/audio/main/tutorials/online_asr_tutorial.html#configure-the-audio-stream, we defined a helper structure ContextCacher, which can be attached to the stream tensors coming out of StreamReader. Such an implementation is generic and independent from StreamReader, yet makes it easy to build a wrapper like AudioBlockReader.

So my suggestion is to add a helper structure similar to ContextCacher which supports padding on both sides. What do you think?
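One way such a two-sided helper might look is sketched below. This is not an existing torchaudio API: the class name `TwoSidedContextCacher` and its `flush` method are hypothetical, and chunks are plain Python lists to keep the sketch dependency-free. It assumes every chunk is longer than `pad_size`, and it has to hold back one chunk, because the right-hand padding of a chunk is only known once the next chunk arrives.

```python
from typing import List, Optional, Sequence


class TwoSidedContextCacher:
    """Hypothetical helper: pads each chunk with valid samples on both sides."""

    def __init__(self, pad_size: int):
        self.pad_size = pad_size
        self.left: Optional[List[float]] = None     # tail of the previous chunk
        self.pending: Optional[List[float]] = None  # chunk awaiting right context

    def __call__(self, chunk: Sequence[float]) -> Optional[List[float]]:
        """Feed the next chunk; return the padded previous chunk (None at first)."""
        if len(chunk) <= self.pad_size:
            raise ValueError("each chunk must be longer than pad_size")
        out = None
        if self.pending is not None:
            out = self._emit(list(chunk[: self.pad_size]))
        self.pending = list(chunk)
        return out

    def flush(self) -> Optional[List[float]]:
        """Emit the last chunk, reflecting its tail since no audio follows."""
        if self.pending is None:
            return None
        out = self._emit(None)
        self.pending, self.left = None, None
        return out

    def _emit(self, right: Optional[List[float]]) -> List[float]:
        cur, p = self.pending, self.pad_size
        # Start of stream: reflect the head of the first chunk.
        left = self.left if self.left is not None else cur[p:0:-1]
        # End of stream: reflect the tail of the last chunk.
        if right is None:
            right = cur[-2 : -p - 2 : -1]
        self.left = cur[len(cur) - p :]
        return left + cur + right


cacher = TwoSidedContextCacher(pad_size=2)
outputs = [cacher(c) for c in ([0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11])]
outputs.append(cacher.flush())
```

With tensors coming out of StreamReader, the list concatenation and reversed slices would presumably be replaced by `torch.cat` and `torch.flip`. A shorter final chunk would need extra handling that this sketch leaves out.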