stream reader that supports padded windows with correct overlap

🚀 The feature

I've written an AudioBlockReader that wraps StreamReader to return chunks of audio that are padded left and right with valid data.

Motivation, pitch

Let's say we have an algorithm that works on spectrogram chunks of 10 seconds. We can load the entire audio signal x, apply an STFT to get the spectrogram X, then slice the spectrogram and later invert the entire spectrogram to get back the time-domain audio signal x_hat. However, when your signal is long, you risk running out of memory.

In this case you want to do it in a streaming fashion. However, when reading chunks with StreamReader, you don't get any overlap between consecutive chunks. Of course, torch.stft can pad the edges for you, but then each chunk is padded by reflecting its own samples and not with the valid audio from the neighbouring chunks.
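To make the difference concrete, here is a small sketch that compares the two ways of padding one chunk. It uses random noise as a stand-in for real audio, and the sizes (n_fft, hop, start, length) are arbitrary values picked for illustration, not anything taken from AudioBlockReader.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4096)            # stand-in for a long mono signal
n_fft, hop = 256, 64             # arbitrary illustration values
window = torch.hann_window(n_fft)
pad = n_fft // 2

# Reference: STFT of the whole signal, no padding at all.
full = torch.stft(x, n_fft, hop_length=hop, window=window,
                  center=False, return_complex=True)

# One chunk cut out of the middle of the signal.
start, length = 1024, 1024
chunk = x[start:start + length]

# Option 1: let torch.stft reflect-pad the chunk edges.
reflect = torch.stft(chunk, n_fft, hop_length=hop, window=window,
                     center=True, pad_mode="reflect", return_complex=True)

# Option 2: pad the chunk with the real neighbouring samples instead.
block = x[start - pad:start + length + pad]
valid = torch.stft(block, n_fft, hop_length=hop, window=window,
                   center=False, return_complex=True)

# Frames of the full-signal STFT that cover the same positions as the block.
first_frame = (start - pad) // hop
reference = full[:, first_frame:first_frame + valid.shape[1]]

err_valid = (valid - reference).abs().max().item()
err_reflect = (reflect - reference).abs().max().item()
print(f"max error with valid padding:   {err_valid:.2e}")
print(f"max error with reflect padding: {err_reflect:.2e}")
```

With valid padding, every frame of the chunk matches the full-signal spectrogram up to floating-point error. With reflect padding, only the interior frames match; the frames near the chunk edges differ.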

I wrote a class AudioBlockReader that takes as parameters an audio file name, a chunk size and a pad size. It yields chunks of constant chunk_size when iterating over it, and pads chunks equally with pad_size on both sides. At the start/end of the audio file, it pads with reflect as there is no valid audio to pad with.

To be more precise, AudioBlockReader yields a tuple of:

a Tensor of shape (2*pad_size + chunk_size, num_channels) containing left- and right-padded audio data
an integer giving the number of valid samples in the Tensor (excluding padding). The last chunk is likely to be only partial: when valid_samples < chunk_size, you know you've reached the last chunk, and you know how many of its samples are valid.
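The output layout described above can be sketched as follows. This is not the AudioBlockReader implementation: it works on a waveform that is already in memory, purely to show the shape of the blocks and the meaning of the valid-sample count. The names reflect_pad_time and iter_padded_blocks are hypothetical, and zero-filling the tail of the last block is an assumption, since the description does not say what fills it.

```python
import torch


def reflect_pad_time(x, left, right):
    """Reflect-pad a (time, channels) tensor along time (edge sample not repeated)."""
    parts = []
    if left > 0:
        parts.append(x[1:left + 1].flip(0))
    parts.append(x)
    if right > 0:
        parts.append(x[-right - 1:-1].flip(0))
    return torch.cat(parts, dim=0)


def iter_padded_blocks(waveform, chunk_size, pad_size):
    """Yield (block, valid_samples) pairs from a (time, channels) tensor.

    Each block has shape (2 * pad_size + chunk_size, channels). The valid
    samples sit at block[pad_size:pad_size + valid_samples].
    Assumes the waveform is longer than pad_size samples.
    """
    num_samples = waveform.shape[0]
    # Only the very start and end of the signal are reflected;
    # interior blocks get real neighbouring audio as padding.
    padded = reflect_pad_time(waveform, pad_size, pad_size)
    block_len = chunk_size + 2 * pad_size
    for start in range(0, num_samples, chunk_size):
        valid_samples = min(chunk_size, num_samples - start)
        block = padded[start:start + block_len]
        if block.shape[0] < block_len:
            # Assumption: zero-fill the last block to keep the shape constant.
            fill = block.new_zeros(block_len - block.shape[0], block.shape[1])
            block = torch.cat([block, fill], dim=0)
        yield block, valid_samples


waveform = torch.arange(10, dtype=torch.float32).unsqueeze(1)  # (10, 1)
for block, valid_samples in iter_padded_blocks(waveform, chunk_size=4, pad_size=2):
    print(block[:, 0].tolist(), valid_samples)
```

Concatenating block[pad_size:pad_size + valid_samples] over all blocks gives back the original waveform, which is a convenient sanity check for any implementation of this interface.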

I'd love to make a PR for that if there's a desire to have this in torchaudio directly.

Alternatives

No response

Additional context

No response

Hi @tcwalther

Thanks for the suggestion. I see the value in such streaming processing; however, wrapping the entire StreamReader feels like too much for the end goal. It raises questions such as how it would work with video streaming, or with combined audio+video streaming, etc.

What I think would be more applicable is a helper structure that does caching and padding, which can be used to achieve the goal. For example, in https://pytorch.org/audio/main/tutorials/online_asr_tutorial.html#configure-the-audio-stream, we defined a helper structure ContextCacher, which can be applied to the stream tensors coming out of StreamReader. Such an implementation is generic and independent of StreamReader, yet makes it easy to build a wrapper like AudioBlockReader.

So my suggestion is to add a helper structure similar to ContextCacher which supports padding on both sides. What do you think?
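As a rough sketch of what such a helper could look like: the class below is hypothetical (the name TwoSidedContextCacher and its interface are made up for illustration and are not part of torchaudio). Because right context only becomes available once the next chunk arrives, it emits each chunk one step late and needs a final flush(). It assumes (time, channels) chunks that are each longer than pad_size, and it reflects at the very start and end of the stream.

```python
import torch


class TwoSidedContextCacher:
    """Hypothetical helper: pad each streamed chunk with context on both sides."""

    def __init__(self, pad_size):
        if pad_size < 1:
            raise ValueError("pad_size must be at least 1")
        self.pad_size = pad_size
        self.left = None      # left context for the pending chunk
        self.pending = None   # chunk still waiting for its right context

    def __call__(self, chunk):
        """Feed the next chunk; return the previous chunk with padding, or None."""
        out = None
        if self.pending is None:
            # First chunk of the stream: no earlier audio, so reflect its start.
            self.left = chunk[1:self.pad_size + 1].flip(0)
        else:
            right = chunk[:self.pad_size]
            out = torch.cat([self.left, self.pending, right], dim=0)
            self.left = self.pending[-self.pad_size:]
        self.pending = chunk
        return out

    def flush(self):
        """Emit the last chunk, reflecting its end as right context."""
        if self.pending is None:
            return None
        right = self.pending[-self.pad_size - 1:-1].flip(0)
        out = torch.cat([self.left, self.pending, right], dim=0)
        self.pending = None
        return out


# Simulate a stream by splitting an in-memory signal into chunks.
signal = torch.arange(10, dtype=torch.float32).unsqueeze(1)  # (10, 1)
cacher = TwoSidedContextCacher(pad_size=1)
outputs = []
for chunk in signal.split(4):
    padded = cacher(chunk)
    if padded is not None:
        outputs.append(padded)
outputs.append(cacher.flush())
for padded in outputs:
    print(padded[:, 0].tolist())
```

In a real pipeline the chunks would come from iterating over StreamReader.stream() instead of signal.split(), which keeps the helper independent of how the stream is produced.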
