Open tcwalther opened 1 year ago
Hi @tcwalther
Thanks for the suggestion. I see the value in this kind of streaming processing; however, wrapping the entire StreamReader feels like too much for the end goal. It raises questions such as how it would work with video streaming or combined audio+video streaming.
What I think would be more applicable is a helper structure that does caching and padding, which can be used to achieve the goal. For example, in https://pytorch.org/audio/main/tutorials/online_asr_tutorial.html#configure-the-audio-stream, we defined a helper structure, ContextCacher, which can be attached to the stream tensors coming out of StreamReader. Such an implementation is generic and independent of StreamReader, yet makes it easy to build a wrapper like AudioBlockReader.
So my suggestion is to add a helper structure similar to ContextCacher that supports padding on both sides. What do you think?
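To make the idea of a two-sided helper concrete, here is a minimal sketch. It is not the tutorial's ContextCacher and not an existing torchaudio API: the class name TwoSidedContextCacher and its push/flush methods are hypothetical. It uses NumPy arrays of shape (frames, channels) in place of the tensors coming out of StreamReader, and it assumes pad >= 1 and that every chunk holds more than pad frames. Because the right-hand context comes from the next chunk, the output lags the input by one chunk.

```python
import numpy as np


class TwoSidedContextCacher:
    """Hypothetical helper: adds `pad` frames of context on both sides of each chunk."""

    def __init__(self, pad: int):
        self.pad = pad
        self.left = None     # tail of the chunk before the pending one
        self.pending = None  # chunk still waiting for its right-hand context

    def _emit(self, next_chunk):
        chunk, pad = self.pending, self.pad
        # At the stream edges there is no neighbour, so mirror the chunk itself (reflect).
        left = chunk[1 : pad + 1][::-1] if self.left is None else self.left
        right = chunk[-pad - 1 : -1][::-1] if next_chunk is None else next_chunk[:pad]
        self.left = chunk[len(chunk) - pad :]
        self.pending = next_chunk
        return np.concatenate([left, chunk, right])

    def push(self, chunk):
        """Feed one chunk; returns the previous chunk padded, or None on the first call."""
        if self.pending is None:
            self.pending = chunk
            return None
        return self._emit(chunk)

    def flush(self):
        """Call once after the last chunk to emit it with a reflected right edge."""
        return self._emit(None)
```

A caller would pass each chunk from the stream to push(), skip the initial None, and call flush() once the stream is exhausted.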
🚀 The feature
I've written an AudioBlockReader that wraps StreamReader to return chunks of audio that are padded left and right with valid data.
Motivation, pitch
Let's say we have an algorithm that works on spectrogram chunks of 10 seconds. We can load the entire audio signal x, apply an STFT to get the spectrogram X, then slice the spectrogram and later invert the entire spectrogram to get back the time-domain audio signal x_hat. However, when your signal is long, you risk running out of memory.
In this case you want to do it in a streaming fashion. However, when reading chunks with StreamReader, you don't get an overlap. Of course, torch.stft can pad the edges for you, but then they're padded with reflect and not with valid audio.
I wrote a class AudioBlockReader that takes as parameters an audio file name, a chunk size and a pad size. It yields chunks of constant chunk_size when iterating over it, and pads chunks equally with pad_size on both sides. At the start/end of the audio file, it pads with reflect as there is no valid audio to pad with.
To be more precise, AudioBlockReader yields a tuple of:
- a tensor of shape (2*padding + chunk_size, num_channels) containing left- and right-padded audio data
- valid_samples, the number of valid samples in the chunk

If valid_samples < chunk_size, you know you've reached the last chunk, and you know how many samples are valid.
I'd love to make a PR for that if there's a desire to have this in torchaudio directly.
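The output contract described above can be illustrated with a small reference sketch. This is not the AudioBlockReader implementation: the function name iter_padded_blocks is hypothetical, and it works on a NumPy array that is already in memory, so it only shows the block shapes and the valid_samples convention, not the streaming itself. Filling the shortfall of the last chunk with reflected audio is an assumption made here so that every block has the same shape.

```python
import numpy as np


def iter_padded_blocks(audio, chunk_size, pad_size):
    """Yield (block, valid_samples) pairs from `audio` of shape (num_frames, num_channels)."""
    num_frames = audio.shape[0]
    num_chunks = -(-num_frames // chunk_size)  # ceiling division
    shortfall = num_chunks * chunk_size - num_frames
    # Reflect only at the file edges; interior padding is real neighbouring audio.
    # Assumption: the last chunk's shortfall is also filled by reflection.
    padded = np.pad(audio, ((pad_size, pad_size + shortfall), (0, 0)), mode="reflect")
    for i in range(num_chunks):
        start = i * chunk_size
        block = padded[start : start + chunk_size + 2 * pad_size]
        valid_samples = min(chunk_size, num_frames - start)
        yield block, valid_samples
```

A consumer can process each block and then keep only block[pad_size : pad_size + valid_samples]; concatenating those slices gives back a signal of the original length.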
Alternatives
No response
Additional context
No response