pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

[discussion] Batched CPU/GPU audio decoding / encoding #2159

Open vadimkantorov opened 2 years ago

vadimkantorov commented 2 years ago

🚀 The feature

GPU audio decoding, at least for some codecs, would be useful to make compressed audio more practical for training ASR models.

Maybe some neural codecs (I think Google has open-sourced a few) would be more amenable to batched GPU decoding and to direct integration into models.

Motivation, pitch

N/A

Alternatives

No response

Additional context

No response

mthrok commented 2 years ago

Hi

I am not aware of a GPU codec library for audio. Do you know one?

vadimkantorov commented 2 years ago

I'm also not aware of one, but maybe Lyra / SoundStream / LPCNet could be implemented on GPU (except maybe the entropy coding). Also, just recommending codecs / settings optimized for fast decoding with DataLoader in a DDP regime would benefit the community (e.g. how to configure things so that there is no thread oversubscription).
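The oversubscription point can be sketched concretely. Below is a minimal, hypothetical helper (not a torchaudio API) that caps per-worker intra-op threading, so that a DataLoader's `num_workers` processes do not each spin up a full OpenMP pool:

```python
import os

def worker_init_fn(worker_id: int) -> None:
    # Hypothetical helper intended to be passed as
    # DataLoader(worker_init_fn=worker_init_fn): cap intra-op
    # threading inside each decoding worker so that
    # num_workers x OMP threads does not oversubscribe the CPU.
    # Env vars must be set before the OpenMP/MKL pools start up;
    # when torch is available, torch.set_num_threads(1) is the
    # more reliable knob.
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["MKL_NUM_THREADS"] = "1"
```

This is the usual pattern for CPU-bound decoding in DataLoader workers: one thread per worker process, with parallelism coming from the number of workers rather than from nested thread pools.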

mthrok commented 2 years ago

We do not have a good recommendation for fast loading at the moment. Part of the reason is that the primary focus of the library has been providing domain-specific features that PyTorch does not provide natively, so we were focusing on the components rather than the pipeline. However, this is changing as the library matures and the user base grows; we do acknowledge the demand for complex yet efficient data loading, and we are aware that we lack an integral view of these components.

Having said that, there are a couple of things I am considering for efficient data loading (some of them are very rough thoughts). tl;dr: I am thinking that re-designing the whole data-loading experience will open up more choices of solutions. It seems to me that libffcv, which you mentioned in #1994, takes a similar approach.

vadimkantorov commented 9 months ago

Regarding parallelized (intra-file, for large files) audio decoding:

It might be possible to do parallelized GPU decoding of some lightly-compressed formats (e.g. FLAC, or another relatively simple audio codec designed for fast, branchless, parallelizable decoding).
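The intra-file idea can be illustrated on CPU with the stdlib alone. This sketch splits a seekable mono 16-bit PCM WAV into frame ranges and reads each range in its own thread; a real GPU FLAC decoder would instead partition at codec frame boundaries, but the seek-and-decode-chunks structure is the same:

```python
import struct
import wave
from concurrent.futures import ThreadPoolExecutor

def parallel_read_wav(path, num_chunks=4):
    """Illustrative sketch of intra-file parallel decoding (assumes a
    mono 16-bit PCM WAV): each worker opens an independent handle,
    seeks to its frame offset, and reads one chunk."""
    with wave.open(path, "rb") as w:
        n = w.getnframes()
    bounds = [(i * n // num_chunks, (i + 1) * n // num_chunks)
              for i in range(num_chunks)]

    def read_chunk(span):
        start, end = span
        with wave.open(path, "rb") as w:
            w.setpos(start)                  # seek to the chunk's first frame
            raw = w.readframes(end - start)
        return struct.unpack("<%dh" % (len(raw) // 2), raw)

    with ThreadPoolExecutor() as pool:
        chunks = pool.map(read_chunk, bounds)  # preserves chunk order
    return [s for chunk in chunks for s in chunk]
```

For uncompressed PCM this buys little over one big read, but for a codec with cheap random access the same partitioning maps naturally onto GPU thread blocks.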

It might also be good to integrate some of Facebook's neural codec code into torchaudio to widen its exposure and usage, as neural codecs are the most amenable to fast GPU-based decoding :)

vadimkantorov commented 7 months ago

Also, batched parallelized audio reading/decoding could speed up simple high-level methods like Whisper's model.transcribe(['audio1.opus', 'audio2.opus', ...]): https://github.com/openai/whisper/discussions/662#discussioncomment-7524821. Probably the right way for this would be to always return a NestedTensor as output (and allow finer control if out= is provided). It might also be interesting to support some sort of background-processing mode which returns a LazyTensor-like output immediately.
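The NestedTensor-returning shape of such an API can be sketched with PyTorch's (prototype) torch.nested module; the zero tensors below are synthetic stand-ins for decoded waveforms of different durations:

```python
import torch

def pack_decoded(waveforms):
    # Sketch only: pack variable-length decoded waveforms into one
    # batch without padding, via the prototype torch.nested API.
    return torch.nested.nested_tensor(waveforms)

# stand-ins for decoded audio, e.g. 1.0 s and 1.5 s at 16 kHz
waves = [torch.zeros(16000), torch.zeros(24000)]
batch = pack_decoded(waves)
```

Each element of the batch keeps its own length, which is exactly what a batched decode of files with different durations needs; a padded dense tensor would force the caller to track lengths separately.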

Regarding WAV loading, I don't think it can go beyond simply reading from disk in a single large chunk, as Python's wave package or scipy's scipy.io.wavfile.read already do (including via mmap, which in some cases may amortize the memory-access or disk-reading cost). I think PyTorch needs a similar built-in simple function for dealing with such simple file formats (WAV/PPM/OBJ/CSV etc.).
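The "single large chunk" strategy is short enough to show in full with the stdlib (assuming mono 16-bit PCM; real code, like scipy.io.wavfile.read with its mmap=True option, handles more sample formats):

```python
import struct
import wave

def read_wav_pcm16(path):
    # Minimal sketch mirroring what wave / scipy.io.wavfile.read do
    # for mono 16-bit PCM: one big readframes() call, then one unpack.
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        raw = w.readframes(w.getnframes())   # single large read
    return rate, list(struct.unpack("<%dh" % (len(raw) // 2), raw))
```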

mthrok commented 7 months ago

I think one needs a C/C++ thread pool library to implement true batch decoding. I have some ideas, but I feel that they are not a good fit for torchaudio or the other domain libraries.

vadimkantorov commented 7 months ago

Maybe some standard openmp threading would cut it?

Might be a better fit for this new i/o package :)

mthrok commented 7 months ago

> Maybe some standard openmp threading would cut it?

PyTorch uses OpenMP as well, so I think it is better to have a separate parallelism mechanism so that the two can be configured independently. I also feel it would be better to use different parallelism for the I/O-bound part (file access and networking) and the CPU-bound part (decoding).
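That separation can be sketched in a few lines: one pool sized for the I/O-bound stage and another for the CPU-bound stage, each configurable independently of PyTorch's own OpenMP pool. `read_file` and `decode_bytes` are hypothetical stand-ins for the real stages:

```python
from concurrent.futures import ThreadPoolExecutor

def batch_decode(paths, read_file, decode_bytes, io_workers=8, cpu_workers=4):
    # Sketch only: io_workers can be large (threads mostly block on
    # I/O) while cpu_workers is tuned to physical cores, independently
    # of each other and of torch.set_num_threads().
    with ThreadPoolExecutor(io_workers) as io_pool, \
         ThreadPoolExecutor(cpu_workers) as cpu_pool:
        raw = io_pool.map(read_file, paths)           # overlap file reads
        return list(cpu_pool.map(decode_bytes, raw))  # decode as bytes arrive
```

A production design would likely live in C++ (as suggested above) to avoid the GIL during decoding, but the two-pool structure is the same.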

> Might be a better fit for this new i/o package :)

The new i/o package under discussion is upstream of the existing domain libraries. I have a feeling that such a serious project would be better started outside of the existing context.