Open pzelasko opened 3 years ago
Hi @pzelasko
torchaudio does not do anything special to handle OPUS. SoX's OPUS integration seems to have some rough edges. In the past I have seen OPUS encoding cause a segfault as well.
So my first impression is that this results from SoX's implementation. However, a 3x slowdown is very large.
I wonder whether SoX uses a different OPUS decoder than ffmpeg? I noticed that there is some difference between the audio samples when I read the file from torchaudio and ffmpeg.
Probably yes. I briefly looked at the ffmpeg code and the OPUS code that SoX adopts, and they do not seem to share source files. (I recall that xiph.org claims somewhere on their website that some of the libraries they provide are reference implementations and not necessarily optimized.)
https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/opusenc.c https://github.com/xiph/opus/tree/master/src
Following that, I think what we can practically do (no promise at the moment) is bind ffmpeg to provide a native experience.
We also get requests for streaming and other formats, for which binding ffmpeg is a viable solution.
If you can properly bind ffmpeg into Python, that would be pretty amazing, and also as I imagine, a lot of effort.
Anyway, I'm not expecting a "fix". I just wanted to make sure you're aware (and to check in case I'm doing something obviously wrong).
It would be nice if torchaudio published some benchmarks of realistic audio decoding performance inside a DataLoader (especially in view of the improvements shown by https://ffcv.io)...
The slowdown is interesting because both SoX and ffmpeg seem to use libopus internally for decoding:
https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/libopusdec.c
This is still an issue, by the way. For a file with 5 minutes of speech, torchaudio is almost 4x slower than the other two:
opus_48k_32kbps
> torchaudio per 5 mins (secs) 2.1418089202139527
> librosa per 5 mins (secs) 0.639855684619397
> ffmpeg per 5 mins (secs) 0.582485853228718
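For anyone who wants to reproduce this kind of comparison, here is a minimal timing sketch. It is not the script that produced the numbers above; `mean_seconds` and `slowdown` are hypothetical helper names, and the loader you pass in (torchaudio, librosa, or an ffmpeg subprocess) is up to you. Plugging in the figures quoted above gives roughly 3.7x versus ffmpeg and 3.3x versus librosa.

```python
import time
from typing import Callable


def mean_seconds(load: Callable[[], object], repeats: int = 5) -> float:
    """Average wall-clock seconds per call of `load` (hypothetical helper)."""
    load()  # warm-up call so one-time costs (imports, file cache) are excluded
    start = time.perf_counter()
    for _ in range(repeats):
        load()
    return (time.perf_counter() - start) / repeats


def slowdown(slow_secs: float, fast_secs: float) -> float:
    """How many times slower `slow_secs` is than `fast_secs`."""
    return slow_secs / fast_secs


# Ratios computed from the timings quoted in this comment.
torchaudio_secs = 2.1418089202139527
librosa_secs = 0.639855684619397
ffmpeg_secs = 0.582485853228718
print(f"vs ffmpeg:  {slowdown(torchaudio_secs, ffmpeg_secs):.2f}x")
print(f"vs librosa: {slowdown(torchaudio_secs, librosa_secs):.2f}x")
```

To time a real loader, wrap it in a zero-argument callable, for example `mean_seconds(lambda: torchaudio.load(path))`, and compare the results with `slowdown`.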
This might be related to this bug:
It seems that in modern libopus, the resampler got changed to a much slower one. And I've got some repro/test in that issue.
So if ffmpeg uses a faster built-in resampler and torchaudio uses the opus-tools resampler, torchaudio might be slower.
🐛 Describe the bug
Technically it's not a bug, but it was the most fitting category. I benchmarked torchaudio vs ffmpeg for reading a long OPUS file (> 1h long, comes from GigaSpeech). It seems that it's much faster to spawn an ffmpeg process and capture its output than to use torchaudio.load(). Please see the screenshot below.
You can see the ffmpeg-based reading implementation in Lhotse here (note it's a feature branch, not merged for now): https://github.com/lhotse-speech/lhotse/blob/13500bd742160d556cefbb43e810e1fd5680f906/lhotse/audio.py#L1359-L1411
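The Lhotse link above is the actual implementation. As a rough illustration of the general idea only (spawn ffmpeg, ask for raw PCM on stdout, and parse the bytes), here is a minimal standard-library sketch. The function names are hypothetical and this is not the Lhotse code; it assumes an `ffmpeg` binary is on `PATH` and that 32-bit float little-endian output is acceptable.

```python
import array
import subprocess
import sys
from typing import List


def build_ffmpeg_cmd(path: str, sampling_rate: int, channels: int) -> List[str]:
    """Build a command that decodes `path` to raw float32 PCM on stdout."""
    return [
        "ffmpeg", "-nostdin", "-loglevel", "error",
        "-i", path,
        "-f", "f32le",            # raw little-endian float32 container
        "-acodec", "pcm_f32le",   # matching raw PCM codec
        "-ac", str(channels),
        "-ar", str(sampling_rate),
        "pipe:1",                 # write the decoded audio to stdout
    ]


def deinterleave_f32le(raw: bytes, channels: int) -> List[array.array]:
    """Split interleaved float32 little-endian PCM into one array per channel."""
    samples = array.array("f")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # the stream is little-endian regardless of host
    return [samples[c::channels] for c in range(channels)]


def read_audio_ffmpeg(path: str, sampling_rate: int = 48000,
                      channels: int = 1) -> List[array.array]:
    """Decode a file by spawning ffmpeg and capturing its stdout."""
    proc = subprocess.run(
        build_ffmpeg_cmd(path, sampling_rate, channels),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if ffmpeg fails
    )
    return deinterleave_f32le(proc.stdout, channels)
```

In practice you would likely wrap the returned samples in a NumPy array or torch tensor; that step is left out here to keep the sketch dependency-free.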
I wonder whether SoX uses a different OPUS decoder than ffmpeg? I noticed that there is some difference between the audio samples when I read the file from torchaudio and ffmpeg.
(version of code that is copy-pastable)
Versions
Collecting environment information...
PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 9.13 (stretch) (x86_64)
GCC version: (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
Clang version: 3.8.1-24 (tags/RELEASE_381/final)
CMake version: version 3.21.3
Libc version: glibc-2.10
Python version: 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-4.9.0-15-amd64-x86_64-with-debian-9.13
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
Nvidia driver version: 440.33.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.2
[pip3] torch==1.9.0
[pip3] torchaudio==0.9.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 h8f6ccaa_8 conda-forge
[conda] k2 1.9.dev20210919 cuda10.2_py3.7_torch1.9.0 k2-fsa
[conda] mkl 2021.3.0 h06a4308_520
[conda] mkl-service 2.4.0 py37h5e8e339_0 conda-forge
[conda] mkl_fft 1.3.1 py37hd3c417c_0
[conda] mkl_random 1.2.2 py37h219a48f_0 conda-forge
[conda] mypy-extensions 0.4.3 pypi_0 pypi
[conda] numpy 1.21.2 py37h20f2e39_0
[conda] numpy-base 1.21.2 py37h79a1101_0
[conda] pytorch 1.9.0 py3.7_cuda10.2_cudnn7.6.5_0 pytorch
[conda] torchaudio 0.9.0 pypi_0 pypi