OPUS reading is ~3x slower compared to ffmpeg in a subprocess

pzelasko commented 3 years ago

🐛 Describe the bug

Technically it's not a bug, but it was the most fitting category. I benchmarked torchaudio against ffmpeg for reading a long OPUS file (> 1h long, from GigaSpeech). Spawning an ffmpeg process and capturing its output turns out to be much faster than calling torchaudio.load(). Please see the screenshot below:

You can see the ffmpeg-based reading implementation in Lhotse here (note it's on a feature branch that is not merged yet): https://github.com/lhotse-speech/lhotse/blob/13500bd742160d556cefbb43e810e1fd5680f906/lhotse/audio.py#L1359-L1411
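For readers who do not want to follow the link, the general idea of reading audio through an ffmpeg subprocess can be sketched roughly as below. This is not the Lhotse code: the function names, the 48 kHz / mono defaults, and the use of the standard-library `array` module (instead of NumPy) are assumptions made to keep the example small and dependency-free.

```python
import array
import subprocess
import sys


def build_ffmpeg_cmd(path, sampling_rate=48000, channels=1):
    """Build an ffmpeg command that writes raw float32 PCM to stdout."""
    return [
        "ffmpeg",
        "-nostdin",               # never wait for keyboard input
        "-loglevel", "error",     # keep stderr quiet unless something fails
        "-i", str(path),
        "-ac", str(channels),     # output channel count (assumed default: mono)
        "-ar", str(sampling_rate),
        "-f", "f32le",            # headerless little-endian float32 samples
        "pipe:1",                 # write to stdout
    ]


def pcm_f32le_to_floats(raw):
    """Convert raw little-endian float32 bytes into an array of floats."""
    samples = array.array("f")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()
    return samples


def read_audio_ffmpeg(path, sampling_rate=48000, channels=1):
    """Decode `path` with ffmpeg (must be on PATH) and return (samples, rate)."""
    proc = subprocess.run(
        build_ffmpeg_cmd(path, sampling_rate, channels),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return pcm_f32le_to_floats(proc.stdout), sampling_rate
```

In practice one would probably hand `proc.stdout` to `numpy.frombuffer(..., dtype=numpy.float32)` rather than to `array`; the subprocess part stays the same.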

I wonder whether SoX uses a different OPUS decoder than ffmpeg? I noticed that there is some difference between the audio samples when I read the file from torchaudio and ffmpeg.
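To put a number on "some difference", a small helper like the following could be used. It is only a sketch: the name `compare_signals` and the choice of statistics are assumptions, and it expects two 1-D sequences of floats (for a torchaudio tensor of shape (channels, time), something like `samples3[0].tolist()` would do).

```python
def compare_signals(a, b):
    """Compare two 1-D sample sequences over their overlapping part."""
    n = min(len(a), len(b))
    if n == 0:
        raise ValueError("both signals must contain at least one sample")
    diffs = [abs(float(x) - float(y)) for x, y in zip(a[:n], b[:n])]
    return {
        "length_diff": len(a) - len(b),   # decoders may differ in padding or trimming
        "max_abs_diff": max(diffs),
        "mean_abs_diff": sum(diffs) / n,
    }
```

A nonzero `length_diff` would hint at different handling of padding or pre-skip, while a small `max_abs_diff` with equal lengths would point to rounding or resampling differences rather than a different signal.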

(version of code that is copy-pastable)

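# Notebook cells follow; every %%time line opens a new cell and reports that cell's run time.
%load_ext lab_black

import lhotse
from lhotse.audio import read_opus_ffmpeg, read_opus_torchaudio

# Turn off Lhotse's caching so that each read is actually performed.
lhotse.set_caching_enabled(False)

path = "/export/c27/pzelasko/gigaspeech/audio/podcast/P0000/POD1000000040.opus"

%%time
# ffmpeg in a subprocess (Lhotse feature branch)
samples2, sr2 = read_opus_ffmpeg(path=path)

%%time
# Lhotse's torchaudio-based reader
samples, sr = read_opus_torchaudio(path=path)

import torchaudio

%%time
# torchaudio called directly
samples3, sr3 = torchaudio.load(path)

%%time
# Metadata query only
_ = torchaudio.info(path)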
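The `%%time` magic only works inside a notebook. Outside one, the same comparison could be sketched with `time.perf_counter`. The helper names `time_call` and `compare_readers` are hypothetical, and taking the best of several runs is an assumption meant to reduce the effect of a cold file cache on the first read.

```python
import time


def time_call(fn, *args, repeats=3, **kwargs):
    """Return the best wall-clock time in seconds over `repeats` calls of fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


def compare_readers(readers, path, repeats=3):
    """Time every reader in `readers` (a name -> callable dict) on the same path."""
    return {name: time_call(fn, path, repeats=repeats) for name, fn in readers.items()}
```

With the functions from the cells above, a call might look like `compare_readers({"ffmpeg": lambda p: read_opus_ffmpeg(path=p), "torchaudio": torchaudio.load}, path)`.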

Versions

Collecting environment information...
PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 9.13 (stretch) (x86_64)
GCC version: (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
Clang version: 3.8.1-24 (tags/RELEASE_381/final)
CMake version: version 3.21.3
Libc version: glibc-2.10

Python version: 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-4.9.0-15-amd64-x86_64-with-debian-9.13
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti

Nvidia driver version: 440.33.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.2
[pip3] torch==1.9.0
[pip3] torchaudio==0.9.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 h8f6ccaa_8 conda-forge
[conda] k2 1.9.dev20210919 cuda10.2_py3.7_torch1.9.0 k2-fsa
[conda] mkl 2021.3.0 h06a4308_520
[conda] mkl-service 2.4.0 py37h5e8e339_0 conda-forge
[conda] mkl_fft 1.3.1 py37hd3c417c_0
[conda] mkl_random 1.2.2 py37h219a48f_0 conda-forge
[conda] mypy-extensions 0.4.3 pypi_0 pypi
[conda] numpy 1.21.2 py37h20f2e39_0
[conda] numpy-base 1.21.2 py37h79a1101_0
[conda] pytorch 1.9.0 py3.7_cuda10.2_cudnn7.6.5_0 pytorch
[conda] torchaudio 0.9.0 pypi_0 pypi

mthrok commented 2 years ago

Hi @pzelasko

torchaudio does not do anything special to handle OPUS. SoX's OPUS integration seems to have some rough edges. In the past I have also seen OPUS encoding cause a segfault.

So my first impression is that this results from SoX's implementation. However, 3x is a very large gap.

> I wonder whether SoX uses a different OPUS decoder than ffmpeg? I noticed that there is some difference between the audio samples when I read the file from torchaudio and ffmpeg.

Probably yes. I briefly looked at the ffmpeg code and at the OPUS code that SoX uses, and they do not seem to share source files. (I recall that xiph.org claims somewhere on their website that some of the libraries they provide are reference implementations and not necessarily optimized.)

https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/opusenc.c
https://github.com/xiph/opus/tree/master/src

mthrok commented 2 years ago

Following up on that, I think what we can practically do (no promise at the moment) is bind ffmpeg to provide a native experience.

We also get requests for streaming and for other formats, for which binding ffmpeg is a viable solution.

pzelasko commented 2 years ago

If you can properly bind ffmpeg into Python, that would be pretty amazing, and also, I imagine, a lot of effort.

Anyway, I'm not expecting a "fix"; I just wanted to make sure you're aware (and to check in case I'm doing something obviously wrong).

vadimkantorov commented 2 years ago

It would be nice if torchaudio published some benchmarks of realistic audio decoding performance inside a DataLoader (especially in view of the improvements from https://ffcv.io)...

vadimkantorov commented 2 years ago

The slowdown is interesting because both SoX and ffmpeg seem to use libopus internally for decoding:

https://github.com/FFmpeg/FFmpeg/blob/master/libavcodec/libopusdec.c

https://github.com/dmkrepo/libsox/blob/master/src/opus.c
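One way to probe this would be to force ffmpeg's decoder choice and time each variant. The sketch below assumes that the local ffmpeg build exposes both a built-in decoder named `opus` and the `libopus` wrapper (this depends on how ffmpeg was compiled and can be checked with `ffmpeg -decoders`). The helper names are made up for illustration.

```python
import subprocess
import time


def build_decode_cmd(path, decoder=None):
    """Build an ffmpeg command that decodes `path` and discards the output."""
    cmd = ["ffmpeg", "-nostdin", "-loglevel", "error"]
    if decoder is not None:
        # -c:a placed before -i is an input option, so it selects the decoder.
        cmd += ["-c:a", decoder]
    # "-f null -" runs the decode but writes no output file.
    cmd += ["-i", str(path), "-f", "null", "-"]
    return cmd


def time_decoder(path, decoder=None):
    """Wall-clock seconds for one full decode (requires ffmpeg on PATH)."""
    start = time.perf_counter()
    subprocess.run(build_decode_cmd(path, decoder), check=True)
    return time.perf_counter() - start
```

If `time_decoder(path, "libopus")` and `time_decoder(path, "opus")` came out close to each other, the decoder itself would be an unlikely cause of the gap, and the surrounding code (resampling, I/O, format conversion) would be the next thing to look at.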

ozancaglayan commented 1 year ago

This is still an issue, by the way. For a file with 5 minutes of speech, torchaudio is almost 4x slower than the other two:

opus_48k_32kbps
 > torchaudio per 5 mins (secs) 2.1418089202139527
 > librosa per 5 mins (secs) 0.639855684619397
 > ffmpeg per 5 mins (secs) 0.582485853228718
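For reference, the ratios implied by these three numbers can be worked out as follows. The timings are copied from the output above; the helper name `slowdown_vs` is just for illustration.

```python
# Seconds needed to decode 5 minutes of audio, as reported above.
timings = {
    "torchaudio": 2.1418089202139527,
    "librosa": 0.639855684619397,
    "ffmpeg": 0.582485853228718,
}


def slowdown_vs(timings, baseline):
    """Express every timing as a multiple of the baseline's timing."""
    return {name: t / timings[baseline] for name, t in timings.items()}


for baseline in ("ffmpeg", "librosa"):
    ratio = slowdown_vs(timings, baseline)["torchaudio"]
    print(f"torchaudio vs {baseline}: {ratio:.1f}x")
```

That works out to roughly 3.7x relative to ffmpeg and 3.3x relative to librosa.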

vadimkantorov commented 7 months ago

This might be related to this bug:

https://github.com/xiph/opus-tools/issues/87

It seems that in modern libopus, the resampler got changed to a much slower one. I've put a repro/test in that issue.

So if ffmpeg uses a faster built-in resampler and torchaudio uses the opus-tools resampler, torchaudio might be slower.

pytorch / audio

OPUS reading is ~3x slower compared to ffmpeg in a subprocess #1994
