pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

read mp3 file fail #2867

Open Tungway1990 opened 1 year ago

Tungway1990 commented 1 year ago

🐛 Describe the bug

I am trying to load Common Voice mp3 files with torchaudio using the code below:

import torchaudio
array, sampling_rate = torchaudio.load(path_or_file, format="mp3")

I get an empty output:

Out[4]: tensor([], size=(1, 0))

I traced the root cause to this code in soundfile_backend.py:

    with soundfile.SoundFile(filepath, "r") as file_:
        if file_.format != "WAV" or normalize:
            dtype = "float32"

After changing float32 to float64, the array loads correctly:

tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ..., -2.9407e-05,
         -3.2597e-05, -2.5751e-05]], dtype=torch.float64)

I have attached an mp3 file for reference: common_voice_zh-HK_20096730.zip

The ffmpeg version I am using is 5.1.2.

Thanks.

Versions

Collecting environment information...
PyTorch version: 1.12.0
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro
GCC version: Could not collect
Clang version: Could not collect
CMake version: version 3.24.0-rc3
Libc version: N/A

Python version: 3.9.12 (main, Apr 4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: True
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to:
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090 Ti
Nvidia driver version: 516.94
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\bin\cudnn_ops_train64_8.dll
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.3
[pip3] numpydoc==1.2
[pip3] pytorchvideo==0.1.5
[pip3] torch==1.12.0
[pip3] torch-geometric==2.0.4
[pip3] torch-geometric-temporal==0.53.0
[pip3] torch-scatter==2.0.9
[pip3] torch-sparse==0.6.13
[pip3] torchaudio==0.12.0
[pip3] torchfile==0.1.0
[pip3] torchvision==0.13.0
[conda] Could not collect

mthrok commented 1 year ago

Hi @Tungway1990

Thanks for the report. This issue originates from the soundfile package.

I can reproduce it with soundfile alone, without torchaudio.

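import soundfile as sf

# source is the path to the attached mp3 file
data, samplerate = sf.read(source)  # default dtype is float64
print(data.shape, data.dtype)

data, samplerate = sf.read(source, dtype="float32")
print(data.shape, data.dtype)

Output:

(200704,) float64
(0,) float32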

A similar issue has already been reported, and the root cause appears to be libsndfile: https://github.com/bastibe/python-soundfile/issues/349

We could special-case mp3: load it as float64 first, then convert it to float32 if necessary. @pytorch/team-audio-core Any thoughts?

Tungway1990 commented 1 year ago

Yup, I agree that this is a third-party issue.

ggold7046 commented 1 year ago

Hello guys, I'm interested in contributing to PyTorch. I am learning Python. Is there any way I can contribute to the project? Could anyone please guide me?

mthrok commented 1 year ago

We need to add special handling for MP3 so that it is loaded as float64 first, then converted to the dtype required by the client code.

The relevant code is around here:

https://github.com/pytorch/audio/blob/1717edaa8cddf5068df97e30404d85654f0b55f4/torchaudio/backend/soundfile_backend.py#L206-L211
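As a rough illustration of the "load as float64, then convert" idea, here is a minimal standalone sketch. It is not the torchaudio implementation: the function name `load_via_float64` and its `reader` parameter are hypothetical, introduced only so the conversion step can be exercised without an mp3 file. By default it calls `soundfile.read`, which requires a libsndfile build with MP3 support.

```python
import numpy as np


def load_via_float64(path, reader=None):
    """Read audio as float64, then downcast to float32.

    `reader` is a hypothetical injection point (an assumption of this sketch);
    when omitted, soundfile.read is imported lazily and used.
    """
    if reader is None:
        import soundfile as sf  # third-party; needs an MP3-capable libsndfile

        reader = sf.read

    # Reading as float64 sidesteps the empty result seen with dtype="float32".
    data, sample_rate = reader(path, dtype="float64", always_2d=True)

    # soundfile returns (frames, channels); torchaudio uses (channels, frames).
    waveform = np.ascontiguousarray(data.T, dtype=np.float32)
    return waveform, sample_rate
```

The resulting array can be wrapped with `torch.from_numpy(waveform)` to get a float32 tensor. The cost of this approach is a temporary float64 buffer twice the size of the final array.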

Tungway1990 commented 1 year ago

This is my own implementation; feel free to take a look:


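with soundfile.SoundFile(filepath, "r") as file_:
    # Non-WAV formats (including MP3) and normalized reads go through float64,
    # because reading MP3 directly as float32 returns an empty array.
    read_as_float64 = file_.format != "WAV" or normalize
    if read_as_float64:
        dtype = "float64"
    elif file_.subtype not in _SUBTYPE2DTYPE:
        raise ValueError(f"Unsupported subtype: {file_.subtype}")
    else:
        dtype = _SUBTYPE2DTYPE[file_.subtype]

    frames = file_._prepare_read(frame_offset, None, num_frames)
    waveform = file_.read(frames, dtype, always_2d=True)
    if read_as_float64:
        # Downcast only this path so that WAV reads with normalize=False
        # keep their native dtype.
        waveform = waveform.astype("float32")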