pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
BSD 3-Clause "New" or "Revised" License

Loading audio files from archives #760

Open nikvaessen opened 2 years ago

nikvaessen commented 2 years ago

🐛 Describe the bug

I've been playing around with torchdata as a replacement for the webdataset library. My main use-case is reading data from network-attached file systems (such as ceph), which implies streaming from e.g. .tar files, which is something webdataset is designed for.

The code below assumes the following directory layout (attached as data.zip):

├── file
│   ├── 19-198-0000.flac
│   └── 19-198-0000.wav
├── tar
│   ├── flac.tar
│   └── wav.tar
└── zip
    ├── flac.zip
    └── wav.zip

Each .zip or .tar archive contains the 19-198-0000.flac or 19-198-0000.wav file, respectively, taken from the LibriSpeech dataset.

From my reading of the documentation, this seems to be the easiest way to read from the archive:

import torchaudio.backend.sox_io_backend as tab

from torchdata.datapipes.iter import (
    FileLister,
    FileOpener,
    TarArchiveLoader,
    ZipArchiveLoader,
    Mapper,
)

def audio_stream_to_tensor(element):
    path, stream = element

    audio_tensor, sample_rate = tab.load(stream)

    return audio_tensor

dp = FileLister(".", masks=["wav.tar"], recursive=True)
dp = FileOpener(dp, mode="b")
dp = TarArchiveLoader(dp, mode="r")
dp = Mapper(dp, audio_stream_to_tensor)

for x in dp:
    print(x) # tensor([[0.0044, 0.0033, 0.0031,  ..., 0.0047, 0.0060, 0.0060]])

This works :)! However, it fails when we try to read flac.tar:

dp = FileLister(".", masks=["flac.tar"], recursive=True)
dp = FileOpener(dp, mode="b")
dp = TarArchiveLoader(dp, mode="r")
dp = Mapper(dp, audio_stream_to_tensor)

for x in dp:
    print(x)
formats: can't open input file `': FLAC ERROR whilst decoding metadata
Traceback (most recent call last):
  File "/home/nik/phd/repo/librispeech/playground/example.py", line 35, in <module>
    for x in dp:
  File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/_typing.py", line 514, in wrap_generator
    response = gen.send(None)
  File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 116, in __iter__
    yield self._apply_fn(data)
  File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 81, in _apply_fn
    return self.fn(data)
  File "/home/nik/phd/repo/librispeech/playground/example.py", line 15, in audio_stream_to_tensor
    audio_tensor, sample_rate = tab.load(stream)
  File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torchaudio/backend/sox_io_backend.py", line 220, in load
    return _fallback_load_fileobj(filepath, frame_offset, num_frames, normalize, channels_first, format)
  File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torchaudio/io/_compat.py", line 109, in load_audio_fileobj
    s = torchaudio._torchaudio_ffmpeg.StreamReaderFileObj(src, format, None, 4096)
RuntimeError: Failed to open the input "StreamWrapper<<ExFileObject name='./tar/flac.tar'>>" (Invalid data found when processing input).
This exception is thrown by __iter__ of MapperIterDataPipe(datapipe=TarArchiveLoaderIterDataPipe, fn=audio_stream_to_tensor, input_col=None, output_col=None)

Similarly for ZipArchiveLoader, reading from wav.zip works, while flac.zip returns a similar error:

formats: can't open input file `': FLAC ERROR whilst decoding metadata
Traceback (most recent call last):
  File "/home/nik/phd/repo/librispeech/playground/example.py", line 35, in <module>
    for x in dp:
  File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/_typing.py", line 514, in wrap_generator
    response = gen.send(None)
  File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 116, in __iter__
    yield self._apply_fn(data)
  File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 81, in _apply_fn
    return self.fn(data)
  File "/home/nik/phd/repo/librispeech/playground/example.py", line 15, in audio_stream_to_tensor
    audio_tensor, sample_rate = tab.load(stream)
  File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torchaudio/backend/sox_io_backend.py", line 220, in load
    return _fallback_load_fileobj(filepath, frame_offset, num_frames, normalize, channels_first, format)
  File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torchaudio/io/_compat.py", line 109, in load_audio_fileobj
    s = torchaudio._torchaudio_ffmpeg.StreamReaderFileObj(src, format, None, 4096)
RuntimeError: Failed to open the input "StreamWrapper<<zipfile.ZipExtFile name='19-198-0000.flac' mode='r' compress_type=deflate>>" (Invalid data found when processing input).
This exception is thrown by __iter__ of MapperIterDataPipe(datapipe=ZipArchiveLoaderIterDataPipe, fn=audio_stream_to_tensor, input_col=None, output_col=None)

Moreover, adding torchaudio.info to the map function also leads to the same issue for .wav files:

def audio_stream_to_tensor_and_meta(element):
    path, stream = element

    meta = tab.info(stream)
    print(meta)
    audio_tensor, sample_rate = tab.load(stream)

    return audio_tensor, meta

dp = FileLister(".", masks=["wav.tar"], recursive=True)
dp = FileOpener(dp, mode="b")
dp = TarArchiveLoader(dp, mode="r")
dp = Mapper(dp, audio_stream_to_tensor_and_meta)

for x in dp:
    print(x)
AudioMetaData(sample_rate=16000, num_frames=31440, num_channels=1, bits_per_sample=16, encoding=PCM_S)
formats: can't determine type of file `'
Traceback (most recent call last):
  File "/home/nik/phd/repo/librispeech/playground/example.py", line 36, in <module>
    for x in dp:
  File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/_typing.py", line 514, in wrap_generator
    response = gen.send(None)
  File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 116, in __iter__
    yield self._apply_fn(data)
  File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 81, in _apply_fn
    return self.fn(data)
  File "/home/nik/phd/repo/librispeech/playground/example.py", line 25, in audio_stream_to_tensor_and_meta
    audio_tensor, sample_rate = tab.load(stream)
  File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torchaudio/backend/sox_io_backend.py", line 220, in load
    return _fallback_load_fileobj(filepath, frame_offset, num_frames, normalize, channels_first, format)
  File "/home/nik/phd/repo/librispeech/.venv/lib/python3.10/site-packages/torchaudio/io/_compat.py", line 109, in load_audio_fileobj
    s = torchaudio._torchaudio_ffmpeg.StreamReaderFileObj(src, format, None, 4096)
RuntimeError: Failed to open the input "StreamWrapper<<ExFileObject name='./tar/wav.tar'>>" (Invalid data found when processing input).
This exception is thrown by __iter__ of MapperIterDataPipe(datapipe=TarArchiveLoaderIterDataPipe, fn=audio_stream_to_tensor_and_meta, input_col=None, output_col=None)

So I assume the issue stems from the stream provided by torchdata not being seekable, or at least from the buffer not being large enough?
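One way to test the seekability hypothesis without torchaudio is to ask the member streams directly. The following is a minimal sketch using only the standard library `tarfile` module on an in-memory archive; the helper names `make_tar` and `describe_member_streams` and the dummy payloads are made up for illustration, and the streams torchdata hands out are wrapped in a `StreamWrapper`, so results there may differ.

```python
import io
import tarfile


def make_tar(members):
    """Build an in-memory tar archive from a {name: bytes} mapping (illustrative helper)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def describe_member_streams(archive_file):
    """Return (name, seekable, start_position) for each regular file in the archive."""
    report = []
    with tarfile.open(fileobj=archive_file, mode="r") as archive:
        for member in archive:
            if not member.isfile():
                continue
            stream = archive.extractfile(member)
            report.append((member.name, stream.seekable(), stream.tell()))
    return report


# Dummy payloads: only the stream properties matter here, not the audio content.
report = describe_member_streams(make_tar({"a.wav": b"RIFF....", "b.flac": b"fLaC...."}))
for name, seekable, position in report:
    print(name, seekable, position)
```

With mode="r" on a seekable underlying file, `tarfile` yields seekable member streams. The pipe-style modes such as "r|" are meant for non-seekable sources and do not allow random seeking, so the answer depends on the mode in use.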

Versions

PyTorch version: 1.12.1+cu102 Is debug build: False CUDA used to build PyTorch: 10.2 ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64) GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 Clang version: Could not collect CMake version: version 3.16.3 Libc version: glibc-2.31

Python version: 3.10.4 (main, Apr 20 2022, 11:26:44) [GCC 9.4.0] (64-bit runtime) Python platform: Linux-5.15.0-46-generic-x86_64-with-glibc2.31 Is CUDA available: True CUDA runtime version: 11.5.119 GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3070 Nvidia driver version: 495.29.05 cuDNN version: Could not collect HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True

Versions of relevant libraries: [pip3] mypy-extensions==0.4.3 [pip3] numpy==1.23.2 [pip3] torch==1.12.1 [pip3] torchaudio==0.12.1 [pip3] torchdata==0.4.1 [conda] Could not collect

nikvaessen commented 2 years ago

I've also tried the soundfile backend. Soundfile can read the .flac file correctly from the stream, but it fails when we call info() on the stream before load().

ejguan commented 2 years ago

RuntimeError: Failed to open the input "StreamWrapper<<zipfile.ZipExtFile name='19-198-0000.flac' mode='r' compress_type=deflate>>" (Invalid data found when processing input).

Based on the traceback, I think this comes down to what input type torchaudio expects. It would help us to understand what tab.load supports: can it load inner file streams from a tar archive? cc: @mthrok

Regarding your comment about seekability: the tar file stream, at least, should be seekable, so I assume this is not the root cause.

As a workaround, could you read data from the opened file stream directly before sending to tab.load?

def audio_stream_to_tensor_and_meta(element):
    path, stream = element

    data = b"".join(stream)

    meta = tab.info(data)
    audio_tensor, sample_rate = tab.load(data)

    return audio_tensor, meta

nikvaessen commented 2 years ago

Thanks for your comment.

As a workaround, could you read data from the opened file stream directly before sending to tab.load?

Your code sample throws the following errors:

(for wav)

Traceback (most recent call last):
  File "/home/nik/phd/repo/data_utility/playground/example.py", line 39, in <module>
    for x in dp:
  File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/_typing.py", line 514, in wrap_generator
    response = gen.send(None)
  File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 116, in __iter__
    yield self._apply_fn(data)
  File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 81, in _apply_fn
    return self.fn(data)
  File "/home/nik/phd/repo/data_utility/playground/example.py", line 17, in audio_stream_to_tensor
    audio_tensor, sample_rate = tab.load(data)
  File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torchaudio/backend/sox_io_backend.py", line 227, in load
    return _fallback_load(filepath, frame_offset, num_frames, normalize, channels_first, format)
  File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torchaudio/io/_compat.py", line 97, in load_audio
    s = torch.classes.torchaudio.ffmpeg_StreamReader(src, format, None)
RuntimeError

(for flac)

Traceback (most recent call last):
  File "/home/nik/phd/repo/data_utility/playground/example.py", line 39, in <module>
    for x in dp:
  File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/_typing.py", line 514, in wrap_generator
    response = gen.send(None)
  File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 116, in __iter__
    yield self._apply_fn(data)
  File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 81, in _apply_fn
    return self.fn(data)
  File "/home/nik/phd/repo/data_utility/playground/example.py", line 17, in audio_stream_to_tensor
    audio_tensor, sample_rate = tab.load(data)
  File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torchaudio/backend/sox_io_backend.py", line 227, in load
    return _fallback_load(filepath, frame_offset, num_frames, normalize, channels_first, format)
  File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torchaudio/io/_compat.py", line 97, in load_audio
    s = torch.classes.torchaudio.ffmpeg_StreamReader(src, format, None)
RuntimeError: Failed to open the input "fLaC
This exception is thrown by __iter__ of MapperIterDataPipe(datapipe=TarArchiveLoaderIterDataPipe, fn=audio_stream_to_tensor, input_col=None, output_col=None)

However, simply calling stream.seek(0) between tab.info() and tab.load() solves the issue for both TarArchiveLoader and ZipArchiveLoader. Is this something that is worth documenting?

Moreover, loading .flac files remains an issue for the sox_io backend. But I guess that now seems to be an issue related to torchaudio?

mthrok commented 2 years ago

However, simply calling stream.seek(0) between tab.info() and tab.load() solves the issue for both TarArchiveLoader and ZipArchiveLoader. Is this something that is worth documenting?

info consumes some bytes from the file-like object, so calling load afterwards fails unless the position of the input file object is reset first.
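The effect can be reproduced with the standard library alone. In this minimal sketch the `wave` module stands in for torchaudio: `read_num_frames` plays the role of an `info`-style call and `read_frames` the role of a `load`-style call. Both names, the generated silent WAV, and the in-memory archive are assumptions made for illustration, not torchaudio's implementation.

```python
import io
import tarfile
import wave


def make_wav_bytes(num_frames=160, sample_rate=16000):
    """Create a small mono 16-bit WAV file of silence, entirely in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(sample_rate)
        writer.writeframes(b"\x00\x00" * num_frames)
    return buffer.getvalue()


def make_tar_with_file(name, payload):
    """Wrap a single file in an in-memory tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def read_num_frames(stream):
    """Stand-in for an info-style call: parses the header and leaves the cursor mid-file."""
    with wave.open(stream, "rb") as reader:
        return reader.getnframes()


def read_frames(stream):
    """Stand-in for a load-style call: expects the cursor at the start of the file."""
    with wave.open(stream, "rb") as reader:
        return reader.readframes(reader.getnframes())


archive_file = make_tar_with_file("19-198-0000.wav", make_wav_bytes())
with tarfile.open(fileobj=archive_file, mode="r") as archive:
    stream = archive.extractfile("19-198-0000.wav")

    num_frames = read_num_frames(stream)
    position_after_info = stream.tell()  # no longer 0: the header has been consumed

    try:
        read_frames(stream)  # starts parsing in the middle of the file
        second_read_failed = False
    except (wave.Error, EOFError):
        second_read_failed = True

    stream.seek(0)  # rewind, as suggested in the thread
    frames = read_frames(stream)

print(num_frames, position_after_info, second_read_failed, len(frames))
```

The second parse fails only because the cursor sits past the header; after `stream.seek(0)` the same stream decodes normally.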

Moreover, loading .flac files remains an issue for the sox_io backend. But I guess that now seems to be an issue related to torchaudio?

Several reports were filed recently about loading FLAC from file-like objects. I haven't looked into the details yet, but in the meantime I think an ffmpeg-based solution could work. Can you tell me what happens if you replace the load function with torchaudio.io._compat.load_audio_fileobj?

nikvaessen commented 2 years ago

Replacing load with torchaudio.io._compat.load_audio_fileobj results in the flac stream correctly loading.

nikvaessen commented 2 years ago

Similarly, replacing info with torchaudio.io._compat.info_audio_fileobj(stream, format='flac') results in the flac stream info loading.

AudioMetaData(sample_rate=16000, num_frames=0, num_channels=1, bits_per_sample=16, encoding=FLAC)

However, num_frames=0 is incorrect.

Using info(stream, format='flac') does work, but it also prints an error message (and num_frames=0 is wrong):

def audio_stream_to_tensor_and_meta(element):
    path, stream = element

    meta = torchaudio.info(stream, format='flac')
    stream.seek(0)
    audio_tensor, sample_rate = torchaudio.io._compat.load_audio_fileobj(stream)

    return audio_tensor, meta
formats: can't open input file `': FLAC ERROR whilst decoding metadata
tensor([[0.0044, 0.0033, 0.0031,  ..., 0.0047, 0.0060, 0.0060]])
AudioMetaData(sample_rate=16000, num_frames=0, num_channels=1, bits_per_sample=16, encoding=FLAC)

Using only info(stream), without format="flac":

formats: can't open input file `': FLAC ERROR whilst decoding metadata
Traceback (most recent call last):
  File "/home/nik/phd/repo/data_utility/playground/example.py", line 41, in <module>
    for x in dp:
  File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/_typing.py", line 514, in wrap_generator
    response = gen.send(None)
  File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 116, in __iter__
    yield self._apply_fn(data)
  File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py", line 81, in _apply_fn
    return self.fn(data)
  File "/home/nik/phd/repo/data_utility/playground/example.py", line 26, in audio_stream_to_tensor_and_meta
    meta = torchaudio.info(stream)
  File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torchaudio/backend/sox_io_backend.py", line 99, in info
    return _fallback_info_fileobj(filepath, format)
  File "/home/nik/phd/repo/data_utility/.venv/lib/python3.10/site-packages/torchaudio/io/_compat.py", line 35, in info_audio_fileobj
    s = torchaudio._torchaudio_ffmpeg.StreamReaderFileObj(src, format, None, 4096)
RuntimeError: Failed to open the input "StreamWrapper<<ExFileObject name='./tar/flac.tar'>>" (Invalid data found when processing input).
This exception is thrown by __iter__ of MapperIterDataPipe(datapipe=TarArchiveLoaderIterDataPipe, fn=audio_stream_to_tensor_and_meta, input_col=None, output_col=None)

FFMPEG output of the file:

$ ffmpeg -i playground/file/19-198-0000.flac 
...
Input #0, flac, from 'playground/file/19-198-0000.flac':
  Duration: 00:00:01.97, start: 0.000000, bitrate: 177 kb/s
    Stream #0:0: Audio: flac, 16000 Hz, mono, s16

Reading from the file directly:

torchaudio.info("19-198-0000.flac")
AudioMetaData(sample_rate=16000, num_frames=31440, num_channels=1, bits_per_sample=16, encoding=FLAC)

saicoco commented 2 years ago

Maybe you can try this: use stream.file_obj.read() to get the bytes:

def audio_stream_to_tensor_and_meta(element):
    path, stream = element
    stream = stream.file_obj.read()
    ...
    return audio_tensor, meta
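If raw bytes are read out of the archive, wrapping them in `io.BytesIO` gives downstream code a seekable file-like object rather than a bytes value; the earlier traceback beginning `Failed to open the input "fLaC` suggests the raw bytes may have been interpreted as a path. The sketch below is an illustration under assumptions: it uses the standard library `tarfile` stream and a hypothetical helper `to_seekable_buffer`, and it calls `.read()` on the stream itself rather than on the `file_obj` attribute mentioned above.

```python
import io
import tarfile


def to_seekable_buffer(stream):
    """Read the whole stream into memory and return an independent, seekable BytesIO."""
    return io.BytesIO(stream.read())


# Build a tiny in-memory archive with a dummy payload (not real FLAC audio).
payload = b"fLaC" + bytes(range(16))
archive_file = io.BytesIO()
with tarfile.open(fileobj=archive_file, mode="w") as archive:
    info = tarfile.TarInfo(name="19-198-0000.flac")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))
archive_file.seek(0)

with tarfile.open(fileobj=archive_file, mode="r") as archive:
    buffer = to_seekable_buffer(archive.extractfile("19-198-0000.flac"))

# The buffer outlives the archive and can be rewound between consumers.
first_pass = buffer.read(4)
buffer.seek(0)
second_pass = buffer.read()
print(first_pass, len(second_pass))
```

The trade-off is memory: each file is held fully in RAM, which is usually acceptable for short utterances but may not be for long recordings.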