Accessing audio dataset value throws Format not recognised error

fawazahmed0 commented 2 weeks ago

Describe the bug

Accessing audio dataset value throws Format not recognised error

Steps to reproduce the bug

code:

from datasets import load_dataset

dataset = load_dataset("fawazahmed0/bug-audio")

for data in dataset["train"]:
    print(data)

output:

(mypy) C:\Users\Nawaz-Server\Documents\ml>python myest.py
[C:\vcpkg\buildtrees\mpg123\src\0d8db63f9b-3db975bc05.clean\src\libmpg123\layer3.c:INT123_do_layer3():1801] error: dequantization failed!
{'audio': {'path': 'C:\\Users\\Nawaz-Server\\.cache\\huggingface\\hub\\datasets--fawazahmed0--bug-audio\\snapshots\\fab1398431fed1c0a2a7bff0945465bab8b5daef\\data\\Ghamadi\\037135.mp3', 'array': array([ 0.00000000e+00, -2.86519935e-22, -2.56504911e-21, ...,
       -1.94239747e-02, -2.42924765e-02, -2.99104657e-02]), 'sampling_rate': 22050}, 'reciter': 'Ghamadi', 'transcription': 'الا عجوز ا في الغبرين', 'line': 3923, 'chapter': 37, 'verse': 135, 'text': 'إِلَّا عَجُوزࣰ ا فِي ٱلۡغَٰبِرِينَ'}
Traceback (most recent call last):
  File "C:\Users\Nawaz-Server\Documents\ml\myest.py", line 5, in <module>
    for data in dataset["train"]:
                ~~~~~~~^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\arrow_dataset.py", line 2372, in __iter__
    formatted_output = format_table(
                       ^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\formatting\formatting.py", line 639, in format_table
    return formatter(pa_table, query_type=query_type)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\formatting\formatting.py", line 403, in __call__
    return self.format_row(pa_table)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\formatting\formatting.py", line 444, in format_row
    row = self.python_features_decoder.decode_row(row)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\formatting\formatting.py", line 222, in decode_row
    return self.features.decode_example(row) if self.features else row
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\features\features.py", line 2042, in decode_example
    column_name: decode_nested_example(feature, value, token_per_repo_id=token_per_repo_id)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\features\features.py", line 1403, in decode_nested_example
    return schema.decode_example(obj, token_per_repo_id=token_per_repo_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\datasets\features\audio.py", line 184, in decode_example
    array, sampling_rate = sf.read(f)
                           ^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\soundfile.py", line 285, in read
    with SoundFile(file, 'r', samplerate, channels,
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\soundfile.py", line 658, in __init__
    self._file = self._open(file, mode_int, closefd)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\soundfile.py", line 1216, in _open
    raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
soundfile.LibsndfileError: Error opening <_io.BufferedReader name='C:\\Users\\Nawaz-Server\\.cache\\huggingface\\hub\\datasets--fawazahmed0--bug-audio\\snapshots\\fab1398431fed1c0a2a7bff0945465bab8b5daef\\data\\Ghamadi\\037136.mp3'>: Format not recognised.

Expected behavior

Everything should work fine, as loading the problematic audio file directly with soundfile package works fine

code:

import soundfile as sf

print(sf.read('C:\\Users\\Nawaz-Server\\.cache\\huggingface\\hub\\datasets--fawazahmed0--bug-audio\\snapshots\\fab1398431fed1c0a2a7bff0945465bab8b5daef\\data\\Ghamadi\\037136.mp3'))

output:

(mypy) C:\Users\Nawaz-Server\Documents\ml>python myest.py
[C:\vcpkg\buildtrees\mpg123\src\0d8db63f9b-3db975bc05.clean\src\libmpg123\layer3.c:INT123_do_layer3():1801] error: dequantization failed!
(array([ 0.00000000e+00, -8.43723821e-22, -2.45370628e-22, ...,
       -7.71464454e-03, -6.90496899e-03, -8.63333419e-03]), 22050)

Environment info

datasets version: 3.0.2
Platform: Windows-11-10.0.22621-SP0
Python version: 3.12.7
huggingface_hub version: 0.26.2
PyArrow version: 17.0.0
Pandas version: 2.2.3
fsspec version: 2024.10.0
soundfile: 0.12.1

lhoestq commented 2 weeks ago

Hi ! can you try if this works ?

import soundfile as sf

with open('C:\\Users\\Nawaz-Server\\.cache\\huggingface\\hub\\datasets--fawazahmed0--bug-audio\\snapshots\\fab1398431fed1c0a2a7bff0945465bab8b5daef\\data\\Ghamadi\\037136.mp3', 'rb') as f:
    print(sf.read(f))

fawazahmed0 commented 2 weeks ago

@lhoestq Same error, here is the output:

(mypy) C:\Users\Nawaz-Server\Documents\ml>python myest.py
Traceback (most recent call last):
  File "C:\Users\Nawaz-Server\Documents\ml\myest.py", line 5, in <module>
    print(sf.read(f))
          ^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\soundfile.py", line 285, in read
    with SoundFile(file, 'r', samplerate, channels,
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\soundfile.py", line 658, in __init__
    self._file = self._open(file, mode_int, closefd)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Nawaz-Server\.conda\envs\mypy\Lib\site-packages\soundfile.py", line 1216, in _open
    raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
soundfile.LibsndfileError: Error opening <_io.BufferedReader name='C:\\Users\\Nawaz-Server\\.cache\\huggingface\\hub\\datasets--fawazahmed0--bug-audio\\snapshots\\fab1398431fed1c0a2a7bff0945465bab8b5daef\\data\\Ghamadi\\037136.mp3'>: Format not recognised.

fawazahmed0 commented 1 week ago

upstream bug: https://github.com/bastibe/python-soundfile/issues/439

huggingface / datasets

Accessing audio dataset value throws Format not recognised error #7276

Describe the bug

Steps to reproduce the bug

Expected behavior

Environment info