mdeff / fma

FMA: A Dataset For Music Analysis
https://arxiv.org/abs/1612.01840
MIT License

Corrupted files in fma_small #70

Open JakubK opened 6 months ago

JakubK commented 6 months ago

I'm not sure if it's the right call, but I have encountered issues with some samples when working with fma_small.

Reproduction:

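import librosa
from tqdm import tqdm

# AUDIO_DIR, get_audio_path and train are defined earlier in the session
corrupted_indices = []
for i, audio_id in enumerate(tqdm(train)):
    try:
        # Load audio file
        y, sr = librosa.load(get_audio_path(AUDIO_DIR, audio_id))
    except Exception:
        # Record the position of every sample that fails to decode
        print("There was a problem with", audio_id)
        corrupted_indices.append(i)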

Here the train variable holds the IDs of all fma_small samples labelled "train". For some samples, librosa.load fails. For example:

y, sr = librosa.load(get_audio_path(AUDIO_DIR, 133297))

Produces:

LibsndfileError                           Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/librosa/core/audio.py in load(path, sr, mono, offset, duration, dtype, res_type)
    174         try:
--> 175             y, sr_native = __soundfile_load(path, offset, duration, dtype)
    176 

7 frames
LibsndfileError: Error opening 'fma_small/133/133297.mp3': File does not exist or is not a regular file (possibly a pipe?).

During handling of the above exception, another exception occurred:

NoBackendError                            Traceback (most recent call last)
<decorator-gen-119> in __audioread_load(path, offset, duration, dtype)

/usr/local/lib/python3.10/dist-packages/audioread/__init__.py in audio_open(path, backends)
    130 
    131     # All backends failed!
--> 132     raise NoBackendError()

NoBackendError:

When I check my Colab session, I can see that the mp3 file is actually present at the specified location. However, the downloaded file is surprisingly small, and trying to play it crashes my audio player.
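Since the affected file is reported as unusually small, a size scan can flag suspects without decoding any audio. The following is a minimal sketch, not part of the FMA tooling: the helper name find_small_files and the 100 kB default threshold are assumptions chosen for illustration, and a small file is only a hint that something is wrong, not proof.

```python
from pathlib import Path


def find_small_files(audio_dir, min_bytes=100_000):
    """Return (path, size_in_bytes) pairs for .mp3 files below min_bytes.

    min_bytes is an arbitrary illustrative threshold, not a value
    documented by the dataset; tune it for your copy of the data.
    """
    small = []
    # rglob walks nested folders such as fma_small/133/133297.mp3
    for path in sorted(Path(audio_dir).rglob("*.mp3")):
        size = path.stat().st_size
        if size < min_bytes:
            small.append((path, size))
    return small
```

Calling find_small_files(AUDIO_DIR) and printing the result gives a short list of candidates to inspect by hand before running the slower librosa.load loop.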

The problem does not occur for most of the files. The test and validation subsets are clean.

Problematic IDs that I have spotted:

133297, 108925, 99134
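One possible workaround is to drop these IDs from the split before loading any audio. This is a minimal sketch under the assumption that the split is an iterable of integer track IDs. The names BAD_IDS and drop_bad_ids are hypothetical, and the set contains only the three IDs listed above, so it may not be exhaustive.

```python
# IDs reported in this thread as failing to load (possibly not exhaustive).
BAD_IDS = {133297, 108925, 99134}


def drop_bad_ids(track_ids, bad_ids=BAD_IDS):
    """Return track_ids in their original order, skipping any ID in bad_ids."""
    return [tid for tid in track_ids if tid not in bad_ids]
```

For example, train = drop_bad_ids(train) before the loading loop would skip the three files; any other ID that fails would still need to be caught by the try/except.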

johndpope commented 5 months ago

Related: https://github.com/andsfonseca/text-to-music/blob/b0775d4726a904c961296ccd13f7e160a92df510/src/datasets/fma_dataset.py#L28

allispaul commented 4 months ago

These are known issues, see here.