mdeff / fma

FMA: A Dataset For Music Analysis
https://arxiv.org/abs/1612.01840
MIT License

Corrupted files in FMA Large #49

Closed by keunwoochoi 3 years ago

keunwoochoi commented 3 years ago

I didn't double-check, but I couldn't open the files with these indices on Linux (ffmpeg/librosa). Just wanted to share so that others get some hints.

2624,
3284,
8669,
10116,
11583,
12838,
13529,
14116,
14180,
20814,
22554,
23429,
23430,
23431,
25173,
25174,
25175,
25176,
25180,
29345,
29346,
29352,
29356,
33411,
33413,
33414,
33417,
33418,
33419,
33425,
35725,
39363,
41745,
42986,
43753,
50594,
50782,
53668,
54569,
54582,
61480,
61822,
63422,
63997,
72656,
72980,
73510,
80553,
82699,
84503,
84504,
84522,
84524,
86656,
86659,
86661,
86664,
87057,
90244,
90245,
90247,
90248,
90250,
90252,
90253,
90442,
90445,
91206,
92479,
94052,
94234,
95253,
96203,
96207,
96210,
98105,
98562,
101265,
101272,
101275,
102241,
102243,
102247,
102249,
102289,
106409,
106412,
106415,
106628,
108920,
109266,
110236,
115610,
117441,
127928,
129207,
129800,
130328,
130748,
130751,
131545,
133641,
133647,
134887,
140449,
140450,
140451,
140452,
140453,
140454,
140455,
140456,
140457,
140458,
140459,
140460,
140461,
140462,
140463,
140464,
140465,
140466,
140467,
140468,
140469,
140470,
140471,
140472,
142614,
144518,
144619,
145056,
146056,
147419,
147424,
148786,
148787,
148788,
148789,
148790,
148791,
148792,
148793,
148794,
148795,
151920,
155051,

mdeff commented 3 years ago

Thanks for reporting. I've checked some (002/002624.mp3, 084/084522.mp3, 101/101265.mp3, 140/140449.mp3, 148/148795.mp3) and could open them with librosa (v0.8.0) and ffmpeg (v4.3.1), and listen to them with mpv (v0.32.0), also on Linux. What did you try exactly?

Do also check that your local copy isn't corrupted. You can get the checksum of an audio file with sha1sum 002/002624.mp3, then check that it matches the entry recorded in the checksums file. Or check them all at once with sha1sum -c checksums.
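For anyone without sha1sum at hand, the same check can be sketched in Python with the standard library. This is a minimal sketch, not part of the dataset tooling: the function names `sha1_of_file` and `find_mismatches` are hypothetical, and it assumes the checksums file uses the usual sha1sum layout (hex digest, whitespace, relative path) and sits in the dataset root.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(root, checksums_name="checksums"):
    """Yield relative paths that are missing or whose SHA-1 differs.

    Assumes each line looks like '<sha1>  <relative/path>' (sha1sum output).
    """
    root = Path(root)
    with open(root / checksums_name) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            expected, rel_path = line.split(maxsplit=1)
            # sha1sum marks binary mode with a leading '*' before the path
            rel_path = rel_path.strip().lstrip("*")
            target = root / rel_path
            if not target.is_file() or sha1_of_file(target) != expected:
                yield rel_path
```

Calling `list(find_mismatches("fma_large"))` would then return the files whose local copy differs from the recorded checksum; an empty list means the download itself is intact.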

mdeff commented 3 years ago

Also, when I created the dataset, I did extract features (with librosa) from all tracks in fma_full.

keunwoochoi commented 3 years ago

Hm, this is interesting. I can't really figure it out and ended up ignoring those files. Anyway..

$ sha1sum 002/002624.mp3
5e421474f0cbcf35648753fe1fd3cc22788d1bbe  002/002624.mp3
fma_large $ grep 002/002624.mp3 checksums
5e421474f0cbcf35648753fe1fd3cc22788d1bbe  002/002624.mp3

So the file is correct.

$ ffmpeg
ffmpeg version 4.1.6-1~deb10u1 Copyright (c) 2000-2020 the FFmpeg developers
  built with gcc 8 (Debian 8.3.0-6)
  configuration: --prefix=/usr --extra-version='1~deb10u1' --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 22.100 / 56. 22.100
  libavcodec     58. 35.100 / 58. 35.100
  libavformat    58. 20.100 / 58. 20.100
  libavdevice    58.  5.100 / 58.  5.100
  libavfilter     7. 40.101 /  7. 40.101
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  3.100 /  5.  3.100
  libswresample   3.  3.100 /  3.  3.100
  libpostproc    55.  3.100 / 55.  3.100
Hyper fast Audio and Video encoder
usage: ffmpeg [options] [[infile options] -i infile]... {[outfile options] outfile}...

Use -h to get full help or, even better, run 'man ffmpeg'

ffmpeg is installed

$ ls -l 002/002624.mp3
-r--r--r-- 1 <REDACTED> 1563 Apr  1  2017 002/002624.mp3

1563 bytes seems very small...
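A file size this small suggests a cheap first pass: list every MP3 below some size before attempting to decode anything. The sketch below assumes the three-digit folder layout seen above (002/002624.mp3); the helper name `find_tiny_files` and the 10,000-byte threshold are assumptions for illustration, not values from the dataset.

```python
from pathlib import Path

# Assumed cutoff for "suspiciously small"; tune it for your own copy.
MIN_BYTES = 10_000


def find_tiny_files(root, min_bytes=MIN_BYTES, pattern="*/*.mp3"):
    """Return sorted (relative_path, size_in_bytes) pairs for small files."""
    root = Path(root)
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.glob(pattern)
        if p.is_file() and p.stat().st_size < min_bytes
    )
```

A small size alone doesn't prove a file is unreadable, so the result is only a list of candidates to try loading.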

And here is the error in Python.

$ python3
Python 3.7.3 (default, Apr  3 2019, 05:39:12)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import librosa
>>> _ = librosa.load('002/002624.mp3')
[REDACTED]/python3.7/site-packages/librosa/core/audio.py:162: UserWarning: PySoundFile failed. Trying audioread instead.
  warnings.warn("PySoundFile failed. Trying audioread instead.")
Traceback (most recent call last):
  File "[REDACTED]/python3.7/site-packages/librosa/core/audio.py", line 146, in load
    with sf.SoundFile(path) as sf_desc:
  File "[REDACTED]/python3.7/site-packages/soundfile.py", line 629, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "[REDACTED]/python3.7/site-packages/soundfile.py", line 1184, in _open
    "Error opening {0!r}: ".format(self.name))
  File "[REDACTED]/python3.7/site-packages/soundfile.py", line 1357, in _error_check
    raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
RuntimeError: Error opening '002/002624.mp3': File contains data in an unknown format.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "[REDACTED]/python3.7/site-packages/librosa/core/audio.py", line 163, in load
    y, sr_native = __audioread_load(path, offset, duration, dtype)
  File "[REDACTED]/python3.7/site-packages/librosa/core/audio.py", line 187, in __audioread_load
    with audioread.audio_open(path) as input_file:
  File "[REDACTED]/python3.7/site-packages/audioread/__init__.py", line 116, in audio_open
    raise NoBackendError()
audioread.exceptions.NoBackendError

FYI, I have libsndfile1 installed on the machine.

mdeff commented 3 years ago

My bad, I was checking fma_full instead of fma_large... I can reproduce. It's a known issue, but we didn't have a list for fma_large yet. I've added yours. Thanks!

I think the list is incomplete, however, as it should be a superset of the fma_small and fma_medium lists. For example, I get the same issue with 099/099134.mp3, which is not in your list. Don't you?

keunwoochoi commented 3 years ago

No problem! And you're right, I only included the files that exist only in FMA Large. I assumed all the files in FMA small and medium are also in FMA large, so in my code I ignore all the corrupted files listed for FMA small, medium, and large.
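Roughly, the filtering amounts to taking the union of the per-subset lists and dropping those IDs. This is a minimal sketch rather than my actual code: the function name `usable_track_ids` is hypothetical, and the example lists are shortened for illustration (which subset list 99134 belongs to is an assumption).

```python
def usable_track_ids(all_ids, *corrupted_lists):
    """Return the IDs in all_ids that appear in none of the corrupted lists."""
    corrupted = set().union(*corrupted_lists)
    return [tid for tid in all_ids if tid not in corrupted]


# Shortened, illustrative lists (not the complete ones).
large_only = [2624, 3284, 8669]   # first entries of the list in this issue
smaller_subsets = [99134]         # assumed to come from the small/medium lists

kept = usable_track_ids([2, 5, 2624, 99134], large_only, smaller_subsets)
print(kept)  # [2, 5]
```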

mdeff commented 3 years ago

I see, so we now have complete lists for the three subsets. Thanks!

mdeff commented 3 years ago

Does the list also contain tracks that are shorter than 30s but load fine? Or don't you ignore those?

keunwoochoi commented 3 years ago

It probably doesn't contain those files. The list contains only the files that raised an error when I tried to load the audio.
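For reference, a list like this can be rebuilt with a loop along the following lines. It is a sketch under assumptions, not the exact script I ran: `audio_path` and `find_unloadable` are hypothetical helpers, the path layout is inferred from names such as 002/002624.mp3, and librosa.load is used as the default loader only if none is passed in.

```python
from pathlib import Path


def audio_path(audio_dir, track_id):
    """Map a track ID to its file, e.g. 2624 -> <audio_dir>/002/002624.mp3."""
    name = f"{track_id:06d}"
    return Path(audio_dir) / name[:3] / f"{name}.mp3"


def find_unloadable(audio_dir, track_ids, loader=None):
    """Return (track_id, exception_name) for every track the loader rejects."""
    if loader is None:
        import librosa  # imported lazily so the helper works with other loaders
        loader = librosa.load
    failed = []
    for tid in track_ids:
        try:
            loader(str(audio_path(audio_dir, tid)))
        except Exception as exc:  # e.g. audioread's NoBackendError
            failed.append((tid, type(exc).__name__))
    return failed
```

Since only load errors are recorded, a clip that decodes fine but is shorter than 30s would not show up in the result.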

mdeff commented 3 years ago

Ok, thanks for confirming.