mdeff / fma

FMA: A Dataset For Music Analysis
https://arxiv.org/abs/1612.01840
MIT License

erroneous ID3 tag info #27

Open ejhumphrey opened 6 years ago

ejhumphrey commented 6 years ago

I'm not sure if this relates to #4, but I've found that sox (at least on Debian) tries to work out file duration from the reported bitrate. Unfortunately for me, the reported bitrate is way off for at least ≈90 tracks (of the 100k+), and probably wrong for another couple hundred. The particularly bad tracks claim bitrates in excess of "100M", which sox (at least) parses as bits per second. For comparison, 16-bit stereo WAV at 44.1 kHz is about 1.4 Mbps.
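To make the comparison concrete, here is a minimal sketch of the arithmetic. It assumes a decoder estimates duration as file size divided by the reported bit rate; whether sox does exactly this is not confirmed here. The helper names and the 30-second, 128 kbps clip are hypothetical.

```python
def pcm_bitrate(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Bit rate of uncompressed PCM audio, in bits per second."""
    return sample_rate_hz * bit_depth * channels


def duration_from_bitrate(file_size_bytes: int, bitrate_bps: float) -> float:
    """Duration (seconds) inferred from file size and a reported bit rate."""
    return file_size_bytes * 8 / bitrate_bps


# 16-bit stereo PCM at 44.1 kHz: 1,411,200 bps, i.e. about 1.4 Mbps.
cd_bitrate = pcm_bitrate(44_100, 16, 2)

# Hypothetical 30-second clip encoded at a constant 128 kbps.
size_bytes = 30 * 128_000 // 8

true_duration = duration_from_bitrate(size_bytes, 128_000)        # correct bit rate
bogus_duration = duration_from_bitrate(size_bytes, 100_000_000)   # bogus "100M" tag

print(cd_bitrate, true_duration, bogus_duration)
```

With the bogus value, the same file appears to last a small fraction of a second, which is why a wrong bit rate in the tags matters for anything that trusts it.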

The list of suspicious file IDs is here, if anyone wants to double-check or confirm. The extension is txt, but the content is JSON: each key (a file ID) points to the sox-reported bitrate.
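A rough sketch of how such a list could be checked, assuming the JSON maps file IDs to bitrate strings with `k`/`M`/`G` suffixes (the exact value format in the linked file is an assumption). The IDs in the example and the 1.4 Mbps ceiling are hypothetical choices, not taken from the dataset.

```python
import json

# Assumption: 16-bit 44.1 kHz stereo PCM is a generous ceiling for an MP3 bit rate.
MAX_PLAUSIBLE_BPS = 44_100 * 16 * 2

# Assumed suffix convention for the reported bit rate strings.
_SUFFIXES = {"k": 1_000, "M": 1_000_000, "G": 1_000_000_000}


def parse_bitrate(text) -> float:
    """Convert a string such as '128k' or '100M' to bits per second."""
    text = str(text).strip()
    if text and text[-1] in _SUFFIXES:
        return float(text[:-1]) * _SUFFIXES[text[-1]]
    return float(text)


def suspicious_tracks(json_text: str, max_bps: float = MAX_PLAUSIBLE_BPS) -> dict:
    """Return the entries whose reported bit rate exceeds max_bps."""
    reported = json.loads(json_text)
    return {tid: rate for tid, rate in reported.items()
            if parse_bitrate(rate) > max_bps}


# Made-up file IDs and values, for illustration only.
example = '{"000002": "256k", "000005": "100M", "000010": "1.2G"}'
print(suspicious_tracks(example))
```

For a file on disk, the text can be read with `open(path).read()` regardless of the `.txt` extension, since `json.loads` only looks at the content.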

More fortunately, removing all the ID3 tags fixes the issue. I'd propose perhaps exporting all ID3 tags to a static dump over the collection (per #4), and then removing all the ID3 tags to sanitize the collection.

mdeff commented 4 years ago

Thanks for the investigation @ejhumphrey.

> I'd propose perhaps exporting all ID3 tags to a static dump over the collection (per #4), and then removing all the ID3 tags to sanitize the collection.

Seems like a good solution. Do you know of any other metadata that should be cleaned or removed to sanitize such an audio collection?