huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Return the audio filename when decoding fails due to corrupt files #5947

Open wetdog opened 1 year ago

wetdog commented 1 year ago

Feature request

Return the audio filename when audio decoding fails. Although there are currently some library-version checks for the mp3 and opus formats, decoding can still fail in other cases, e.g. when a file is corrupt.

Motivation

When you try to load an object file dataset and the decoding fails, you can't tell which file is corrupt:


raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
soundfile.LibsndfileError: Error opening <_io.BytesIO object at 0x7f5ab7e38290>: Format not recognised.

Your contribution

Make a PR that adds exception handling for LibsndfileError, so that the audio filename or path is returned when soundfile decoding fails.

lhoestq commented 1 year ago

Hi! The audio data doesn't always exist as files on disk; the blobs are often stored in the Arrow files. For now I'd suggest disabling decoding with .cast_column("audio", Audio(decode=False)) and applying your own decoding that handles corrupted files (maybe to filter them out?)

cc @sanchit-gandhi since it's related to our discussion about allowing users to make decoding return None and show a warning when there are corrupted files
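A minimal sketch of that suggestion, assuming each undecoded audio entry is a dict with "bytes" and "path" keys once decoding is disabled. The names safe_decode and read_wav are hypothetical. The standard-library wave reader is used only to keep the sketch dependency-free; in practice you would likely pass a soundfile-based decoder and catch sf.LibsndfileError instead of a broad Exception.

```python
import io
import warnings
import wave


def read_wav(buffer):
    """Stand-in decoder: return (raw_frames, sampling_rate) for a WAV stream."""
    with wave.open(buffer, "rb") as wav:
        return wav.readframes(wav.getnframes()), wav.getframerate()


def safe_decode(audio, decode=read_wav):
    """Decode one undecoded audio entry; warn and return None on failure.

    `audio` is assumed to look like {"bytes": b"...", "path": "clip.wav"}.
    """
    try:
        if audio.get("bytes") is not None:
            # Blob stored inline (e.g. in the Arrow file)
            source = io.BytesIO(audio["bytes"])
        else:
            # Fall back to reading the file from disk
            source = open(audio["path"], "rb")
        with source:
            return decode(source)
    except Exception as err:  # assumption: narrow this to your decoder's error type
        warnings.warn(f"Could not decode {audio.get('path')!r}: {err}")
        return None
```

After .cast_column("audio", Audio(decode=False)), something like safe_decode(example["audio"]) could be applied per example, and entries that come back as None can then be filtered out. The warning carries the path, which is the piece of information the original traceback was missing.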

wetdog commented 1 year ago

Thanks @lhoestq, I wasn't aware of the decode flag. As you say, it makes more sense to show a warning when there are corrupted files, together with some metadata about the file that allows filtering them out of the dataset.

My workaround was to catch the LibsndfileError and generate a dummy audio with an unusual sample rate so I could filter it out later. However, returning None seems better.

import numpy as np
import soundfile as sf

try:
    array, sampling_rate = sf.read(file)
except sf.LibsndfileError:
    # Dummy audio with a sentinel rate, filtered out later
    print("bad file")
    array = np.array([0.0])
    sampling_rate = 99.000
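The "filter it later" step of this workaround could look roughly like the sketch below. It assumes the decoded results were collected as a list of dicts with "array" and "sampling_rate" keys; the record layout and the names SENTINEL_RATE and drop_bad_audio are hypothetical.

```python
SENTINEL_RATE = 99.0  # same value as `99.000` in the snippet above


def drop_bad_audio(records, sentinel=SENTINEL_RATE):
    """Keep only records whose sampling rate is not the sentinel value."""
    return [r for r in records if r["sampling_rate"] != sentinel]


records = [
    {"array": [0.1, 0.2], "sampling_rate": 16000},
    {"array": [0.0], "sampling_rate": SENTINEL_RATE},  # dummy entry for a bad file
    {"array": [0.3], "sampling_rate": 44100},
]
clean = drop_bad_audio(records)
```

A sentinel only works as long as no real file uses that sample rate, which is one reason returning None is the less ambiguous signal.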