Open wetdog opened 1 year ago
Hi! The audio data doesn't always exist as files on disk; the blobs are often stored in the Arrow files. For now I'd suggest disabling decoding with .cast_column("audio", Audio(decode=False)) and applying your own decoding that handles corrupted files (maybe to filter them out?)
cc @sanchit-gandhi since it's related to our discussion about allowing users to make decoding return None and show a warning when there are corrupted files
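A minimal sketch of that suggestion, assuming each undecoded entry is a dict with "bytes" and "path" keys (the shape `Audio(decode=False)` produces). The helper names `safe_decode` and `drop_corrupted` are hypothetical, not part of `datasets`. The `read` parameter exists only so the helper can be exercised without real audio files.

```python
import io
import warnings


def safe_decode(audio, read=None):
    """Decode one undecoded audio entry, or return None if decoding fails.

    `audio` is assumed to look like {"bytes": <bytes or None>, "path": <str or None>}.
    """
    if read is None:
        import soundfile as sf  # imported lazily so a stub reader can be injected

        read = sf.read
    # Prefer the embedded bytes, since the file may not exist on disk.
    if audio.get("bytes") is not None:
        source = io.BytesIO(audio["bytes"])
    else:
        source = audio["path"]
    try:
        array, sampling_rate = read(source)
    except RuntimeError as err:  # sf.LibsndfileError is a RuntimeError subclass
        warnings.warn(f"Could not decode audio {audio.get('path')!r}: {err}")
        return None
    return {"array": array, "sampling_rate": sampling_rate, "path": audio.get("path")}


def drop_corrupted(dataset, column="audio"):
    """Return the dataset without the rows whose audio cannot be decoded."""
    from datasets import Audio

    # Disable built-in decoding so a corrupted file does not abort iteration.
    raw = dataset.cast_column(column, Audio(decode=False))
    return raw.filter(lambda example: safe_decode(example[column]) is not None)
```

The warning includes whatever path is stored with the entry, which may be None when only the bytes were kept in the Arrow file.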
Thanks @lhoestq, I wasn't aware of the decode flag. As you say, it makes more sense to show a warning when there are corrupted files, together with some file metadata that makes it possible to filter them out of the dataset.
My workaround was to catch the LibsndfileError and generate a dummy audio with an unusual sample rate so I could filter it out later. However, returning None seems better.
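    import numpy as np
    import soundfile as sf

    try:
        array, sampling_rate = sf.read(file)
    except sf.LibsndfileError:
        print("bad file")
        # Dummy audio with an unusual sample rate, used to filter this entry out later.
        array = np.array([0.0])
        sampling_rate = 99.000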
Feature request
Return the audio filename when audio decoding fails. Although there are currently some library-version checks for the mp3 and opus formats, there are still cases where audio decoding can fail, e.g. a corrupt file.
Motivation
When you try to load an object file dataset and the decoding fails, you can't know which file is corrupt.
Your contribution
Make a PR to add exception handling for LibsndfileError that returns the audio filename or path when soundfile decoding fails.
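One possible shape for such a change, as a rough sketch only: wrap the soundfile call and re-raise with the path attached. `AudioDecodeError` and `read_audio` are hypothetical names, not existing `datasets` or `soundfile` APIs. The `read` parameter is there so the wrapper can be tried with a stub reader.

```python
class AudioDecodeError(RuntimeError):
    """Hypothetical error type that carries the path of the failing file."""

    def __init__(self, path, cause):
        super().__init__(f"Failed to decode audio file {path!r}: {cause}")
        self.path = path


def read_audio(path, read=None):
    """Read an audio file, naming the file in the error if decoding fails."""
    if read is None:
        import soundfile as sf  # imported lazily so a stub reader can be injected

        read = sf.read
    try:
        return read(path)
    except RuntimeError as err:  # sf.LibsndfileError is a RuntimeError subclass
        # Chain the original error so the libsndfile message is preserved.
        raise AudioDecodeError(path, err) from err
```

A caller can then catch `AudioDecodeError` and use its `path` attribute to log or skip the offending file.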