huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.19k stars, 2.68k forks

TIMIT won't load after manual download: Errors about files that don't exist #4439

Closed drscotthawley closed 2 years ago

drscotthawley commented 2 years ago

Describe the bug

I get a message from HuggingFace that TIMIT must be downloaded manually. The URL provided in the message led me to the UPenn page for manual download. (UPenn apparently wants $250? for the dataset??) ...So, ok, I obtained a copy from a friend and also a smaller version from Kaggle. But in both cases the HF dataloader fails because it is looking for files that don't exist anywhere in the dataset: files with lower-case names like "*test" (all the filenames in both my copies are uppercase) and with certain file extensions, which exclude the .DOC extension used in TIMIT.

Steps to reproduce the bug

from datasets import load_dataset

data = load_dataset('timit_asr', 'clean')['train']

Expected results

The dataset should load with no errors.

Actual results

This error message:

  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/datasets/data_files.py", line 201, in resolve_patterns_locally_or_by_urls
    raise FileNotFoundError(error_msg)
FileNotFoundError: Unable to resolve any data file that matches '['**test*', '**eval*']' at /home/ubuntu/datasets/timit with any supported extension ['csv', 'tsv', 'json', 'jsonl', 'parquet', 'txt', 'blp', 'bmp', 'dib', 'bufr', 'cur', 'pcx', 'dcx', 'dds', 'ps', 'eps', 'fit', 'fits', 'fli', 'flc', 'ftc', 'ftu', 'gbr', 'gif', 'grib', 'h5', 'hdf', 'png', 'apng', 'jp2', 'j2k', 'jpc', 'jpf', 'jpx', 'j2c', 'icns', 'ico', 'im', 'iim', 'tif', 'tiff', 'jfif', 'jpe', 'jpg', 'jpeg', 'mpg', 'mpeg', 'msp', 'pcd', 'pxr', 'pbm', 'pgm', 'ppm', 'pnm', 'psd', 'bw', 'rgb', 'rgba', 'sgi', 'ras', 'tga', 'icb', 'vda', 'vst', 'webp', 'wmf', 'emf', 'xbm', 'xpm', 'zip']

But this is a strange sort of error: why is it looking for lower-case file names when all the TIMIT dataset filenames are uppercase? Why does it exclude .DOC files when the only parts of the TIMIT dataset with "TEST" in their names have ".DOC" extensions? ...I wonder, how was anyone able to get this to work in the first place?

The files in the dataset look like the following:

│       PHONCODE.DOC
│       PROMPTS.TXT
│       SPKRINFO.TXT
│       SPKRSENT.TXT
│       TESTSET.DOC

...so why are these being excluded by the dataset loader?
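The mismatch described above can be reproduced in isolation. The following is only a minimal sketch using Python's standard `fnmatch` module, not the resolution code that `datasets` actually runs. The patterns and extensions are copied from the error message (the extension set is abbreviated), and the helper names `name_matches` and `has_supported_extension` are hypothetical. It assumes both the name patterns and the extension check are case-sensitive, which is consistent with the error but not confirmed in this thread.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns quoted in the FileNotFoundError above.
PATTERNS = ["**test*", "**eval*"]
# Abbreviated subset of the "supported extensions" list from the same error.
SUPPORTED_EXTENSIONS = {"csv", "tsv", "json", "jsonl", "parquet", "txt", "zip"}

# File names from the TIMIT listing above.
FILES = ["PHONCODE.DOC", "PROMPTS.TXT", "SPKRINFO.TXT", "SPKRSENT.TXT", "TESTSET.DOC"]


def name_matches(name, ignore_case=False):
    """Hypothetical helper: does `name` match any pattern?

    Note: fnmatch treats '**' like a single '*'; it does not recurse
    into directories the way a real glob implementation would.
    """
    candidate = name.lower() if ignore_case else name
    return any(fnmatchcase(candidate, pattern) for pattern in PATTERNS)


def has_supported_extension(name, ignore_case=False):
    """Hypothetical helper: is the file extension in the supported set?"""
    candidate = name.lower() if ignore_case else name
    return PurePosixPath(candidate).suffix.lstrip(".") in SUPPORTED_EXTENSIONS


for f in FILES:
    print(
        f,
        "| name match:", name_matches(f),
        "| name match (ignoring case):", name_matches(f, ignore_case=True),
        "| supported extension (ignoring case):", has_supported_extension(f, ignore_case=True),
    )
```

Under these assumptions, none of the listed files passes both checks. `TESTSET.DOC` would match `**test*` only if case were ignored, and its `.DOC` extension is not in the supported list either way.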

Environment info

albertvillanova commented 2 years ago

To have some context, please see:

Please, also note that we have recently made some fixes to the script, which are in our GitHub master branch but not yet released:

drscotthawley commented 2 years ago

Thanks Albert! I'll try pulling datasets from the git repo instead of PyPI, and/or just wait for the next release.

albertvillanova commented 2 years ago

I'm closing this issue then. Please feel free to reopen it if the problem persists.