huggingface / datasets


Some of DownloadConfig's properties are always being overridden in load.py #7097

Open ductai199x opened 3 months ago

Describe the bug

The extract_compressed_file and force_extract properties of DownloadConfig are always set to True in the dataset_module_factory function in load.py, regardless of the values the caller passes in. As a result, previously extracted data is ignored and the archives are extracted again every time the dataset is loaded.
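
The effect can be illustrated without the library itself. The following is a minimal sketch, not the actual code from load.py: ConfigSketch and factory_sketch are hypothetical stand-ins for DownloadConfig and dataset_module_factory, and they only mirror the unconditional assignments described above.

```python
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for DownloadConfig (only the two fields at issue)."""

    extract_compressed_file: bool = False
    force_extract: bool = False


def factory_sketch(config: ConfigSketch) -> ConfigSketch:
    """Hypothetical stand-in for dataset_module_factory.

    Mirrors the reported behavior: both flags are assigned unconditionally,
    so whatever the caller passed in is lost.
    """
    config.extract_compressed_file = True
    config.force_extract = True
    return config


# The caller explicitly asks for no forced extraction...
user_config = ConfigSketch(force_extract=False)
result = factory_sketch(user_config)

# ...but the value has been overwritten by the time it is used.
print(result.force_extract)  # prints True
```

Because the assignment happens after the caller's config is received, there is no value the caller can pass that survives it.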

Steps to reproduce the bug

  1. Have a local dataset that contains archive files (zip, tar.gz, etc.)
  2. Write a dataset loading script that downloads and extracts these files
  3. Call load_dataset with a DownloadConfig that explicitly sets force_extract to False
  4. The extraction process starts regardless of whether the archives were extracted previously

Expected behavior

The extraction process should not run when the archives were previously extracted and force_extract is set to False.
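
One way to express this expectation is sketched below. It is an illustration under stated assumptions, not a proposed patch: ConfigSketch, factory_sketch and should_extract are hypothetical names, and the choice to keep the forced values only when the caller supplies no config is an assumption made for this example.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for DownloadConfig (only the two fields at issue)."""

    extract_compressed_file: bool = False
    force_extract: bool = False


def factory_sketch(config: Optional[ConfigSketch] = None) -> ConfigSketch:
    """Hypothetical factory that respects a caller-supplied config."""
    if config is None:
        # No caller preference: keep the values that are forced today.
        return ConfigSketch(extract_compressed_file=True, force_extract=True)
    # Caller supplied a config: return a copy with its flags left untouched.
    return replace(config)


def should_extract(already_extracted: bool, force_extract: bool) -> bool:
    """Extract when forced, or when no previous extraction exists."""
    return force_extract or not already_extracted


config = factory_sketch(ConfigSketch(force_extract=False))
# Archives were extracted on an earlier run and force_extract is False,
# so the extraction step is skipped.
print(should_extract(already_extracted=True, force_extract=config.force_extract))  # prints False
```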

Environment info

datasets==2.20.0, Python 3.9