huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
18.98k stars, 2.62k forks

ValueError: Couldn't infer the same data file format for all splits. Got {'train': ('json', {}), 'validation': (None, {})} #6930

Open CLL112 opened 3 months ago

CLL112 commented 3 months ago

Describe the bug

When I run the following code, I get an error:

en = load_dataset("allenai/c4", "en", streaming=True)

The traceback ends with:

raise ValueError(f"Couldn't infer the same data file format for all splits. Got {split_modules}")
ValueError: Couldn't infer the same data file format for all splits. Got {'train': ('json', {}), 'validation': (None, {})}

However, loading a single validation file explicitly works fine:

dataset = load_dataset('allenai/c4', streaming=True, data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation')

What is the issue here?
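Since passing data_files explicitly works for one validation file, the same idea can be extended to both splits. The sketch below is only a possible workaround, not a confirmed fix. The helper names build_data_files and load_c4_streaming are hypothetical, and the glob patterns assume that every shard follows the same naming scheme as the validation file quoted above.

```python
# Assumption: each split's shards are named "<config>/c4-<split>.*.json.gz",
# extrapolated from 'en/c4-validation.00003-of-00008.json.gz' in the report.
def build_data_files(config="en", splits=("train", "validation")):
    """Return an explicit split -> glob pattern mapping (hypothetical helper)."""
    return {split: f"{config}/c4-{split}.*.json.gz" for split in splits}


def load_c4_streaming(config="en"):
    """Load allenai/c4 in streaming mode with explicit data_files (hypothetical helper)."""
    # Imported lazily so that any HF_ENDPOINT value set earlier in the script
    # is already in the environment when datasets is imported.
    from datasets import load_dataset

    return load_dataset(
        "allenai/c4",
        data_files=build_data_files(config),
        streaming=True,
    )


print(build_data_files())
```

Calling load_c4_streaming() needs network access. If the error persists even with explicit patterns, the cause is more likely in the environment than in the call itself.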

Steps to reproduce the bug

Run the following code:

import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
from datasets import load_dataset

en = load_dataset("allenai/c4", "en", streaming=True)

Expected behavior

Successfully loaded the dataset.

Environment info

xioatian1 commented 2 months ago

How did you solve it?

zouhuigang commented 1 month ago

How did you solve it?

Please check your Python environment and your datasets version. I have just resolved the issue on my side; it was caused by accidentally switching to the wrong Python environment.
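A quick way to run that check is to print which interpreter is active and which datasets version it sees. This is a minimal sketch using only the standard library. The function name describe_environment is made up for illustration, and the version is reported as None if the package is not installed in the current environment.

```python
import sys
from importlib import metadata


def describe_environment(package="datasets"):
    """Report the active interpreter and the installed version of `package`."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is missing from the environment this interpreter uses.
        version = None
    return {
        "python_executable": sys.executable,
        "python_version": sys.version.split()[0],
        f"{package}_version": version,
    }


print(describe_environment())
```

If the executable path points to a different environment than the one you expected, or the datasets version is missing or unexpected, activate the intended environment and try again.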