Describe the bug
When I run the code en = load_dataset("allenai/c4", "en", streaming=True), I encounter an error:
raise ValueError(f"Couldn't infer the same data file format for all splits. Got {split_modules}")
ValueError: Couldn't infer the same data file format for all splits. Got {'train': ('json', {}), 'validation': (None, {})}
However, running
dataset = load_dataset('allenai/c4', streaming=True, data_files={'validation': 'en/c4-validation.00003-of-00008.json.gz'}, split='validation')
works fine. What is the issue here?
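A possible workaround, sketched here and not verified against the mirror, is to extend the explicit data_files approach that worked for the single validation shard to both splits. The sketch assumes that the train shards follow the same naming pattern as the validation shard above (en/c4-train.*.json.gz); the helper names c4_data_files and load_c4 are hypothetical and only used for illustration.

```python
import os


def c4_data_files(config="en"):
    """Build one glob pattern per split (hypothetical helper)."""
    # Assumption: shards are named <config>/c4-<split>.<index>-of-<total>.json.gz,
    # matching the validation shard path that loaded successfully.
    return {
        split: f"{config}/c4-{split}.*.json.gz"
        for split in ("train", "validation")
    }


def load_c4(config="en", endpoint="https://hf-mirror.com"):
    """Stream allenai/c4 with explicit data_files instead of the config name."""
    # Set the endpoint before importing datasets, as in the reproduction steps.
    os.environ["HF_ENDPOINT"] = endpoint
    from datasets import load_dataset

    # Passing data_files skips the per-split format inference that raised the error.
    return load_dataset(
        "allenai/c4",
        data_files=c4_data_files(config),
        streaming=True,
    )
```

Calling load_c4("en") should return a dictionary-like object with "train" and "validation" streaming splits, if the patterns resolve on the endpoint in use.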
Steps to reproduce the bug
Run the following code:
import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
from datasets import load_dataset
en = load_dataset("allenai/c4", "en", streaming=True)
Expected behavior
Successfully loaded the dataset.
Environment info
datasets version: 2.18.0
huggingface_hub version: 0.22.2
fsspec version: 2024.2.0
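For anyone trying to reproduce this, the same three versions can be collected with a short script. This is a minimal sketch using the standard library; the function name collect_versions is hypothetical, and the datasets-cli env command prints a fuller report.

```python
from importlib.metadata import PackageNotFoundError, version


def collect_versions(packages=("datasets", "huggingface_hub", "fsspec")):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Package is missing from this environment.
            found[name] = None
    return found


for name, ver in collect_versions().items():
    print(f"{name} version: {ver}")
```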