huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

ExpectedMoreSplits error on load_dataset when upgrading to 2.19.0 #6836

Open ebsmothers opened 6 months ago

ebsmothers commented 6 months ago

Describe the bug

Hi there, thanks for the great library! We have been using it a lot in torchtune and it's been a huge help for us.

Regarding the bug: the same call to load_dataset fails with ExpectedMoreSplits in 2.19.0 after working fine in 2.18.0. Full details are in the repro below.

Steps to reproduce the bug

On 2.18.0, things work fine:

# First clear the locally cached dataset
rm -r ~/.cache/huggingface/datasets/lvwerra___stack-exchange-paired
pip install "datasets==2.18.0"
python3
>>> from datasets import load_dataset
>>> dataset = load_dataset('lvwerra/stack-exchange-paired', split='train', data_dir='data/rl')

On 2.19.0, they do not:

# First clear the locally cached dataset
rm -r ~/.cache/huggingface/datasets/lvwerra___stack-exchange-paired
pip install "datasets==2.19.0"
python3
>>> from datasets import load_dataset
>>> dataset = load_dataset('lvwerra/stack-exchange-paired', split='train', data_dir='data/rl')

The stack trace I see from the 2.19.0 version of load_dataset can be seen here.

Perhaps unsurprisingly, if I do not delete the cache first, I am able to load the dataset successfully. Based on this, I suspect the cause is somewhere in the download logic.
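For anyone who wants to rerun the repro under each version without retyping the shell steps, here is a minimal sketch of the same steps as a Python script. The helper names (`cached_dataset_dir`, `clear_cached_dataset`, `try_load`) are made up for illustration and are not part of datasets. The cache path assumes the default location used in the `rm -r` commands above; it will differ if the cache location has been overridden.

```python
import shutil
from pathlib import Path


def cached_dataset_dir(repo_id, cache_root=None):
    """Return the cache directory that the repro deletes for a Hub repo id."""
    # Assumption: default cache root, as in the `rm -r` commands of the repro.
    root = Path(cache_root) if cache_root else Path.home() / ".cache" / "huggingface" / "datasets"
    # "lvwerra/stack-exchange-paired" -> "lvwerra___stack-exchange-paired"
    return root / repo_id.replace("/", "___")


def clear_cached_dataset(repo_id, cache_root=None):
    """Delete the cached copy, if any; return True when a directory existed."""
    path = cached_dataset_dir(repo_id, cache_root)
    existed = path.exists()
    shutil.rmtree(path, ignore_errors=True)
    return existed


def try_load(repo_id, **load_kwargs):
    """Call load_dataset and return (ok, detail) instead of raising."""
    import datasets  # imported lazily so the helpers above work without it

    try:
        ds = datasets.load_dataset(repo_id, **load_kwargs)
    except Exception as exc:
        # Report the exception type by name, e.g. ExpectedMoreSplits.
        return False, f"{type(exc).__name__}: {exc}"
    return True, f"loaded {ds.num_rows} rows"


# Usage (needs network access and an installed datasets version):
# clear_cached_dataset("lvwerra/stack-exchange-paired")
# print(try_load("lvwerra/stack-exchange-paired", split="train", data_dir="data/rl"))
```

Since `try_load` forwards its keyword arguments, it can also be called with `verification_mode="no_checks"`, an existing `load_dataset` parameter. I have not tried this here, but if the load then succeeds on 2.19.0, that would hint that the error comes from the split verification step rather than the download itself.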

Expected behavior

Download the dataset successfully :)

Environment info

relic-yuexi commented 6 months ago

I get the same error on the same dataset too.

jxmsML commented 6 months ago

+1

whwhwwhh commented 5 months ago

same error