Describe the bug
Hi there, thanks for the great library! We have been using it a lot in torchtune and it's been a huge help for us.
Regarding the bug: the same call to load_dataset errors with ExpectedMoreSplits in 2.19.0 after working fine in 2.18.0. Full details are given in the repro below.
Steps to reproduce the bug
On 2.18.0, things work fine:
# First clear the locally cached dataset
rm -r ~/.cache/huggingface/datasets/lvwerra___stack-exchange-paired
pip install "datasets==2.18.0"
python3
>>> from datasets import load_dataset
>>> dataset = load_dataset('lvwerra/stack-exchange-paired', split='train', data_dir='data/rl')
On 2.19.0, they do not:
# First clear the locally cached dataset
rm -r ~/.cache/huggingface/datasets/lvwerra___stack-exchange-paired
pip install "datasets==2.19.0"
python3
>>> from datasets import load_dataset
>>> dataset = load_dataset('lvwerra/stack-exchange-paired', split='train', data_dir='data/rl')
The stack trace I see from the 2.19.0 version of load_dataset can be seen here.
Notably (though maybe unsurprisingly), if I do not delete the cache first, I am able to load the dataset successfully. Based on this, I suspect the cause is somewhere in the download logic.
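In case it helps, here is a rough sketch of the same repro as a single script that points load_dataset at an empty temporary cache directory instead of deleting the global one. The helper names (build_load_kwargs, try_load, main) are just illustrative, and I am assuming that a fresh cache_dir exercises the same download path as clearing ~/.cache/huggingface/datasets. I have not confirmed the two are equivalent.

```python
import tempfile


def build_load_kwargs(cache_dir):
    """Arguments for the failing call, pointed at the given cache directory."""
    return {
        "path": "lvwerra/stack-exchange-paired",
        "split": "train",
        "data_dir": "data/rl",
        # Assumption: an empty cache_dir behaves like a freshly cleared cache.
        "cache_dir": cache_dir,
    }


def try_load(cache_dir):
    """Return None if the load succeeds, else the name of the exception raised."""
    # Imported lazily so build_load_kwargs is usable without datasets installed.
    from datasets import load_dataset

    try:
        load_dataset(**build_load_kwargs(cache_dir))
    except Exception as exc:  # e.g. ExpectedMoreSplits on 2.19.0
        return type(exc).__name__
    return None


def main():
    """Print the installed datasets version and the outcome of a cold load."""
    import datasets

    # Note: this downloads the data/rl files again, which can take a while.
    with tempfile.TemporaryDirectory() as tmp:
        print(datasets.__version__, try_load(tmp))
```

Calling main() once under each installed version (2.18.0 and 2.19.0) should show whether the cold-cache load succeeds or which exception it raises.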
Expected behavior
Download the dataset successfully :)
Environment info
datasets version: 2.19.0
huggingface_hub version: 0.22.2
fsspec version: 2024.3.1