load_dataset broken in 2.21.0

anjor commented 3 months ago

Describe the bug

eval_set = datasets.load_dataset("tatsu-lab/alpaca_eval", "alpaca_eval_gpt4_baseline", trust_remote_code=True) used to work till 2.20.0 but doesn't work in 2.21.0

In 2.20.0: [screenshot, 2024-08-16]

In 2.21.0: [screenshot, 2024-08-16]

Steps to reproduce the bug

1. Spin up a new Google Colab notebook
2. pip install datasets==2.21.0
3. import datasets
4. eval_set = datasets.load_dataset("tatsu-lab/alpaca_eval", "alpaca_eval_gpt4_baseline", trust_remote_code=True)
5. The call throws an error.
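
For anyone who prefers a single script over notebook cells, here is a minimal sketch of the same reproduction. The helper names (`installed_version`, `try_load`, `reproduce`) are made up for illustration and are not part of datasets; the only library call is the `load_dataset` line from the report. It assumes datasets is installed and the Hub is reachable.

```python
import importlib.metadata


def installed_version(package="datasets"):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None


def try_load(loader, *args, **kwargs):
    """Call `loader` and return (ok, result_or_exception) instead of raising."""
    try:
        return True, loader(*args, **kwargs)
    except Exception as exc:  # deliberately broad: any failure is worth reporting
        return False, exc


def reproduce():
    """Run the call from the report; needs `datasets` installed and network access."""
    import datasets  # imported lazily so the helpers above work without it

    print("datasets version:", installed_version("datasets"))
    ok, result = try_load(
        datasets.load_dataset,
        "tatsu-lab/alpaca_eval",
        "alpaca_eval_gpt4_baseline",
        trust_remote_code=True,
    )
    print("loaded OK" if ok else f"failed with {type(result).__name__}: {result}")
    return ok
```

Calling `reproduce()` once under 2.21.0 and once under 2.20.0 gives two short outputs that can be pasted into the issue as text instead of screenshots.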

Expected behavior

Repeating the steps above with datasets==2.20.0 instead of 2.21.0 loads the dataset without error.

Environment info

datasets version: 2.21.0
Platform: Linux-6.1.85+-x86_64-with-glibc2.35
Python version: 3.10.12
huggingface_hub version: 0.23.5
PyArrow version: 17.0.0
Pandas version: 2.1.4
fsspec version: 2024.5.0
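
In case it helps others compare environments, a list like the one above can be gathered with the standard library alone. This is only a sketch: `collect_env` is a hypothetical helper name, and the package list is simply the set of packages named in this report.

```python
import platform
from importlib import metadata

# Packages listed in the environment info above.
PACKAGES = ["datasets", "huggingface_hub", "pyarrow", "pandas", "fsspec"]


def collect_env(packages=PACKAGES):
    """Return a dict of platform, Python version, and installed package versions."""
    info = {
        "Platform": platform.platform(),
        "Python version": platform.python_version(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


def format_env(info):
    """Render the dict as 'key: value' lines, one per entry."""
    return "\n".join(f"{key}: {value}" for key, value in info.items())
```

`print(format_env(collect_env()))` prints one line per entry, which makes it easy to diff a working environment against a failing one.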

anjor commented 3 months ago

There seems to be a PR related to the load_dataset path that went into 2.21.0 -- https://github.com/huggingface/datasets/pull/6862/files

Taking a look at it now

CShorten commented 3 months ago

+1

Downgrading to 2.20.0 fixed my issue, hopefully helpful for others.

anjor commented 3 months ago

I tried adding a simple test to test_load.py with the alpaca eval dataset but the test didn't fail :(.

So looks like this might have something to do with the environment?

albertvillanova commented 3 months ago

There was an issue with the script of the "tatsu-lab/alpaca_eval" dataset.

It was fixed with this PR:

Fix FileNotFoundError

It should work now if you retry to load the dataset.
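
If a plain retry still fails, one thing to try (an assumption on my side, not something confirmed in this thread) is that a stale copy of the dataset script is being served from the local cache. A minimal sketch that asks `load_dataset` to bypass cached copies via its `download_mode` argument; `reload_fresh` is a hypothetical wrapper name:

```python
def reload_fresh(loader, path, name):
    """Call a load_dataset-style `loader`, asking it to ignore cached copies."""
    return loader(
        path,
        name,
        trust_remote_code=True,
        download_mode="force_redownload",  # assumption: the local cache is stale
    )
```

Usage would be `reload_fresh(datasets.load_dataset, "tatsu-lab/alpaca_eval", "alpaca_eval_gpt4_baseline")` after `import datasets`.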
