huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

load_dataset broken in 2.21.0 #7107

Closed anjor closed 3 months ago

anjor commented 3 months ago

Describe the bug

eval_set = datasets.load_dataset("tatsu-lab/alpaca_eval", "alpaca_eval_gpt4_baseline", trust_remote_code=True) worked up to and including 2.20.0 but fails in 2.21.0.

In 2.20.0: Screenshot 2024-08-16 at 3 57 10 PM

In 2.21.0: Screenshot 2024-08-16 at 3 57 24 PM

Steps to reproduce the bug

  1. Spin up a new Google Colab notebook
  2. pip install datasets==2.21.0
  3. import datasets
  4. eval_set = datasets.load_dataset("tatsu-lab/alpaca_eval", "alpaca_eval_gpt4_baseline", trust_remote_code=True)
  5. The call in step 4 throws an error.
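
For anyone who wants to run the steps above outside a notebook, here is a minimal sketch of the same reproduction as a script. The helper names `installed_version` and `reproduce` are made up for illustration, and since the report does not say which exception is raised, the sketch catches any exception and returns it for inspection.

```python
import importlib.metadata


def installed_version(package="datasets"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None


def reproduce():
    """Attempt the load from the report; return the exception raised, or None on success."""
    # Imported lazily so installed_version() can be used without datasets present.
    import datasets

    try:
        datasets.load_dataset(
            "tatsu-lab/alpaca_eval",
            "alpaca_eval_gpt4_baseline",
            trust_remote_code=True,
        )
    except Exception as exc:  # exact exception type is not stated in the report
        return exc
    return None
```

Calling `print(installed_version(), repr(reproduce()))` after each `pip install datasets==<version>` should make it easy to compare the two versions side by side. Note that `reproduce()` needs network access.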

Expected behavior

Repeating steps 1-5 with datasets version 2.20.0 instead of 2.21.0 works.

Environment info

anjor commented 3 months ago

There seems to be a PR related to the load_dataset path that went into 2.21.0 -- https://github.com/huggingface/datasets/pull/6862/files

Taking a look at it now

CShorten commented 3 months ago

+1

Downgrading to 2.20.0 fixed my issue, hopefully helpful for others.
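
A rough sketch of a guard that flags the versions mentioned in this thread, in case it helps someone check their environment before loading. The names `REPORTED_BROKEN`, `REPORTED_WORKING`, `parse_release`, and `status` are hypothetical, and the sets only contain what was reported here (2.21.0 failing, 2.20.0 working); nothing is assumed about any other release.

```python
import re

# Versions named in this thread; other releases are simply unknown.
REPORTED_BROKEN = {(2, 21, 0)}
REPORTED_WORKING = {(2, 20, 0)}


def parse_release(version):
    """Extract (major, minor, patch) from a version string, or None if it does not match."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def status(version):
    """Classify a datasets version according to the reports in this thread."""
    release = parse_release(version)
    if release in REPORTED_BROKEN:
        return "reported broken"
    if release in REPORTED_WORKING:
        return "reported working"
    return "unknown"
```

To downgrade, the same pattern as in the reproduction steps applies: `pip install datasets==2.20.0`.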

anjor commented 3 months ago

I tried adding a simple test to test_load.py with the alpaca eval dataset but the test didn't fail :(.

So looks like this might have something to do with the environment?

albertvillanova commented 3 months ago

There was an issue with the script of the "tatsu-lab/alpaca_eval" dataset.

It was fixed with this PR:

It should work now if you try loading the dataset again.