tlc4418 / llm_optimization

A repo for RLHF training and BoN over LLMs, with support for reward model ensembles.
https://arxiv.org/abs/2310.02743
MIT License

Unknown split error when loading RM dataset `tlc4418/1.4b-policy_preference_data_gold_labelled` #11

Closed JohannesAck closed 1 month ago

JohannesAck commented 1 month ago

Hi, I've been using this project and it's still very helpful, thanks a lot! Since there's been more activity in the issues lately, I'd like to document a quick fix for other users:

When loading the gold-labeled preference dataset, it crashes with the following error:

In [2]: load_dataset('tlc4418/1.4b-policy_preference_data_gold_labelled', split=["train", "validation"])
[...]
File /usr/local/lib/python3.10/dist-packages/datasets/arrow_reader.py:480, in _rel_to_abs_instr(rel_instr, name2len)
    478 split = rel_instr.splitname
    479 if split not in name2len:
--> 480     raise ValueError(f'Unknown split "{split}". Should be one of {list(name2len)}.')
    481 num_examples = name2len[split]
    482 from_ = rel_instr.from_

ValueError: Unknown split "validation". Should be one of ['train'].

This is caused by the file structure of the tlc4418/1.4b-policy_preference_data_gold_labelled dataset: val.json sits at the top level next to the train folder, and only the train split is detected.

The original folder structure is:

1.4b-policy_preference_data_gold_labelled/
├── README.md
├── train
│   ├── human_pref.json
│   ├── sft.json
│   ├── synth_pref.json
│   └── unlabelled.json
└── val.json

Downloading the dataset and putting val.json in its own subfolder fixes this issue, i.e.:

1.4b-policy_preference_data_gold_labelled/
├── README.md
├── train
│   ├── human_pref.json
│   ├── sft.json
│   ├── synth_pref.json
│   └── unlabelled.json
└── validation
    └── val.json

With this layout, both splits load from the local copy:

In [15]: train, val = load_dataset('1.4b-policy_preference_data_gold_labelled', split=["train","validation"])
Repo card metadata block was not found. Setting CardData to empty.
In [16]: 
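
The manual move described above can be scripted. The following is a minimal sketch, not part of the repo: the helper name `move_val_into_subfolder` and its parameters are hypothetical, and it assumes a local copy of the dataset with the original layout (val.json at the top level).

```python
from pathlib import Path
import shutil


def move_val_into_subfolder(root, filename="val.json", split_dir="validation"):
    """Move <root>/val.json to <root>/validation/val.json and return the new path.

    Hypothetical helper for illustration; `root` is the local dataset folder.
    """
    root = Path(root)
    src = root / filename
    dst_dir = root / split_dir
    dst = dst_dir / filename

    if dst.exists():
        # Layout is already fixed; nothing to do.
        return dst
    if not src.exists():
        raise FileNotFoundError(f"{src} not found")

    dst_dir.mkdir(exist_ok=True)
    shutil.move(str(src), str(dst))
    return dst
```

One way to get the local copy is `huggingface_hub.snapshot_download(repo_id='tlc4418/1.4b-policy_preference_data_gold_labelled', repo_type='dataset', local_dir='1.4b-policy_preference_data_gold_labelled')`, followed by `move_val_into_subfolder('1.4b-policy_preference_data_gold_labelled')`.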

The Hugging Face pull request API doesn't work for me, so I'll just leave this here as an issue for people who might encounter the same bug.
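
A possible alternative that avoids moving files is to pass the split-to-file mapping explicitly through the `data_files` argument of `load_dataset`. This is a sketch that has not been verified against this dataset: the name `load_train_and_val` is hypothetical, and the paths assume the original layout shown above (train/*.json plus a top-level val.json).

```python
# Explicit mapping from split name to file pattern, relative to the dataset root.
# Assumes the original layout: a train/ folder and val.json at the top level.
DATA_FILES = {
    "train": "train/*.json",
    "validation": "val.json",
}


def load_train_and_val(repo_id="tlc4418/1.4b-policy_preference_data_gold_labelled"):
    """Load both splits using the explicit mapping (hypothetical helper)."""
    # Imported lazily so the mapping can be inspected without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, data_files=DATA_FILES, split=["train", "validation"])
```

Calling `train, val = load_train_and_val()` would then return the two splits in the same order as the `split` list.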

tlc4418 commented 1 month ago

Thank you for raising this issue! I have updated the dataset on Hugging Face to fix this. Let me know if you still encounter issues.