huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

Allow loading datasets from disk using `load_from_disk` method. #53

Closed: dmilcevski closed this 8 months ago

dmilcevski commented 8 months ago

If a dataset was created with `ds.save_to_disk()`, `load_dataset()` cannot load it from the local path and raises an error. Possible fixes are to provide a custom loading script or to use the `load_from_disk` method, as proposed in this PR. The dataset is stored in `.arrow` format, and each split is saved in its own folder (e.g. `ds.save_to_disk(os.path.join(path, dataset_name))`).
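To illustrate the idea, here is a minimal sketch of choosing between `load_from_disk` and `load_dataset` based on what is on disk. It is not the code from this PR: `is_saved_to_disk` and `load_local_or_hub` are hypothetical names, and the detection relies on the marker files that `save_to_disk()` writes (`state.json` for a `Dataset`, `dataset_dict.json` for a `DatasetDict`).

```python
import os


def is_saved_to_disk(path: str) -> bool:
    """Return True if `path` looks like the output of `save_to_disk()`.

    A saved `Dataset` folder contains `state.json`; a saved `DatasetDict`
    folder contains `dataset_dict.json`.
    """
    markers = ("state.json", "dataset_dict.json")
    return os.path.isdir(path) and any(
        os.path.isfile(os.path.join(path, name)) for name in markers
    )


def load_local_or_hub(name_or_path: str, split: str):
    """Hypothetical helper: pick the loader that matches how the data was stored."""
    # Imported lazily so the detection helper above works without `datasets`.
    from datasets import DatasetDict, load_dataset, load_from_disk

    split_dir = os.path.join(name_or_path, split)
    if is_saved_to_disk(split_dir):
        # Assumed layout: one folder per split, each written by save_to_disk().
        return load_from_disk(split_dir)
    if is_saved_to_disk(name_or_path):
        # A DatasetDict (or a single Dataset) saved at the top level.
        saved = load_from_disk(name_or_path)
        return saved[split] if isinstance(saved, DatasetDict) else saved
    # Otherwise treat it as a Hub dataset name or a regular data directory.
    return load_dataset(name_or_path, split=split)
```

Checking for the marker files first avoids calling `load_dataset()` on a folder it cannot read; catching the loading error and retrying with `load_from_disk` would be an alternative design.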

Training can then be started by pointing the config files at the local dataset path:

dataset_mixer:
  /path/to/my/dataset/: 1.0
dataset_splits:
- train_sft
- test_sft
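As a rough sketch of how such a config could map onto folders on disk, the snippet below expands the mixer entries into one directory per split. It assumes the one-folder-per-split layout described above and reads the mixer value as the fraction of the data to use; `resolve_split_dirs` is a hypothetical helper, not part of the handbook.

```python
import os


def resolve_split_dirs(dataset_mixer: dict, dataset_splits: list) -> list:
    """Hypothetical helper: expand a mixer config into (split, directory, fraction)."""
    resolved = []
    for path, fraction in dataset_mixer.items():
        if fraction < 0:
            raise ValueError(f"Negative fraction for {path!r}: {fraction}")
        for split in dataset_splits:
            # Assumed layout: <path>/<split>, each folder written by save_to_disk().
            resolved.append((split, os.path.join(path, split), fraction))
    return resolved


# The config above, written as a plain dict to avoid a YAML dependency.
config = {
    "dataset_mixer": {"/path/to/my/dataset/": 1.0},
    "dataset_splits": ["train_sft", "test_sft"],
}

for split, directory, fraction in resolve_split_dirs(
    config["dataset_mixer"], config["dataset_splits"]
):
    print(f"{split}: {directory} (fraction={fraction})")
```

Each resulting directory could then be passed to `load_from_disk`.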

mathis-lambert commented 8 months ago

Nice! I had made the same change locally before seeing your PR, and it works well indeed ^^