huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0
4.55k stars 393 forks source link

Update docstring for `data.py` to reflect true behavior of `shuffle` parameter #60

Closed scottfleming closed 10 months ago

scottfleming commented 10 months ago

The docs state that the shuffle parameter in mix_datasets from data.py controls Whether to shuffle the training data, but then in the code if shuffle is set to True it also shuffles the test data. This small change makes the functionality consistent with the docstring. (If you instead want to keep the functionality the same, then we should update the docstring).

scottfleming commented 10 months ago

@lewtun thanks and sounds good! I've updated the code in the PR to revert back to the original version but updated the docstrings for both get_datasets and mix_datasets to make it clear that shuffle affects both train and test splits.

That being said, I would prefer if there were some way to control shuffling so that one could shuffle just the train data and not the test data. This ends up being important if you want to eg store the predictions and compare to another model. In those cases you have to cache the index in run_sft.py, for example, if you want the predictions to all be in the same order.

Happy to implement the PR if you agree. In the simplest case this would be something like adding a shuffle_test arg to get_datasets that is False by default. Then the eval set is not shuffled by default and there are no breaking changes to the API + the functionality would actually be in line with the original docstrings stated.

lewtun commented 10 months ago

Thanks for iterating @scottfleming ! And good point about wanting to keep the test set unshuffled - happy to have a PR for that :)

For the API, an alternative would be to have a new arg like shuffle_splits : List[str] = ["train", "test"] and then deprecate the shuffle arg.

scottfleming commented 10 months ago

Oh great point re:the new arg structure being an enumeration of which splits should be shuffled. I've got a busy few weeks ahead of me, but I'll try to revisit afterwards. Also making a note for myself that the seed parameter for shuffle should also be an arg with 42 as the default.