Closed scottfleming closed 10 months ago
@lewtun thanks and sounds good! I've updated the code in the PR to revert back to the original version but updated the docstrings for both get_datasets
and mix_datasets
to make it clear that shuffle
affects both train and test splits.
That being said, I would prefer if there were some way to control shuffling so that one could shuffle just the train data and not the test data. This ends up being important if you want to eg store the predictions and compare to another model. In those cases you have to cache the index in run_sft.py
, for example, if you want the predictions to all be in the same order.
Happy to implement the PR if you agree. In the simplest case this would be something like adding a shuffle_test
arg to get_datasets
that is False
by default. Then the eval set is not shuffled by default and there are no breaking changes to the API + the functionality would actually be in line with the original docstrings stated.
Thanks for iterating @scottfleming ! And good point about wanting to keep the test set unshuffled - happy to have a PR for that :)
For the API, an alternative would be to have a new arg like shuffle_splits : List[str] = ["train", "test"]
and then deprecate the shuffle
arg.
Oh great point re:the new arg structure being an enumeration of which splits should be shuffled. I've got a busy few weeks ahead of me, but I'll try to revisit afterwards. Also making a note for myself that the seed parameter for shuffle
should also be an arg with 42
as the default.
The docs state that the
shuffle
parameter inmix_datasets
fromdata.py
controlsWhether to shuffle the training data
, but then in the code ifshuffle
is set toTrue
it also shuffles the test data. This small change makes the functionality consistent with the docstring. (If you instead want to keep the functionality the same, then we should update the docstring).