huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

Clarification on dataset mixer #157

Open deep-diver opened 2 months ago

deep-diver commented 2 months ago

From the README in /scripts:

dataset_mixer:
    dataset_1: 0.5  # Use 50% of the training examples
    dataset_2: 0.66 # Use 66% of the training examples
    dataset_3: 0.10 # Use 10% of the training examples
dataset_splits:
- train_xxx         # The training splits to mix
- test_xxx          # The test splits to mix

From the comments, it looks like ONLY training samples from dataset_1, dataset_2, and dataset_3 are considered. There is no explanation of how each dataset contributes to the test_xxx split.

However, the actual implementation appears to search for the test_xxx split across all of the specified datasets:

https://github.com/huggingface/alignment-handbook/blob/70769f9e9ba41c7f08ba6c4ff3725441b68b7ca3/src/alignment/data.py#L225-L230

Could you please explain the relationship between multiple datasets and their splits? Thank you.

shabie commented 2 months ago

From the comments, it looks like ONLY training samples from dataset_1, dataset_2, and dataset_3 are considered. There is no explanation of how each dataset contributes to the test_xxx split.

Each dataset should have separate train and test splits. This is made clear in the docstring, where the expectation is that they start with train_ and test_ respectively. The percentages sample that fraction of the datapoints from each train split. The corresponding test split is taken in full, since subsampling the validation data seems pointless (unless validation is super expensive, in which case maybe).

If the confusion was that the data mixer automatically uses the "unused" part of the train split as a test dataset (like scikit-learn lets us do), then no, that doesn't happen here. I like that, because it keeps the test set from being mistakenly pulled into training just by changing the percentages of the mix.

Anyhow, all this is based on my understanding of the code. Hope it helps, and if I am wrong, please correct me :)
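To make my mental model concrete, here is a minimal stdlib-only sketch of the behaviour described above: each train split is subsampled by its fraction, while every test split is concatenated in full. The function name mix_datasets and the toy data are hypothetical; this is not the handbook's actual implementation in src/alignment/data.py.

```python
import random

def mix_datasets(splits_by_dataset, fractions, seed=42):
    """Subsample each train split by its fraction; keep test splits whole.

    splits_by_dataset: {name: {"train": [...], "test": [...]}}
    fractions:         {name: float in [0, 1]}
    """
    rng = random.Random(seed)
    mixed_train, mixed_test = [], []
    for name, frac in fractions.items():
        train = splits_by_dataset[name]["train"]
        # Take only `frac` of the training examples...
        n = int(frac * len(train))
        mixed_train.extend(rng.sample(train, n))
        # ...but always take the full test split, regardless of `frac`.
        mixed_test.extend(splits_by_dataset[name]["test"])
    return mixed_train, mixed_test

data = {
    "dataset_1": {"train": list(range(100)), "test": ["t1"] * 10},
    "dataset_2": {"train": list(range(50)),  "test": ["t2"] * 5},
}
train, test = mix_datasets(data, {"dataset_1": 0.5, "dataset_2": 0.66})
print(len(train))  # 50 + 33 = 83 training examples after subsampling
print(len(test))   # 10 + 5 = 15, both test splits kept in full
```

Note that in this reading, a fraction only ever shrinks a train split; the test side of the mix is never affected by it.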

deep-diver commented 2 months ago

Thank you @shabie

I think it is a common setup to have the test dataset in a single repo while the training data comes from multiple sources.

At least this is my use case. To handle it, I ended up merging the multiple datasets into a single one myself. I am just hoping this could be done in the alignment handbook too.

JIElite commented 3 weeks ago

If we assign a dataset in the mix a ratio of 0.0, what happens to its test set?

deep-diver commented 3 weeks ago

@JIElite

AFAIK, the ratio doesn't have any impact on the test split.
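That matches a fraction-based subsampling scheme: a 0.0 ratio zeroes out only the train contribution. A toy check with hypothetical split sizes (not the handbook's code):

```python
# With a mix ratio of 0.0, the train split contributes zero examples...
train_split = list(range(1000))
n_train_kept = int(0.0 * len(train_split))
print(n_train_kept)  # 0

# ...while the test split is concatenated in full, untouched by the ratio.
test_split = list(range(100))
print(len(test_split))  # 100
```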

JIElite commented 3 weeks ago

@deep-diver Thanks for the reply. So the test set will still be used for evaluation, right? Even if we assign the mix ratio to 0.0.