stanfordnlp / pyreft

ReFT: Representation Finetuning for Language Models
https://arxiv.org/abs/2404.03592
Apache License 2.0

[P1] How did you create the validation set for Commonsense reasoning hyperparameter tuning? #81

Closed · Edenzzzz closed this 1 month ago

Edenzzzz commented 1 month ago

The paper mentions taking the last 300 examples of the training set for math reasoning, but I can't seem to find the corresponding details for commonsense reasoning. Thanks!

frankaging commented 1 month ago
[Screenshot of the hyperparameter tuning description from Appendix C, pg. 22 of the paper]

@Edenzzzz thanks for your question! our hyperparameter tuning process is described in the screenshot above (Appendix C, pg. 22 of the paper): we use the last 300 examples of the GSM8K training set for hyperparameter tuning.

to see how this translates into code, take a look at https://github.com/stanfordnlp/pyreft/blob/main/examples/loreft/dataset.py#L109:

    def postprocess(self, kwargs):
        original_dataset_size = len(self.task_dataset)
        # When training on GSM8K with a validation test split, drop the
        # last 300 examples so they are never seen during training.
        if self.task in ["gsm8k"] and \
            self.original_data_split == "train" and self.test_split == "validation":
            self.task_dataset = self.task_dataset.select(
                range(original_dataset_size - 300))
        # The validation split is exactly those held-out last 300 examples.
        if self.task in ["gsm8k"] and self.original_data_split == "validation":
            self.task_dataset = self.task_dataset.select(
                range(original_dataset_size - 300, original_dataset_size))
        self.raw_dataset = self.task_dataset  # also update the raw dataset pointer.
        return

we use the remaining examples (the training set minus those 300) as our training set, and evaluate our models on the held-out 300 examples.
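for anyone who wants to reproduce this split outside of the repo's dataset class, here is a minimal standalone sketch using the Hugging Face `datasets` library (which the repo already depends on); the variable names are illustrative, not from the pyreft codebase:

    from datasets import load_dataset

    # Load the full GSM8K training split.
    full_train = load_dataset("gsm8k", "main", split="train")
    n = len(full_train)

    # Everything except the last 300 examples is used for training ...
    train_set = full_train.select(range(n - 300))
    # ... and the held-out last 300 examples form the validation set
    # used for hyperparameter tuning.
    validation_set = full_train.select(range(n - 300, n))

    print(len(train_set), len(validation_set))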

i also updated our README to provide an example script for running hyperparameter tuning: https://github.com/stanfordnlp/pyreft/blob/main/examples/loreft/README.md#hyperparameter-tuning
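to give a sense of what such a sweep looks like, here is a hedged sketch of a simple grid search; `train_and_eval`, the grid values, and the layer strings are hypothetical stand-ins for a full training-plus-evaluation run (e.g., invoking the loreft training script), not pyreft APIs:

    import itertools

    # Hypothetical stand-in for one full LoReFT training + evaluation run;
    # in practice this would launch training and report accuracy on the
    # held-out 300 GSM8K validation examples.
    def train_and_eval(lr, rank, layers):
        return 0.0  # placeholder score

    # Illustrative grid (not the exact values from the paper).
    grid = {
        "lr": [6e-4, 9e-4, 1.2e-3],
        "rank": [4, 8],
        "layers": ["all", "3;9;18;24"],
    }

    best_score, best_config = -1.0, None
    for lr, rank, layers in itertools.product(*grid.values()):
        score = train_and_eval(lr, rank, layers)
        if score > best_score:
            best_score = score
            best_config = {"lr": lr, "rank": rank, "layers": layers}
    print("best:", best_score, best_config)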

Edenzzzz commented 1 month ago

Thanks for your quick reply! I was also wondering whether it would be better to tune on the eval datasets of commonsense reasoning (BoolQ, PIQA, etc.) instead of a math reasoning dataset.

frankaging commented 1 month ago

@Edenzzzz yes, i think it would be interesting to see how that performs. there are two main reasons why we did it this way:

1) It ensures a somewhat fair comparison to LoRA/DoRA. The LoRA hyperparameters were tuned on the math reasoning datasets provided in the LLM-Adapters paper, i.e., tuned on math reasoning and then tested on commonsense reasoning, and DoRA follows that paper. So we thought it would be better to follow a similar paradigm.

2) We pick the GSM8K training set for our hyperparameter tuning (HT) to ensure there is no data leakage. Based on my reading of the LLM-Adapters paper as well as their repo (correct me if i am wrong), they ran their HT search (selecting ranks, which modules to apply LoRA to, etc.) on the actual math test sets, which is not ideal. To avoid this while keeping a similar search paradigm, we pick the last 300 examples of the GSM8K training set.

Besides these two practical reasons, I agree it would be better to tune each method separately (which would mean a full HT search for every method on a commonsense reasoning dev set). Given the time limit, we ended up with the current solution.

frankaging commented 1 month ago

Closing this issue for now, feel free to reopen if there are new issues! Thanks!

Edenzzzz commented 1 month ago

Thanks for your detailed reply! I couldn't find the hyperparameter tuning details in the LLM-Adapters paper, but this might be what you meant:

[Screenshot of the relevant passage from the LLM-Adapters paper]

They subsequently mention training on Commonsense170K and Math10K, which aligns with your paper.