stanfordnlp / pyreft

ReFT: Representation Finetuning for Language Models
https://arxiv.org/abs/2404.03592

[P1] Experimental setup for instruction following experiments in the ReFT paper #107

Closed by savadikarc 3 weeks ago

savadikarc commented 3 weeks ago

Thanks for this awesome library, and for the example scripts!

I have a couple of basic questions about the experimental setup for the instruction-following experiments in the ReFT paper. The paper mentions that hyperparameter tuning is done by training on the Alpaca-52K dataset, that the final runs train on Ultrafeedback, and that the prompt template from https://github.com/tatsu-lab/stanford_alpaca is used.
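
For reference, these are the prompt templates I believe the paper refers to, quoted from train.py in the stanford_alpaca repo (worth double-checking against the repo itself):

PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}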

frankaging commented 3 weeks ago

Hey @savadikarc! Thanks for your questions.

For the results in the paper, please use only the scripts under the examples/loreft folder, for both hyperparameter tuning and final results! For Alpaca-Eval v1.0, we use LLaMA-1 for hyperparameter tuning and then apply the chosen hyperparameters to Llama-2, so that we do not overfit them. examples/alpaca is a minimal example of generic Alpaca-style finetuning.

For hyperparameter tuning with Alpaca-52K, we have this in our task config: https://github.com/stanfordnlp/pyreft/blob/main/examples/loreft/task_config.py#L55

You can specify this task in the args to load the datasets for training, etc. Let me know if you have other questions!
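
Roughly, the entry looks like the sketch below (a simplified illustration rather than the exact file contents; the field names here are assumptions, and the linked file is the source of truth):

# Hypothetical sketch of the "alpaca" entry in examples/loreft/task_config.py.
# The field names are assumptions for illustration; see the linked file for
# the actual structure.
task_config = {
    "alpaca": {
        "train_datasets": ["alpaca_data_cleaned"],  # assumed dataset identifier
        "eval_datasets": [],  # assumed: final quality is judged with Alpaca-Eval, not a held-out split
        "task_prompt_template": "%s\n\n### Response:",  # assumed template shape
        "trigger_tokens": "### Response:",  # assumed marker for where generation starts
    },
}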

savadikarc commented 3 weeks ago

Thanks for clarifying! I got confused because when I ran

python train.py -task alpaca \
    -data_dir dataset \
    -model meta-llama/Llama-2-7b-hf \
    -seed 42 -l "3;9;18;24" -r 4 -p f5+l5 -e 9 -lr 9e-4 \
    -type LoreftIntervention \
    -gradient_accumulation_steps 32 \
    -batch_size 4 \
    -eval_batch_size 2 \
    --test_split test \
    --use_normalized_template \
    --max_length 768

I got the error datasets.exceptions.DatasetNotFoundError: Dataset 'alpaca' doesn't exist on the Hub or cannot be accessed. Assuming that the code is meant to use the cleaned Alpaca data from https://github.com/aryamanarora/LLM-Adapters/blob/main/ft-training_set/alpaca_data_cleaned.json, minor changes solved the issue. I've opened a pull request with the changes here: https://github.com/stanfordnlp/pyreft/pull/108
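
For anyone hitting the same error, the gist of the workaround is to load the cleaned JSON from disk instead of resolving "alpaca" on the Hugging Face Hub. A minimal sketch (assuming the file is saved under dataset/; the actual change is in the PR):

from datasets import load_dataset

# Load the cleaned Alpaca data from a local JSON file instead of trying to
# fetch a dataset named "alpaca" from the Hugging Face Hub.
data_files = "dataset/alpaca_data_cleaned.json"  # assumed local path
train_dataset = load_dataset("json", data_files=data_files)["train"]

# Each record follows the Alpaca schema: instruction / input / output.
print(train_dataset[0])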

frankaging commented 3 weeks ago

@savadikarc Thanks for catching this! It looks like the alpaca task was missed when we refactored the dataset loading classes.