princeton-nlp / LESS

[ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning
MIT License

Differences in Model Performance When Reproducing Experiment #32

Open fannie1208 opened 1 month ago

fannie1208 commented 1 month ago

Hi, thank you for your nice work!

I'm reproducing the results in Table 2, using the Mistral-7B model on MMLU and TydiQA and selecting 5% of the data.

[screenshot attached]

I followed the scripts in your repo for the warmup, data selection, and training, and used the evaluation code in your repo for evaluation. I did not change any settings in your scripts, other than using a random seed of 3.

Despite following these settings, my model performs worse than the results in Table 2. For MMLU, Random reaches 58.3 (60.0 in your paper) and LESS reaches 60.8 (61.8 in your paper). For TydiQA, the F1 of Random is 44.6 and LESS is 55.1.

My environment: torch 2.4.0, transformers 4.45.2, peft 0.13.1, datasets 3.0.1.

Are these differences reasonable? Could you please confirm whether the settings in your scripts are fully aligned with those used in the paper?

Thanks.

cafeii commented 1 month ago

I'm not the author, but I'm also reproducing this paper.

I'm quite confused about the default settings in their scripts. If you changed nothing, step 2 would build a gradient datastore with only 200 samples per dataset, but for a full reproduction you should use the entire dataset (am I right?).

Would you mind sharing your results from steps 2, 3, and 4 (e.g., logging info, gradient datastore size, or anything else)? That would help pinpoint where the problem is.
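For example, a quick check like the sketch below would show the datastore size; this assumes your step-2 gradients ended up in a single PyTorch tensor file, and the path is just a placeholder for whatever your run actually produced.

```python
import torch

# Placeholder path: point this at the .pt file your step-2 run actually wrote.
grad_file = "path/to/grad_datastore.pt"

grads = torch.load(grad_file, map_location="cpu")
# With '--max_samples 200' removed, the first dimension should equal the
# full dataset size rather than 200 per dataset.
print("datastore shape:", tuple(grads.shape))
```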

fannie1208 commented 1 month ago

Hi, I used their script for computing gradients but deleted '--max_samples 200' so that it builds a gradient datastore over all data samples.

Also, for the batch size, I used the default setting in their scripts and got checkpoints 422, 845, 1268, and 1688.

cafeii commented 1 month ago

I guess you probably ran the script on a single GPU.

The authors report a batch size of 128, so one epoch over the 4 datasets should take about 105 steps, which matches CKPT=105 in the step 2 tutorial. Given the default settings in the scripts (per_device_train_batch_size=1 and gradient_accumulation_steps=32), they probably ran these experiments on 4 GPUs. If you run them on a single GPU, the effective batch size is only 32, which may make the results worse.
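As a rough sanity check of that arithmetic (the ~13.5k warmup example count below is only inferred from your checkpoint numbers, not an official figure from the paper):

```python
import math

# Effective batch size = per-device batch * grad accumulation * number of GPUs.
# The example count is inferred from the checkpoints in this thread
# (~422 steps/epoch at batch size 32), not taken from the paper.
num_warmup_examples = 13_500
per_device_train_batch_size = 1
gradient_accumulation_steps = 32

for num_gpus in (4, 1):
    effective_bs = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
    steps_per_epoch = math.ceil(num_warmup_examples / effective_bs)
    print(f"{num_gpus} GPU(s): effective batch size {effective_bs}, "
          f"~{steps_per_epoch} steps per epoch")

# 4 GPUs -> effective batch size 128, ~106 steps/epoch (close to CKPT=105)
# 1 GPU  -> effective batch size 32,  ~422 steps/epoch (matches CKPT=422)
```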

fannie1208 commented 1 month ago

Yes, I ran it on a single GPU, but I don't think the batch size alone should affect the results this much.