[P1] Unable to replicate results from paper for RoBERTa Base for Glue tasks like CoLa

m-dev12 commented 1 week ago

Using below configuration but unable to replicate paper results. Is there anything different that the authors have done in the paper? Got {'validation_matthews_correlation': 0.40104291315665774} finally instead of ~61%. Should SEED, or any other configuration be updated? Would be great if authors could share wandb logs for this as well. Thanks!

python train.py \ -task glue \ -data_dir ./data \ -train_dataset cola \ -eval_dataset cola \ -model FacebookAI/roberta-base \ -seed 42 \ -l all \ -r 1 \ -p f3 \ -e 60 \ -lr 4e-4 \ -type LoreftIntervention \ -batch_size 32 \ -output_dir ./output \ -schedule linear \ -wu 5e-3 \ -logging_steps 20 \ -allow_cls_grad \ -metric_for_best_model matthews_correlation \ -dropout 0.2 \ -test_split validation @frankaging

frankaging commented 1 week ago

@m-dev12 Hey! Thanks for the inputs! Here are some pointers about reproducing our results.

Firstly, could you run the following command and see if you can get up to ~64%? (the raw log for this run is attached in output.log)

python train.py \
-task glue -train_dataset cola -model FacebookAI/roberta-base \
-seed 45 \
-l all -r 1 -p f3 -e 60 -lr 4e-4 \
-type LoreftIntervention \
-gradient_accumulation_steps 1 \
-batch_size 32 -eval_batch_size 32 \
-test_split validation -max_length 256 \
--metric_for_best_model matthews_correlation \
--dropout 0.2 --weight_decay 0.00000 \
--warmup_ratio 0.005 --logging_steps 20 \
--allow_cls_grad

Note that here, we (1) use seed 45 which is one of the seeds we used; (2) -max_length 256. We will clarify the max length setting in our next revision in the paper. We follow this paper for the max length setting to ensure a fair comparison.

As noted in the paper (attached below, on pg. 23 of our paper), we use {42,43,44,45,46} for all GLUE datasets, but we replace a couple of seeds for RTE and CoLA due to instability. Due to this, you could report mean or median (we report this in Table 16 on pg. 31) to compare to our method or others.

Sorry, we have not organized our GLUE wandb logs. We might consider to rerun those and release a set of wandb logs later if we have time.

Let me know if these help! and let me know if you have more questions! Thanks!

m-dev12 commented 1 week ago

Thank you for the detailed response @frankaging. I did try out your command, and it gives me ~61.6% (attached logs), but it is still different than your logs at ~64%. Command used: train.py -task glue -train_dataset cola -model FacebookAI/roberta-base -seed 45 -l all -r 1 -p f3 -e 60 -lr 4e-4 -type LoreftIntervention -gradient_accumulation_steps 1 -batch_size 32 -eval_batch_size 32 -test_split validation -max_length 256 --metric_for_best_model matthews_correlation --dropout 0.2 --weight_decay 0.00000 --warmup_ratio 0.005 --logging_steps 20 --allow_cls_grad

As for my environment, I am using pyvene==0.1.1 and pyreft==0.0.6. Note that pyvene==0.1.2 did not work with Roberta(giving out an error for additional arg use_cache passed.) I believe it is related to this: https://github.com/stanfordnlp/pyvene/pull/152

Is there anything you recommend me checking, as I believe ideally I should be replicate your logs.

output (2).log

frankaging commented 1 week ago

@m-dev12 Thanks for the follow up!

I think it might boil down to different random state of the machine, which is hard to control, especially given how unstable it could be for datasets such as CoLA and RTE (e.g., even with the same random seed, there could be discrepancies across machines).

You can try to follow this ticket to create an exact same env as we have locally, and test out if it helps: https://github.com/stanfordnlp/pyreft/issues/102. Minor: note that ~61.6% is close to our number we reported in the paper (60.4) given the instability of the setup. I would also recommend to try different seeds e.g. 46/43/44, etc..

m-dev12 commented 1 week ago

Thanks again @frankaging!!. Yes, I understand and yes the results are close to the reported results in the paper. As for the environment, as I mentioned pyvene == 0.1.2 gives out an error with RoBERTa, "forward() got an unexpected keyword argument 'use_cache'".

This is related to this fix = https://github.com/stanfordnlp/pyvene/pull/152 If I comment out these changes then RoBERTa worked with 0.1.2. So, for now I just reverted to 0.1.1 pyvene version to run these experiments. I should probably create an issue in the pyvene github repo for this..

frankaging commented 1 week ago

@m-dev12 Thanks! Opening an issue would help a lot, as I am slowly ramping up my workload on these two repos again for the summer!

m-dev12 commented 1 week ago

Thanks @frankaging! I have opened an issue on Pyreft.

On a side note, could you please share commands with hyperparam config for other Glue tasks like MNLI, QNLI for RoBERTa, just in case there are any nuances not mentioned in the paper and since CoLA is a little unstable. For instance I am using this below config: python train.py \ -task glue -train_dataset mnli -model FacebookAI/roberta-base \ -seed 45 \ -l all -r 1 -p f1 -e 40 -lr 6e-4 \ -type LoreftIntervention \ -gradient_accumulation_steps 1 \ -batch_size 32 -eval_batch_size 32 \ -test_split validation_matched -max_length 256 \ --warmup_ratio 0.06 --logging_steps 20 \ --dropout 0.05 --weight_decay 0.00000 \ --allow_cls_grad \ --metric_for_best_model accuracy

Thanks again!.

frankaging commented 1 week ago

@m-dev12 Hey, thanks. The whole hyperparameter search space is outlined in Table 8 on pg. 25; Individual task hyperparameter configuration is outlined on the page that follows from Table 9 to Table 12. And we use 256 as our maximum sequence length for all tasks.

I double checked, and I think most of the nuances are mentioned already, but maximum sequence length is indeed missing (and probably the only one? i think). We wrote this on pg. 23 last paragraph, which causes confusion:

We follow Wu et al. [2024a]’s setting for evaluation.

Thus, we will add another sentence after this sentence to further clarify that we also use the same maximum sequence length which is 256. Our loreft folder also gives one example command that has the maximum sequence length set to 256.

(If you want to do hyperparameter search as we did) You can also follow our hyperparameter search procedure:

Use one single seed to do the tuning (e.g., 42 is what we used).
I don't think we exhaustively search all the combinations outlined in Table 8. We mainly conduct hyperparameter search by changing one parameter at a time. But, exhausting all the combinations might give extra performance boost.
If just for replication, you can skip Table 8, and use per task hyperparameters.

m-dev12 commented 1 week ago

@frankaging Sure, yes I have taken everything from the appendix of the paper, will double check any additional details from Wu et al (2024). Thanks a lot!

stanfordnlp / pyreft

[P1] Unable to replicate results from paper for RoBERTa Base for Glue tasks like CoLa #114