shizhediao / R-Tuning

[NAACL 2024 Outstanding Paper] Source code for the NAACL 2024 paper entitled "R-Tuning: Instructing Large Language Models to Say 'I Don't Know'"
https://arxiv.org/abs/2311.09677

Training configuration and hardware spec #8

Open bangawayoo opened 3 months ago

bangawayoo commented 3 months ago

Hi! Congratulations on a very interesting work, and thank you for releasing the code :)

I am running some experiments and would like to reproduce some of your results. I have a few questions about the training configuration.

  1. From reading the instructions, I assume you did full finetuning. Could you confirm this?

  2. When training the 7B model using LMFlow, I run into CPU OOM on a server with 220GB of RAM. I believe this is abnormal and may be a problem on my side. If you recall how much CPU memory was required, could you tell me?

  3. Which LLaMA weights did you use? If you used the ones on Hugging Face, could you tell me the repo IDs?

Thanks.

hanningzhang commented 3 months ago

Thank you for your questions.

  1. Yes, we are doing full finetuning for all the models.
  2. Finetuning the 7B models usually consumes about 215GB of CPU memory, so 220GB may be tight. You may try ZeRO-2 if ZeRO-3 is consuming too much CPU memory (a minimal config sketch is below).
  3. We are using huggyllama/llama-7b, huggyllama/llama-13b, and openlm-research/open_llama_3b on HuggingFace.
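
For reference, a ZeRO-2 setup without CPU offload looks roughly like the sketch below. This is a minimal illustration (exact values and the wrapper depend on your DeepSpeed/LMFlow version), not our exact config; the dict can be passed to the Hugging Face Trainer's `deepspeed` argument or dumped to a JSON file.

```python
# Minimal DeepSpeed ZeRO-2 config without CPU offload (sketch, not the exact file).
# "auto" values are resolved by the Hugging Face Trainer integration.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,                    # ZeRO-2: shard optimizer states and gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_scatter": True,
        # No "offload_optimizer" / "offload_param" entries, so optimizer states
        # stay on GPU instead of spilling into CPU RAM as with ZeRO-3 + offload.
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}
```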
bangawayoo commented 2 months ago

Thanks for the reply. I was able to run full finetuning by using ZeRO-2 without offload!

bangawayoo commented 2 months ago

Hi, I am trying to replicate the results following your reply, but I still need some help.

All the models were trained for 1 epoch using full finetuning with lr=2e-5. For the 7B model on ParaRel-ID, I obtained an AP score of 0.84.

For the 3B model on the same dataset, the results were closer to those of the paper, with a score of 0.90.

Oddly, the data distribution obtained from the supervised identification strategy (Figure 6) seemed correct for the 3B model but slightly off for the 7B model. For 7B-ParaRel, I obtained 40.4% certain data, which is slightly lower than the 42% reported in the figure.
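
For context, I compute the certain/uncertain split roughly as in the sketch below; `generate_answer` and the other names are my own stand-ins, not your code, in case my implementation differs from yours.

```python
# Rough sketch of how I compute the fraction of "certain" data for Figure 6.
# An example counts as "certain" if the model's generated answer contains the
# ground-truth string (case-insensitive substring match).
def is_certain(prediction: str, ground_truth: str) -> bool:
    return ground_truth.strip().lower() in prediction.strip().lower()

def certain_fraction(examples, generate_answer) -> float:
    # `examples` is a list of {"question": ..., "answer": ...} dicts;
    # `generate_answer` is my own greedy-decoding helper.
    hits = sum(
        is_certain(generate_answer(ex["question"]), ex["answer"])
        for ex in examples
    )
    return hits / len(examples)
```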

To estimate the confidence, the paper mentions a weighted average of the "{sure, unsure}" token probability and the token probability of the answer prediction. calculate_ap.py uses a plain average (0.5 * sample[1] + 0.5 * sample[2]). Is this the correct implementation?
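
To make sure I am reading it correctly, here is my paraphrase of that line (`sample[1]` / `sample[2]` are the names from the script; the function is mine):

```python
# My reading of the score used by calculate_ap.py:
#   p_sure   -> probability of the "sure" token under the {sure, unsure} prompt (sample[1])
#   p_answer -> token probability of the predicted answer (sample[2])
def combined_confidence(p_sure: float, p_answer: float) -> float:
    # A plain 0.5/0.5 average rather than a tuned weighting.
    return 0.5 * p_sure + 0.5 * p_answer
```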

Do you have a guess as to what might be the cause? I would really appreciate the help!

bangawayoo commented 2 weeks ago

@hanningzhang, @shizhediao Hi, do you have any updates on this?

hanningzhang commented 2 weeks ago

Thank you for your follow-up questions and sorry for the late reply.

For the first question, there is some randomness involved. For example, the max_new_tokens setting may affect whether the generated answer contains the ground truth or not. Based on our experiments, a 1% difference in the distribution will not make much difference.

For the second question, yes, we use 0.5/0.5 as the weights for our results, and we find this combination effective for ranking.
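
As a minimal sketch of the evaluation (assuming scikit-learn is available; the helper below is illustrative, not our exact script), the combined score is simply used to rank predictions and compute average precision:

```python
from sklearn.metrics import average_precision_score

def ap_from_scores(labels, p_sure, p_answer):
    # labels[i]   : 1 if example i's answer is correct, else 0
    # p_sure[i]   : probability of the "sure" token
    # p_answer[i] : token probability of the predicted answer
    scores = [0.5 * s + 0.5 * a for s, a in zip(p_sure, p_answer)]
    return average_precision_score(labels, scores)
```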