Open shaswati1 opened 4 months ago
In `eval_log_forget.json`, the "generated_text" key holds a list of 3-element lists. The 3rd element is just the original answer, and the 2nd is the one you should be looking for: it is the answer generated by the unlearned model. I have not run `grad_ascent`, but I ran `grad_diff` and got the result below.
["Question: What is the full name of the geology author born in Karachi, Pakistan on 06/30/1975?\n", "Answer and and and and and and and and and and and and and and and and and and the and and and and and and and and and the and and and and and and the and and and and and the and and and and and and the and and and and the and and and and the and and and and the and and and the and and and the and and and the and and and the and and and the and and and the and and and and the and and and the and and and the and and and and the and and and the and and and the and and and the and and and the and and and the and and and the and and and the and and and the and and the and and the and and and the and and the and and the and and the and and the and and the and and the and and the and and the and", "The author's name is Hina Ameen." ],
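For anyone else inspecting these logs, here is a minimal sketch of pulling out the model generations versus the ground-truth answers. The structure is assumed from the description above (each entry of "generated_text" is `[question, unlearned-model generation, ground-truth answer]`), and the sample entry is a stand-in, not real eval output:

```python
import json

def split_generations(eval_log):
    """Split eval-log entries into questions, generations, and ground truths.

    Assumes each entry of "generated_text" is a 3-element list:
    [question, unlearned-model generation, ground-truth answer].
    """
    questions, generations, truths = [], [], []
    for question, generated, truth in eval_log["generated_text"]:
        questions.append(question)
        generations.append(generated)  # 2nd element: what the unlearned model produced
        truths.append(truth)           # 3rd element: the original dataset answer
    return questions, generations, truths

# In practice you would load the real file, e.g.:
#   with open("eval_log_forget.json") as f:
#       eval_log = json.load(f)
# Here we use a stand-in entry instead:
eval_log = {
    "generated_text": [
        ["Question: What is the full name of ...?\n",
         "Answer and and and the and ...",
         "The author's name is Hina Ameen."],
    ]
}
_, gens, truths = split_generations(eval_log)
print(gens[0])    # the unlearned model's (degenerate) output
print(truths[0])  # the ground-truth answer
```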
Did you get the above results on llama2? I am also looking at the 2nd element of that list and I seem to get the same as the ground truth. Can you share your eval stats with me if possible?
I'm working with Phi-1.5. I'm not sure what you mean by eval stat. One disclaimer, though: my results are from before the recent refactor of the eval code. That refactor introduced a bunch of issues, so I reverted it. It is possible that could have made a difference.
Looks like the latest eval_log_forget files look very different: https://github.com/locuslab/tofu/blob/main/data/ft_epoch5_lr1e-05_llama2-7b_full_wd0.01/eval_results/ds_size300/eval_log_forget.json
My generated answers from before and after the refactor look similar, though the aggregate_stat is slightly different. By eval stat I mean the aggregate_stat, where you see scores like forget quality and model utility.
I see. It would be good to understand what exactly the refactor changed. @zhilif?
@shaswati1 can you run `grad_diff` and check the generations? That will tell us whether the issue is with the method or with something else you are doing, since `grad_diff` definitely works for me (with Phi). Maybe you can also try Phi instead of Llama.
One thing we noticed is that llama2 results are not exactly reproducible when flash_attention is enabled.
I finetuned llama2 on the full dataset, ran gradient ascent on forget05, and then evaluated the unlearned model on forget05. Surprisingly, when I looked at the eval_log_forget.json file, all I could see was that it generates the responses exactly as they appear in the dataset. For example:
Question: What is the full name of the geology author born in Karachi, Pakistan on 06/30/1975?
Answer: The author's name is Hina Ameen.
Generated Answer: The author's name is Hina Ameen.
Also, the p-value is substantially low (7.82e-19). Am I interpreting the evaluated results correctly?
How many steps did you train for? Also, is the p-value tested against the retain model? A small p-value means this model is very different from the retain model, which should be the case in your scenario.
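For context, the forget-quality p-value comes from a two-sample KS test comparing the unlearned model's truth-ratio distribution against the retain model's (assuming the standard TOFU evaluation). A minimal sanity check can be sketched like this, with made-up truth ratios standing in for the real eval-log values:

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical truth-ratio samples; replace with the per-question
# truth ratios from your unlearned-model and retain-model eval logs.
rng = np.random.default_rng(0)
unlearned_ratios = rng.normal(0.2, 0.05, size=300)  # stand-in: unlearned model
retain_ratios = rng.normal(0.5, 0.05, size=300)     # stand-in: retain model

stat, p_value = ks_2samp(unlearned_ratios, retain_ratios)
print(p_value)  # very small when the two distributions clearly differ
```

A tiny p-value (like the 7.82e-19 above) just says the unlearned model's distribution differs sharply from the retain model's; whether that is good or bad depends on which direction you expect.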