locuslab / tofu

Landing Page for TOFU

eval generates answer same as dataset #16

Open shaswati1 opened 4 months ago

shaswati1 commented 4 months ago

I finetuned llama2 on the full dataset, ran gradient ascent on forget05, and then evaluated the unlearned model on forget05. Surprisingly, when I looked at the eval_log_forget.json file, all I could see was that it generates the responses exactly as they appear in the dataset. For example:

Question: What is the full name of the geology author born in Karachi, Pakistan on 06/30/1975?
Answer: The author's name is Hina Ameen.

Generated Answer: The author's name is Hina Ameen.

Also, the p-value is substantially low (7.82e-19). Am I interpreting the evaluated results correctly?

molereddy commented 4 months ago

In eval_log_forget.json, the "generated_text" key holds a list of 3-element lists. The 3rd element is just the original answer; the 2nd is the one you should be looking at, since it is the text generated by the unlearned model. I have not run grad_ascent, but I ran grad_diff and got the result below.

["Question: What is the full name of the geology author born in Karachi, Pakistan on 06/30/1975?\n", "Answer and and and and and and and and and and and and and and and and and and the and and and and and and and and and the and and and and and and the and and and and and the and and and and and and the and and and and the and and and and the and and and and the and and and the and and and the and and and the and and and the and and and the and and and the and and and the and and and and the and and and the and and and the and and and and the and and and the and and and the and and and the and and and the and and and the and and and the and and and the and and and the and and and the and and the and and the and and and the and and the and and the and and the and and the and and the and and the and and the and and the and", "The author's name is Hina Ameen."]
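A minimal sketch of pulling the model's own generation (rather than the stored answer) out of this layout, assuming the pre-refactor file format described above; the file path is illustrative:

```python
import json

# Each entry of "generated_text" is assumed to be a 3-element list:
# [question, model_generation, ground_truth_answer].
with open("eval_log_forget.json") as f:
    log = json.load(f)

for question, generated, ground_truth in log["generated_text"]:
    # The 2nd element is what the unlearned model actually produced;
    # compare it with the 3rd to see whether the answer is still memorized.
    if ground_truth.strip() in generated:
        print("Still verbatim:", question.strip())
```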

shaswati1 commented 4 months ago

Did you get the above results on llama2? I am also looking at the 2nd element of that list, and it seems to be the same as the ground truth. Can you share your eval stats with me if possible?

molereddy commented 4 months ago

I'm working with Phi-1.5. I'm not sure what you mean by eval stat. One disclaimer, though: my results are from before the recent refactor of the eval code. That refactor introduced a bunch of issues, so I reverted it. It is possible that could have made a difference.

molereddy commented 4 months ago

Looking at https://github.com/locuslab/tofu/blob/main/data/ft_epoch5_lr1e-05_llama2-7b_full_wd0.01/eval_results/ds_size300/eval_log_forget.json, it looks like the latest eval_log_forget files are very different.

shaswati1 commented 4 months ago

My results from before and after the refactoring look similar as far as the generated answers go, though the aggregate_stat is slightly different! By eval stat I mean the aggregate_stat, where you get to see scores like forget quality and model utility.
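A tiny sketch of inspecting those aggregated scores, assuming they are saved as a flat JSON dict; the file name and key names here are illustrative, not the repo's guaranteed layout:

```python
import json

# File name and keys are illustrative; check your own run's output for
# the exact names of the forget-quality and model-utility entries.
with open("aggregate_stat.json") as f:
    stats = json.load(f)

for name, value in sorted(stats.items()):
    print(f"{name}: {value}")
```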

molereddy commented 4 months ago

I see. It would be good to understand what exactly the refactor changed. @zhilif?

molereddy commented 4 months ago

@shaswati1 can you run grad diff and check the generations? That will tell us whether the issue is with the method or with something else in your setup, since grad diff definitely works for me (with Phi). Maybe you can also try Phi instead of Llama.
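When checking those generations, one quick way to tell the two failure modes apart is a heuristic like the hypothetical helper below (not part of the repo): it flags verbatim matches against the dataset answer versus degenerate repetition like the "and and and" output quoted earlier.

```python
from collections import Counter

def classify_generation(generated: str, ground_truth: str) -> str:
    """Rough heuristic, illustrative only: 'memorized' if the generation
    still contains the dataset answer, 'degenerate' if a single token
    dominates the output (like the repeated 'and'), else 'other'."""
    if ground_truth.strip() in generated:
        return "memorized"
    tokens = generated.split()
    if tokens and Counter(tokens).most_common(1)[0][1] > len(tokens) // 2:
        return "degenerate"
    return "other"
```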

zhilif commented 3 months ago

One thing we noticed is that llama2 results are not exactly reproducible when flash_attention is enabled.
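To rule that out, here is a minimal sketch of loading the model with flash attention disabled; it assumes a recent transformers version that supports the attn_implementation argument, and it is not the repo's own loading code:

```python
import torch
from transformers import AutoModelForCausalLM

# Force the standard ("eager") attention kernels instead of flash
# attention, trading speed for reproducible results.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # rather than "flash_attention_2"
)
```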

zhilif commented 3 months ago

How many steps did you train for? Also, is the p-value tested against the retain model? A small p-value means this model is very different from the retain model, which should be the case in your scenario, right?
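For context on that p-value: as I understand the TOFU evaluation, forget quality is the p-value of a two-sample Kolmogorov-Smirnov test comparing the unlearned model's truth-ratio distribution on the forget set with the retain model's. A rough sketch with placeholder data (the arrays stand in for the per-example statistics from the two eval logs):

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholder per-example truth ratios; in practice these come from the
# eval logs of the unlearned model and of the retain model.
rng = np.random.default_rng(0)
unlearned_truth_ratios = rng.uniform(0.0, 1.0, 200)
retain_truth_ratios = rng.uniform(0.0, 1.0, 200)

statistic, p_value = ks_2samp(unlearned_truth_ratios, retain_truth_ratios)
# A high p-value means the two distributions are statistically
# indistinguishable (good forgetting); a tiny one like 7.82e-19 means the
# unlearned model is still clearly distinguishable from the retain model.
print(f"forget quality (p-value): {p_value:.3g}")
```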