When I reproduced ToFU's performance on llama2-finetuned (origin) and llama2-chat (retain), I found a significant difference in model utility between the two. However, on the leaderboard these values are almost identical. If an unfinetuned model can also perform this well on ToFU, does that imply some degree of data leakage? I look forward to your response. Thank you for your time.
My result: llama-chat (finetune) gets a Model Utility of 0.2487 on every forget split, while the retained version gets a Model Utility of 0.624.