princeton-nlp / MQuAKE

[EMNLP 2023] MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions
https://arxiv.org/abs/2305.14795
MIT License

Can you release the codes for evaluation and training hyperparameters? #6

Open ShuoZhangXJTU opened 1 year ago

ShuoZhangXJTU commented 1 year ago

I am trying to reproduce the results on MQuAKE-T and found that the multi-hop results for "Base" are much lower (16.22/22.59 for multi-hop and CoT) than reported in Table 4. I cannot reproduce the FT results either.

Could you release your evaluation code and the training hyperparameters you used, so the results can be reproduced?

a3616001 commented 11 months ago

Hi @ShuoZhangXJTU !

Sorry for the evaluation issue! The bug is that the MQuAKE-T dataset we released previously didn't contain the extended pre-edit gold answers (Appendix E in our updated version). This causes much lower performance for the base model because of the time mismatch between its training corpus and our Wikidata dump.

I have updated the dataset, and MQuAKE-T now includes a new field answer_extended, which we used in our experiments. You should also use this field when evaluating the base model before editing.

For FT results: we use the same hyperparameters as MEMIT did.
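Since the evaluation script itself isn't posted in this thread, here is a minimal sketch of how the pre-edit base-model accuracy could be computed using the new answer_extended field. Only answer_extended is confirmed above; the other field names (questions, answer, answer_alias), the dataset path, and the "any question phrasing counts" criterion are assumptions about the dataset layout, not the authors' released evaluation code.

```python
import json

def is_correct(prediction: str, case: dict) -> bool:
    """Check a prediction against all acceptable pre-edit gold answers.

    Accepts the original answer, any alias, or any extended pre-edit
    answer (the new answer_extended field). Field names other than
    answer_extended are assumptions about the dataset layout.
    """
    gold = (
        [case["answer"]]
        + case.get("answer_alias", [])
        + case.get("answer_extended", [])
    )
    return any(prediction.strip().lower() == g.strip().lower() for g in gold)

def multihop_accuracy(cases: list[dict], predict) -> float:
    """`predict` maps a multi-hop question string to the model's answer."""
    correct = 0
    for case in cases:
        # Assumed criterion: a case counts as correct if any of its
        # question phrasings is answered correctly.
        if any(is_correct(predict(q), case) for q in case["questions"]):
            correct += 1
    return correct / len(cases)

if __name__ == "__main__":
    # Path is an assumption about the repo's dataset location.
    with open("datasets/MQuAKE-T.json") as f:
        cases = json.load(f)
    # Plug in your own model wrapper for `predict` and call
    # multihop_accuracy(cases, predict) to score the base model.
```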

ShuoZhangXJTU commented 10 months ago

Hi Zexuan,

Thank you for the update! I will use the latest version then.

Best regards,

Shuo
