princeton-nlp / MQuAKE

[EMNLP 2023] MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions
https://arxiv.org/abs/2305.14795
MIT License

How to use Mello for batch? #4

Closed Eureka-Maggie closed 11 months ago

Eureka-Maggie commented 1 year ago

I don't see in mello.ipynb how I should use batching to test the results. How do I "consider a batch of k instances at once, where k ∈ {1, 100, 1000, 3000}" as in Table 5? Thanks a lot, looking forward to your reply!

chuckhope commented 1 year ago

I don't quite understand how to apply MeLLo to a batch of k instances either; intuitively, it seems the batch would exceed the length of the input window. On the other hand, if each question is handled separately, why does the score drop as k grows?

a3616001 commented 1 year ago

Hey @Eureka-Maggie @chuckhope! Sorry for the late response. The MeLLo code we released (run_mello.ipynb) is actually for k=3000: we build an edited-fact memory (a retrieval index) over all the edits from the 3000 instances.

The case with a smaller k simply means that we consider a subset of instances at a time. The retrieval index is then smaller, so retrieval is more accurate. This should be very straightforward to test based on our k=3000 code (for the k=100 case, just sample a subset of 100 instances).

Note that a batch of k instances does not mean that we retrieve multiple edits from the index: we only consider the top-1 edit regardless of the memory size.
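A minimal sketch of that top-1 behavior, with a toy bag-of-words cosine standing in for the dense retriever actually used in run_mello.ipynb (all names here are illustrative, not from the repo):

```python
# Toy edited-fact memory with top-1 retrieval: no matter how many
# edits are stored, only the single most similar one is returned.
# A bag-of-words cosine stands in for the real dense retriever.
from collections import Counter
import math

def cosine(a: str, b: str) -> float:
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    num = sum(ca[w] * cb[w] for w in ca)
    den = math.sqrt(sum(v * v for v in ca.values())) * \
          math.sqrt(sum(v * v for v in cb.values()))
    return num / den if den else 0.0

def retrieve_top1(query: str, memory: list[str]) -> str:
    # the top-1 edit is used, whatever the memory size
    return max(memory, key=lambda fact: cosine(query, fact))

memory = [
    "The capital of France is Lyon.",           # counterfactual edits
    "The author of Misery is Richard Dawkins.",
]
print(retrieve_top1("What is the capital of France?", memory))
```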

AOZMH commented 8 months ago

Thanks for the explanation! What still confuses me is: in the case where k<3000, how did you sample the k edited facts from all 3000? Is any similarity- or semantic-based selection adopted? Is the correct edit for each question guaranteed to exist among the k selected edits? I assume it is, since in Table 5 MeLLo's score is highest at k=1, but I'm not sure how this guarantee is achieved technically.

Hope to get your reply, thanks!

a3616001 commented 8 months ago

Hi @AOZMH, I may not fully understand your question. To be concrete, for the k<3000 cases, say k=100, we randomly split the 3000 instances into 30 groups of 100 instances each. That gives us 30 disjoint evaluation groups, and we build the retrieval memory from all edits associated with the k instances in a group.

Hope it helps - but let me know if I misunderstand your question!
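For concreteness, here is a sketch of that protocol; the `edits` field name is my assumption for however the edited facts attach to an instance, not the dataset's actual schema:

```python
# Sketch of the k<3000 protocol: randomly split the 3000 instances
# into disjoint groups of size k, then build one retrieval memory
# per group from the edits of that group's instances only.
import random

def make_groups(instances, k, seed=0):
    idx = list(range(len(instances)))
    random.Random(seed).shuffle(idx)  # random, disjoint split
    return [[instances[i] for i in idx[j:j + k]]
            for j in range(0, len(idx), k)]

def build_memory(group):
    # flatten every edit attached to the group's instances
    return [edit for inst in group for edit in inst["edits"]]

instances = [{"edits": [f"edit-{i}"]} for i in range(3000)]
groups = make_groups(instances, k=100)
print(len(groups), len(groups[0]))  # 30 groups of 100 instances each
```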

AOZMH commented 8 months ago

Thanks for the reply and example! Following your example, let me restate my question:

Assume the 3000 edits are named <e1, e2, ..., e3000> and split into 30 groups sequentially (i.e., group_1 = <e1, e2, ..., e30>, group_2 = <e31, e32, ..., e60>, etc.). We are now dealing with an instance in MQuAKE involving two edits, e1 and e31, which fall into two different edit groups.

In this case, since we only choose one group of k edits as the fact memory, MeLLo's retriever would be unable to retrieve all the required edits (either e1 or e31 would be excluded from the fact memory, or possibly both).

As stated in Section 5.2,

... in order to answer multi-hop questions correctly after editing, the retriever we use in MeLLo needs to retrieve all the associated edited facts from the memory.

I'd like to know how you handled such cases (when the fact memory fails to contain all the required edits). Did you adopt any edit-sampling scheme to guarantee that all the required facts are included in the k-edit fact memory?

Thanks again for your help, and I look forward to your reply!

AOZMH commented 8 months ago

[Update] After further research into the paper, I finally realized that the problem above was caused by my misunderstanding of the term "instance". While I took an "instance" to mean a single editing statement, it actually means an entry in the MQuAKE dataset. So in the case where k<3000, all edits corresponding to the k dataset entries (each entry contains 1 to 4 edits) form the fact memory.
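That instance-vs-edit distinction can be shown in a few lines; the `edits` field name is illustrative, not the dataset's actual schema:

```python
# "Instance" = one MQuAKE dataset entry, which bundles 1-4 edits.
# A fact memory built from k entries therefore holds between
# k and 4k edited facts, covering every edit those entries need.
entries = [
    {"edits": ["fact A"]},                      # entry with one edit
    {"edits": ["fact B", "fact C", "fact D"]},  # entry with three edits
]
memory = [edit for entry in entries for edit in entry["edits"]]
print(len(entries), len(memory))  # 2 entries -> 4 edits in memory
```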

Sorry for the misunderstanding & thanks again for the help!