Hi, thank you so much for your interest in our work!
1) Your assumption is correct for the first part of the results. As (at the time of our writing) this was the first time in-context few-shot learning had been leveraged for document-level relation extraction, we ran this controlled experiment to see whether it is effective enough to justify further exploration. (Of note, we evaluated REBEL under the same assumption for this part.) Please keep in mind that for the second part of the results, we evaluated the methods on all the dev docs and all relation types, also leveraging an external knowledge base to check correctness. By the way, we later found that some numbers in the second part were incorrect (due to a bug in the evaluation code); this will be fixed in the next version of our preprint.
2) I am not sure I completely followed this question, because it might be more relevant for BERT-like models. Still, let me try to elaborate, hoping that it answers your question. In DocRED, relations are annotated with their subject and object entities, and each entity is annotated with one or more mentions. So, in our evaluation, we count an extracted relation as correct if both the generated subject and object "completely" match any of the mentions of the corresponding entities.
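To make this concrete, here is a minimal sketch of that matching check (the function and variable names are hypothetical and not taken from our released code):

```python
# Sketch of the mention-matching check described above; names are hypothetical
# and do not correspond to the actual evaluation script.

def is_correct(pred, gold_triplets, entity_mentions):
    """pred: (relation, subject_text, object_text) generated by the model.
    gold_triplets: list of (relation, subject_entity_id, object_entity_id).
    entity_mentions: dict mapping an entity id to the set of its annotated mention strings."""
    rel, subj_text, obj_text = pred
    for g_rel, g_subj, g_obj in gold_triplets:
        if (rel == g_rel
                and subj_text in entity_mentions[g_subj]   # subject matches some annotated mention exactly
                and obj_text in entity_mentions[g_obj]):   # object matches some annotated mention exactly
            return True
    return False
```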
3) This confusion is caused by different wording for what is actually the same metric, which we will also fix in the next iteration. It should be called micro-F1, i.e., aggregating all the TPs, FPs, FNs, etc. for the final calculation. Of note, we do not count no-relation in this computation, which could be the cause of the mismatch. We deliberately did not use no-relation because it requires enumerating all the entities and their paired combinations for the evaluation, whereas our proposal aims to eliminate that need.
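Roughly speaking, the aggregation looks like the following (just an illustrative sketch, not the exact evaluation code):

```python
# Illustrative micro-F1 aggregation: TPs, FPs, and FNs are summed over all
# relation types before computing precision/recall; the artificial no-relation
# class is simply absent from these counts.

def micro_f1(counts):
    """counts: dict mapping relation type -> (tp, fp, fn), with no-relation excluded."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```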
Again, thank you so much for reaching out to us about your questions!
Thank you very much for your answer. I am still very confused about the last two sentences in paragraph 3. Could you please explain them in detail again? Thanks!
Hi, thanks for the follow-up! I will try to explain briefly here, noting that it could also be useful to check the Problem Description section in the paper.
Let's assume that there are $N$ entities in a document, and there are $R$ relation types in total.
The usual setup classifies the relation between every pair of entities, i.e., $O(N^2)$ classifications (multi-class is possible, but let's ignore that for simplicity). These works usually introduce an artificial $(R+1)$-th relation, which stands for no-relation. Therefore, if there are only $M$ actual knowledge triplets in a document, where $M \ll N^2$, the remaining $O(N^2) - M$ entity pairs should be classified as the $(R+1)$-th relation (i.e., no-relation). As this $(R+1)$-th relation becomes the majority class (due to the natural sparsity of the other relation types in documents), it inflates the micro-F1 score.
When it comes to our work, the main point is to eliminate the need to enumerate named entities in advance. As a result, we evaluate only on these $M$ knowledge triplets, i.e., there is no $(R+1)$-th relation. If our framework generates (rel, sub, obj) triplets that do not exist, they are simply counted as false positives. Analogously, one can treat non-generated entity pairs that do not share any relation as true negatives. This is different from other works that treat them as true positives of the $(R+1)$-th relation, which inflates the micro-F1 score as described above.
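As a toy illustration of this point (the numbers below are made up and only meant to show the arithmetic, assuming one predicted label per entity pair):

```python
# Toy numbers: a document with N = 50 entities has 50 * 49 / 2 = 1225 entity pairs,
# of which only M = 25 carry a real relation.
pairs, real = 1225, 25
no_rel = pairs - real                      # 1200 pairs with no relation

tp = 15                                    # real relations predicted correctly
fn = 10                                    # real relations the model missed
fp = 5                                     # no-relation pairs wrongly assigned a relation

# Micro-F1 computed only over the real triplets (no-relation excluded):
p, r = tp / (tp + fp), tp / (tp + fn)
print(round(2 * p * r / (p + r), 3))       # 0.667

# If no-relation is treated as an ordinary (R+1)-th class with one label per pair,
# micro-F1 over all classes reduces to plain accuracy, and every correctly
# predicted no-relation pair counts as a true positive:
correct = tp + (no_rel - fp)               # 15 + 1195 = 1210
print(round(correct / pairs, 3))           # 0.988
```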
I hope these clarifications are helpful.
Edit: Actually, the number of entity pairs does not require an approximation; it is exactly $N \cdot (N-1) / 2$. I wrote $O(N^2)$ classifications above to account for the multi-class scenario, but I hope the overall message is clear.
Thanks for your quick and detailed reply!
Your work is enlightening and innovative. However, I still have questions about the experimental metrics in the paper, especially with REBEL as the baseline model. To some extent, I do not agree with this statement: "This is different from other works that treat them as true positives of the $(R+1)$-th relation, which inflates the micro-F1 score as described above." I think existing works, including REBEL, exclude 'no_relation' from the micro-F1 when evaluating their models. May I ask if you can provide some specific evidence to support this?
In addition, as you mentioned, the number of negative instances is much greater than the number of positive instances (97.1% of relation triplets express 'no-relation' in DocRED). So I think that if 'no_relation' were included in the evaluation, the micro-F1 would increase sharply.
Looking forward to your reply again.
Firstly, according to the file relation_docs_dev in the provided code, can I assume that in your method the distribution of relations for each test document is known?
Secondly, in your experiments, is the evaluation setting mention-level or entity-level? By entity-level evaluation, I mean that a mention is counted as correct if its span matches a ground-truth mention span. Finally, since I did not run experiments with REBEL's method, I am not sure whether the significant difference in F1 scores is caused by REBEL using micro-F1 while your method uses weighted-averaged F1.