hitz-zentroa / GoLLIE

Guideline following Large Language Model for Information Extraction
https://hitz-zentroa.github.io/GoLLIE/
Apache License 2.0

CoNLL F1 Evaluation #15

Closed: edchengg closed this issue 6 months ago

edchengg commented 6 months ago

Hi, does the span F1 score in your evaluation script consider the span index, similar to https://github.com/chakki-works/seqeval? That is, 'spaceX' vs. ('spaceX', 0, 1)?

If not, how should I compare the CoNLL F1 score with the literature? Thanks!

https://github.com/hitz-zentroa/GoLLIE/blob/main/src/tasks/utils_scorer.py#L44

ikergarcia1996 commented 6 months ago

Hi @edchengg!

GoLLIE only outputs labeled spans. We remove every labeled span with a category that is not defined in the prompt and any labeled span that is not part of the input sentence. However, we do not consider the labeled span index in the evaluation: we compute the F1 score based solely on the gold and predicted spans and categories. We use a strict-match approach, which means that, in your example, if GoLLIE predicts Space and the gold is SpaceX, we consider the prediction wrong. Similarly, if GoLLIE predicts Location("York") and the gold is Location("New York"), the prediction is considered wrong: Location("York") is counted as a false positive and Location("New York") as a false negative. This approach aligns with how seqeval computes the F1 score (span-level F1), making our results comparable with previously reported CoNLL F1 scores.

The only issue is that if the exact same span of text appears twice in the sentence and is labeled with a different label each time, we don't consider the index of each prediction. For example, given the sentence "Paris (Hilton) went to Paris.", both Person(Paris), Location(Paris) and Location(Paris), Person(Paris) would be considered correct and the F1 score would be 1.0. However, this is an extremely rare occurrence that probably never happens in CoNLL.
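
For reference, here is a minimal sketch of this kind of set-based, strict-match scoring (illustrative only, not the actual implementation in src/tasks/utils_scorer.py):

```python
# Illustrative sketch: span-level, strict-match F1 over (label, span text)
# pairs. Character offsets are never used, which is also why the duplicate
# surface-form edge case described above collapses.
def span_f1(gold: list[tuple[str, str]], pred: list[tuple[str, str]]) -> float:
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)  # exact label + span text matches
    fp = len(pred_set - gold_set)  # predicted but not in gold
    fn = len(gold_set - pred_set)  # in gold but not predicted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Location("York") vs. gold Location("New York"): one FP and one FN, F1 = 0.0.
print(span_f1(gold=[("Location", "New York")], pred=[("Location", "York")]))

# Same surface form with swapped labels still scores 1.0, because position is ignored.
print(span_f1(gold=[("Person", "Paris"), ("Location", "Paris")],
              pred=[("Location", "Paris"), ("Person", "Paris")]))
```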

edchengg commented 6 months ago

Thanks for clarifying! I guess in many IE tasks the official eval scripts consider the span index, so I was a bit confused when I saw the eval code and wasn't sure how GoLLIE handles that by just generating a word as the output.

Right, it's not common. I just want to make sure my numbers are comparable/aligned with the other experiments I ran, which used the official seqeval metric.

BTW, if I want to locate the entities/triggers in the sentence, do you have any suggestions on how to do it? I'm thinking of applying a simple heuristic: string matching that follows the prediction order. For example,

"In New York City ..... New York Department of Justice..."

Assume the model extracts {New York = Location, New York = Organization}: how do I know which prediction corresponds to which occurrence in the sentence?

osainz59 commented 6 months ago

Typically, simple string matching should be enough. Take into account that GoLLIE was trained to predict the entities in the same order as they appear in the sentence; therefore, you should map them following the order in which they were predicted.
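
Something along these lines should work. A minimal sketch of the order-based mapping (`locate_spans` is a hypothetical helper, not part of the GoLLIE codebase):

```python
# Hypothetical helper (not part of GoLLIE): map each predicted (label, text)
# pair to character offsets by searching left to right, advancing a cursor so
# repeated surface forms are matched to successive occurrences.
def locate_spans(
    sentence: str, predictions: list[tuple[str, str]]
) -> list[tuple[str, str, int, int]]:
    located = []
    cursor = 0
    for label, text in predictions:
        start = sentence.find(text, cursor)
        if start == -1:
            start = sentence.find(text)  # fall back to a global search
            if start == -1:
                continue                 # span not in the sentence, skip it
        located.append((label, text, start, start + len(text)))
        cursor = start + 1               # move past this match's start
    return located


sentence = "In New York City ... New York Department of Justice ..."
preds = [("Location", "New York"), ("Organization", "New York")]
print(locate_spans(sentence, preds))
# [('Location', 'New York', 3, 11), ('Organization', 'New York', 21, 29)]
```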

edchengg commented 6 months ago

Nice! Thanks :D We will follow the eval script in GoLLIE.