Closed: tommypolikj closed this issue 1 week ago
Hi @tommypolikj!
The WikiEvents dataset is evaluated with two different strategies: standard event argument extraction and "informative" event argument extraction. In the first, most of the arguments (around 99.5%) appear in the same sentence as the trigger. In the second, informative arguments are defined as follows: "We define name mentions to be more informative than nominal mentions, and pronouns to be the least informative" (e.g., "Kabul" is more informative than "the city", which is more informative than "it"), and such arguments usually lie outside the trigger's sentence.
We focus on the first scenario and split the dataset into sentences. The 573 examples refer to the number of sentences in the test set. You can obtain this sentence-level WikiEvents dataset by running our pre-processing script.
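For illustration, here is a minimal sketch of that kind of sentence-level split, assuming the gen-arg JSONL layout (field names such as `sentences`, `entity_mentions`, `event_mentions`, and `sent_idx` are assumptions based on that dump); this is not the repository's actual pre-processing script, whose logic may differ:

```python
import json


def sentence_split(path: str) -> list[dict]:
    """Split document-level WikiEvents into sentence-level examples,
    keeping only the arguments that appear in the trigger's sentence.
    Field names follow the gen-arg JSONL dump (an assumption here)."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            # Map entity id -> mention so each argument can be located.
            entities = {m["id"]: m for m in doc["entity_mentions"]}
            for event in doc["event_mentions"]:
                sent_idx = event["trigger"]["sent_idx"]
                # Drop arguments whose mention is outside the trigger sentence.
                args = [
                    a for a in event["arguments"]
                    if entities[a["entity_id"]]["sent_idx"] == sent_idx
                ]
                examples.append({
                    "doc_id": doc["doc_id"],
                    "sent_idx": sent_idx,
                    "event_type": event["event_type"],
                    "trigger": event["trigger"]["text"],
                    "arguments": [(a["role"], a["text"]) for a in args],
                })
    return examples
```

Counting the resulting examples on the test split is then what should give the 573 figure reported in the paper.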
Hi, I downloaded the WikiEvents dataset from here:
https://github.com/raspberryice/gen-arg
and processed it using your data generation code and config. Then I ran your evaluation pipeline with the config below (basically the original GoLLIE-7B eval config reduced to a single task, plus changes to max_seq_length and batch size). I ran it on the WikiEvents EE dev set (20 examples) but got a poor result:
I also noticed in your paper's appendix that you used a total of 573 examples when evaluating on WikiEvents EE. Could you please confirm which portion of the data you are using, and why the performance on the dev set is so low? Thanks!