hitz-zentroa / GoLLIE

Guideline following Large Language Model for Information Extraction
https://hitz-zentroa.github.io/GoLLIE/
Apache License 2.0

Not Able to Reproduce WikiEvents EE Performance #21

Closed: tommypolikj closed this issue 1 week ago

tommypolikj commented 1 month ago

Hi, I downloaded the WikiEvents dataset from here: https://github.com/raspberryice/gen-arg and processed it with your data generation code and config. I then ran your evaluation pipeline with the config below (essentially the original GoLLIE-7B eval config reduced to a single task, plus changes to max_seq_length and batch size):

```yaml
# Training args
model_name_or_path: HiTZ/GoLLIE-7B
torch_dtype: bfloat16
use_lora: false
quantization: 4
quantization_inference: null
gradient_checkpointing: true
force_auto_device_map: false
use_flash_attention: true

# dataset arguments
dataset_dir:
  /iliad/u/tommy01/multi-news/GoLLIE/data/processed_w_examples
train_tasks:
  - wikievents.eae
  - wikievents.ee
validation_tasks:
  # - wikievents.eae
  - wikievents.ee
test_tasks:
  # - wikievents.eae
  - wikievents.ee
max_examples_per_task_train: 30000
max_examples_per_task_val: 5000
max_examples_per_task_test: null
max_seq_length: 8192
generation_max_length: 8192
ignore_pad_token_for_loss: true
prompt_loss_weight: 0.0

# checkpoint settings
output_dir: /iliad/u/tommy01/multi-news/GoLLIE/GoLLIE+-7b_CodeLLaMA
overwrite_output_dir: true
load_best_model_at_end: false
save_strategy: "epoch"
save_steps: 1000
save_total_limit: 999

# evaluation
do_train: false
do_eval: false
do_predict: true
evaluation_strategy: "steps"
eval_steps: 500
eval_delay: 0
predict_with_generate: true
evaluate_all_checkpoints: false

# batch size
per_device_train_batch_size: 2
per_device_eval_batch_size: 1
gradient_accumulation_steps: 1
generation_num_beams: 1

# optimizer settings
optim: adamw_torch_fused
learning_rate: 0.0003
weight_decay: 0.0
num_train_epochs: 3
lr_scheduler_type: cosine
warmup_ratio: 0.03
adam_epsilon: 1e-7

# lora settings
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - all

# reporting
logging_strategy: steps
logging_first_step: true
logging_steps: 25
report_to: wandb
run_name: "GoLLIE+-7b_CodeLLaMA"
disable_tqdm: false

# hub settings
push_to_hub: false
resume_from_checkpoint: false

# performance
bf16: true
fp16: false
torch_compile: false
ddp_find_unused_parameters: false
```

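For context, the quantization / dtype settings above roughly correspond to loading the model like this with plain Hugging Face transformers. This is only a minimal sketch for illustration (not the repo's own loading code); flash attention and the LoRA options are handled by the pipeline itself:

```python
# Minimal illustrative sketch: load HiTZ/GoLLIE-7B in 4-bit with bfloat16
# compute, roughly matching `quantization: 4` and `torch_dtype: bfloat16`
# from the config above. Not the evaluation pipeline's actual loader.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("HiTZ/GoLLIE-7B")
model = AutoModelForCausalLM.from_pretrained(
    "HiTZ/GoLLIE-7B",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # the model card loads GoLLIE with remote code enabled
)
```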
I ran it on the WikiEvents EE dev set (20 examples), but the results are poor:

```json
"events": {
    "precision": 0.4,
    "recall": 0.11014492753623188,
    "f1-score": 0.17272727272727273
},
"arguments": {
    "precision": 0.0,
    "recall": 0.0,
    "f1-score": 0.0
},
```
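For what it's worth, the reported F1 values are just the harmonic mean of the precision and recall above, so the scorer itself looks internally consistent:

```python
# Sanity check: the "events" f1-score equals the harmonic mean of the
# reported precision and recall (numbers copied from the output above).
precision, recall = 0.4, 0.11014492753623188
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # 0.17272727272727273, matching the reported f1-score
```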

I also noticed in your paper's appendix that you used a total of 573 examples when evaluating on WikiEvents EE. Could you please confirm which portion of the data you used? Also, why is the performance on the dev set so low here? Thanks!

osainz59 commented 1 month ago

Hi @tommypolikj !

The WikiEvents dataset is evaluated with two different strategies: standard event argument extraction and "informative" event argument extraction. In the first, most of the arguments (around 99.5%) occur in the same sentence as the trigger. In the second, informative arguments follow the paper's definition, "We define name mentions to be more informative than nominal mentions, and pronouns to be the least informative", and they usually lie outside the trigger's sentence.

We focus on the first scenario and split the dataset into sentences. The 573 examples refer to the number of sentences in the test set. You can obtain this sentence-level WikiEvents dataset by running our pre-processing script.
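Roughly speaking, the conversion keeps one example per sentence and drops the arguments that fall outside the trigger's sentence. The sketch below is only an illustration of that idea (the document structure and field names are made up, not GoLLIE's real data format); the pre-processing script is the reference implementation:

```python
# Illustrative sketch only (not the actual pre-processing script): split a
# document-level event example into sentence-level examples, keeping only
# arguments whose span lies inside the trigger's sentence.

def to_sentence_level(doc):
    sent_examples = []
    for start, end in doc["sent_spans"]:  # character offsets of each sentence
        events = []
        for event in doc["events"]:
            t_start, t_end = event["trigger_span"]
            if not (start <= t_start and t_end <= end):
                continue  # trigger is not in this sentence
            # keep only arguments fully contained in the trigger's sentence
            args = [a for a in event["arguments"]
                    if start <= a["span"][0] and a["span"][1] <= end]
            events.append({"type": event["type"],
                           "trigger": doc["text"][t_start:t_end],
                           "arguments": args})
        sent_examples.append({"text": doc["text"][start:end], "events": events})
    return sent_examples


if __name__ == "__main__":
    doc = {
        "text": "A bomb exploded downtown. The attacker fled the scene.",
        "sent_spans": [(0, 25), (26, 54)],
        "events": [{
            "type": "Conflict.Attack",
            "trigger_span": (7, 15),  # "exploded"
            "arguments": [
                {"role": "Instrument", "span": (2, 6)},   # "bomb": same sentence, kept
                {"role": "Attacker", "span": (30, 38)},   # "attacker": next sentence, dropped
            ],
        }],
    }
    for example in to_sentence_level(doc):
        print(example["text"], "->", example["events"])
```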