ianporada / coref-reeval

A controlled reevaluation of coreference resolution models

Finetuning Implementation for LinkAppend #1

Closed shaked571 closed 2 months ago

shaked571 commented 2 months ago

Hey,

I just came across this repo while searching for an implementation of the LinkAppend model.

I am planning to publish a paper on a new coreference dataset for a new language and have tried all the models in your repo (after some work upgrading the source code, etc.).

In the paper they mentioned that they finetuned the model, but I didn't see an implementation of this in the repo. Is there one I might have missed?

P.S. I saw in your paper that you want to standardize the way we evaluate coreference (important work btw, as someone who is suffering from this right now), so thanks again for your contribution.

ianporada commented 2 months ago

Essentially, if you set --model_path to 'oracle' when running the LinkAppend/main.py script, the output will include a list of input/output pairs which you can then use to finetune any seq2seq model. In our case, we used the HuggingFace library to finetune mT5 on these seq2seq input/output pairs. I'll add our finetuning code for that to this repo and put an update here once it's added (we used the same hyperparameters as the original LinkAppend paper except for model size).
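For reference, that finetuning step can look roughly like the sketch below. This is not the exact script we used; the oracle_pairs.json file name, the "input"/"output" field names, and the hyperparameters are placeholders you would adapt to the actual oracle output.

# Minimal sketch of finetuning mT5 on seq2seq input/output pairs with HuggingFace.
# The data file name, field names, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/mt5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# JSON-lines file with one {"input": ..., "output": ...} pair per line (assumed format).
dataset = load_dataset("json", data_files={"train": "oracle_pairs.json"})

def preprocess(examples):
    model_inputs = tokenizer(examples["input"], max_length=2048, truncation=True)
    labels = tokenizer(text_target=examples["output"], max_length=384, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

training_args = Seq2SeqTrainingArguments(
    output_dir="mt5_linkappend_finetuned",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,  # placeholder, not the paper's value
    num_train_epochs=3,
    logging_steps=50,
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()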

What we found is that LinkAppend does not work very well for smaller mT5 sizes, which makes sense because errors propagate: for example, it is very common for mT5-large to generate outputs that aren't an exact mention string, which almost never happens with mT5-xxl. If you wanted to finetune a model for practical purposes using the LinkAppend setup, I would suggest finetuning the public mT5-xxl LinkAppend weights on a few examples (as in the few-shot examples in the original LinkAppend paper). If you don't have enough GPU memory to accommodate that, you might be able to use PEFT methods such as LoRA, as sketched below.
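If you go the PEFT route, wrapping an mT5 checkpoint with LoRA adapters looks roughly like this (the checkpoint name, rank, and target modules are illustrative values, not something we validated):

# Rough sketch of adding LoRA adapters to an mT5 checkpoint with the PEFT library
# so that only a small fraction of parameters needs to be trained.
import torch
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

# "google/mt5-xxl" is a placeholder; substitute the public mT5-xxl LinkAppend weights.
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-xxl", torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                       # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v"],  # T5/mT5 attention projection module names
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable

# The wrapped model can then be passed to the same Seq2SeqTrainer setup as above.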

ianporada commented 2 months ago

I've added an example script for finetuning mt5-large here: https://github.com/ianporada/coref-reeval/blob/main/models/LinkAppend/finetuning/main_large.py

Let me know if you have any other questions!

shaked571 commented 2 months ago

I tried to run it, but I get a formatting error when I run with:

python main.py --model_path oracle \
               --max_input_size 3000 \
               --output_dir output_oracle \
               --split test \
               --batch_size 4 \
               --dataset_name preco \
               --no_pound_symbol \
               --subset 10 \
               --subset_start 0

It seems like the preco dataset doesn't fit the scheme and is missing the field 'coref_spans'. Unfortunately, I don't have access to the default dataset (I raised a request via Hugging Face), so when I execute load_dataset('coref-data/conll2012_indiscrim', 'english_v4') it fails and I get an exception as well.
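Once my access request is approved, I assume the standard Hub login flow should be enough to load the gated dataset, roughly like this (nothing repo-specific):

# Generic pattern for loading a gated dataset after access has been granted on the Hub.
from huggingface_hub import login
from datasets import load_dataset

login()  # or login(token="hf_...") with a user access token
dataset = load_dataset("coref-data/conll2012_indiscrim", "english_v4")
print(dataset)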

How exactly should I run main.py?

P.S. Sorry for the delayed answer, I needed to set up an A100 machine to get things running.