ianporada / coref-reeval

A controlled reevaluation of coreference resolution models

MT5 fine-tuning #5

Open Gypsophila1006 opened 2 months ago

Gypsophila1006 commented 2 months ago

When I try to fine-tune the mT5 model, I cannot obtain the 'llm-coref/mt5-coref-ontonotes' dataset. Is this a private dataset? How can I obtain it? Thank you!

ianporada commented 2 months ago

Hi, yes, this is a private dataset, as we cannot distribute the OntoNotes data. You can recreate the dataset by setting --model_path to oracle, in which case the inference code will generate the training data.

For convenience, I've made it public but gated, so that you can request access if you already have access to the OntoNotes data: https://huggingface.co/datasets/llm-coref/mt5-coref-ontonotes
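
Once access is granted, a minimal sketch for loading the gated dataset with the Hugging Face datasets library (this assumes you've already authenticated, e.g. via huggingface-cli login):

from datasets import load_dataset

# Requires approved access to the gated dataset and prior
# authentication (e.g. `huggingface-cli login`).
ds = load_dataset("llm-coref/mt5-coref-ontonotes")
print(ds)  # inspect the available splits and features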

ianporada commented 2 months ago

That being said, fine-tuning mT5 at the base and large sizes does not yield great performance (as shown in the paper). At these sizes, the model often "hallucinates" mentions, which has a compounding negative effect with each sentence.

Gypsophila1006 commented 2 months ago

I want to try Chinese coreference resolution. Did you only use the English data in the OntoNotes dataset when fine-tuning? Could you please share your method for processing the OntoNotes dataset? I would like to refer to it when implementing the Chinese version.

ianporada commented 2 months ago

I only used English data. The process for generating the data is to run the command in the readme with --model_path set to oracle, e.g.:

python main.py \
    --model_path oracle \
    --max_input_size 3000 \
    --output_dir $OUTPUT \
    --split test \
    --batch_size 4 \
    --dataset_name preco \
    --no_pound_symbol \
    --subset 10 \
    --subset_start 0

which will generate a file of input/output pairs.
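
In case it's useful, here is a rough sketch for eyeballing the generated pairs; the pairs.jsonl file name and the "input"/"output" field names are assumptions, so adjust them to whatever the oracle run actually writes into $OUTPUT:

import json

# Hypothetical file name and field names; check what the oracle
# run actually writes into $OUTPUT before relying on this.
with open("pairs.jsonl") as f:
    for i, line in enumerate(f):
        pair = json.loads(line)
        print("INPUT: ", pair["input"][:200])
        print("OUTPUT:", pair["output"][:200])
        if i >= 2:  # only show the first few pairs
            break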

Let me know if that works for you. I can try to generate Chinese training data when I get the chance. The Chinese OntoNotes data uniquely has "zero anaphora" annotations, but I believe the training data generation process should still work as above.

ianporada commented 2 months ago

I've generated training data for Chinese OntoNotes by running the following:

cd models/decoder_based/LinkAppend

# "oracle" makes the inference code emit gold input/output pairs
# instead of running a model.
MODEL_CHECKPOINT=oracle
OUTPUT=~/linkappend_output
mkdir -p $OUTPUT

# Generate training data from the Chinese OntoNotes train split.
python main.py \
    --model_path $MODEL_CHECKPOINT \
    --max_input_size 3000 \
    --output_dir $OUTPUT \
    --split train \
    --batch_size 1 \
    --dataset_name ontonotes_chinese \
    --no_pound_symbol

The data is available here: https://huggingface.co/datasets/llm-coref/mt5-coref-ontonotes-chinese

Please let me know if it looks correct to you.
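
For a quick sanity check, something like this should print a few examples once access is granted (the "train" split name is an assumption):

from datasets import load_dataset

# Assumes approved access and prior `huggingface-cli login`;
# the "train" split name is an assumption.
ds = load_dataset("llm-coref/mt5-coref-ontonotes-chinese", split="train")
for example in ds.select(range(3)):  # print the first few examples
    print(example)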