Open Gypsophila1006 opened 2 months ago

When I try to fine-tune the mT5 model, I cannot obtain the `llm-coref/mt5-coref-ontonotes` dataset. Is this a private dataset? How can I obtain it? Thank you!
Hi, yes this is a private dataset, as we cannot distribute the OntoNotes data. You can recreate the dataset by setting `--model_path` to `oracle`, in which case the inference code will generate the training data.
For convenience I've made it public but gated, so you can request access if you already have access to the OntoNotes data: https://huggingface.co/datasets/llm-coref/mt5-coref-ontonotes
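For reference, once access is granted it should load like any other gated Hugging Face dataset. A minimal sketch, assuming you are authenticated (e.g. via `huggingface-cli login`); the split and column names are whatever the dataset card lists:

```python
# Minimal sketch: assumes access to the gated repo has been granted and you
# are logged in to the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("llm-coref/mt5-coref-ontonotes")
print(ds)                    # show the available splits and columns
first_split = next(iter(ds))
print(ds[first_split][0])    # peek at one input/output pair
```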
That said, fine-tuning mT5 at the base and large sizes does not yield great performance (as shown in the paper). At these sizes the model often "hallucinates" mentions, which has a compounding negative effect with each sentence.
I want to try Chinese coreference resolution. Did you only use the English data in the OntoNotes dataset when fine-tuning? Could you please share your method for processing the OntoNotes dataset? I would like to refer to it to implement the Chinese version.
I only used English data. To generate the data, run the command in the README with the model set to `oracle`, e.g.:
```
python main.py \
  --model_path oracle \
  --max_input_size 3000 \
  --output_dir $OUTPUT \
  --split test \
  --batch_size 4 \
  --dataset_name preco \
  --no_pound_symbol \
  --subset 10 \
  --subset_start 0
```
which will generate a file of input/output pairs.
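In case it's useful, here is a rough fine-tuning sketch over those pairs. This is not the authors' exact recipe: the file path, the `input`/`output` column names, and the hyperparameters are all assumptions, so adjust them to whatever `main.py` actually writes out.

```python
# Rough fine-tuning sketch over the oracle-generated pairs. Path, column
# names, and hyperparameters below are assumptions, not the paper's setup.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tok = AutoTokenizer.from_pretrained("google/mt5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-base")

# Hypothetical path to the generated pairs, assumed to be JSONL.
raw = load_dataset("json", data_files={"train": "linkappend_output/train.jsonl"})

def preprocess(batch):
    enc = tok(batch["input"], truncation=True, max_length=2048)
    enc["labels"] = tok(
        text_target=batch["output"], truncation=True, max_length=2048
    )["input_ids"]
    return enc

train = raw["train"].map(
    preprocess, batched=True, remove_columns=raw["train"].column_names
)

args = Seq2SeqTrainingArguments(
    output_dir="mt5-coref-ft",
    per_device_train_batch_size=4,  # illustrative values only
    learning_rate=1e-4,
    num_train_epochs=3,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
)
trainer.train()
```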
Let me know if that works for you. I can try to generate Chinese training data when I get the chance. The Chinese OntoNotes data uniquely has "zero anaphora" annotations, but I believe the training data generation process should still work as above.
I've generated training data for Chinese OntoNotes by running the following:
```
cd models/decoder_based/LinkAppend
MODEL_CHECKPOINT=oracle
OUTPUT=~/linkappend_output
mkdir $OUTPUT
python main.py \
  --model_path $MODEL_CHECKPOINT \
  --max_input_size 3000 \
  --output_dir $OUTPUT \
  --split train \
  --batch_size 1 \
  --dataset_name ontonotes_chinese \
  --no_pound_symbol
```
The data is available here: https://huggingface.co/datasets/llm-coref/mt5-coref-ontonotes-chinese
Please let me know if it looks correct to you.
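If anyone wants to double-check it, a quick way is to load a few examples and eyeball them. Again a sketch under assumptions: gated access has been granted, and the split names match the dataset card.

```python
# Quick sanity check on the Chinese dataset; assumes gated access is granted
# and you are logged in. Split names are assumptions -- check the dataset card.
from datasets import load_dataset

ds = load_dataset("llm-coref/mt5-coref-ontonotes-chinese")
for split in ds:
    print(split, len(ds[split]))           # examples per split
sample_split = next(iter(ds))
print(ds[sample_split][0])                 # eyeball segmentation and annotations
```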