richardpaulhudson / coreferee

Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further languages
MIT License

Guidelines for annotating own dataset to finetune coreferee pretrained model #16

Closed Tanmay98 closed 1 year ago

Tanmay98 commented 1 year ago

Hi, I am interested in annotating my own custom dataset for fine-tuning the existing pretrained model. I have tried reviewing some of the publicly available datasets, like

I am a little confused, as they are not all similar to each other. Can you suggest some basic guidelines for annotation? It would be a great help. Thanks in advance!

Tanmay98 commented 1 year ago

So I tried to figure out the annotation used in the LitBank dataset. I am able to produce the brat.ann and brat.txt files for NER, but how do you combine them using cl-coref-annotator to produce the .tsv files?

Can you help or give some suggestions on how to proceed?

Tanmay98 commented 1 year ago

I have annotated my own custom dataset for coreferee, but it's a small dataset for fine-tuning. The code provided lets me fine-tune on top of the current public coreferee model, right? Thanks in advance!

richardpaulhudson commented 1 year ago

Unfortunately Coreferee does not support fine-tuning an existing model with new training data, although you can certainly train a new model with the data that was used to train an existing model plus your new training data, which should have the same effect.

If your main aim is to get Coreferee working with a custom spaCy model, please follow the instructions at https://github.com/explosion/coreferee/issues/13.

If, on the other hand, you wish to add training data to improve the accuracy of the coreference resolution itself, the annotated data is loaded using the loader classes in https://github.com/explosion/coreferee/blob/master/coreferee/training/loaders.py. Looking at these loader classes, you will notice that they do not require all features of each of the supported annotation formats: if you are annotating data specifically for Coreferee, you only need to provide the features that the loader class in question actually examines.

You may also find it quicker to produce training data in some other or custom format and to write your own loader class to load it. As you can see in the command under point 9 of https://github.com/explosion/coreferee/#adding-support-for-a-new-language, it is possible to combine training data in different formats, so whatever format you use, you will have no problem combining it with the existing (standard) training data.
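
Purely as an illustration of the shape such a custom loader might take, here is a minimal sketch. The file format, the class name and the `gold_chains` key are assumptions invented for this example; the actual interface and annotation mechanism to copy are those of the existing classes in loaders.py.

```python
# Hypothetical sketch only: the file format (.txt document text plus a
# .chains file listing each coreference chain as groups of token indices),
# the class name and the "gold_chains" key are all invented for this
# illustration. A real loader should mirror the interface and annotation
# mechanism of the existing classes in coreferee/training/loaders.py.

import os
from typing import List

from spacy.language import Language
from spacy.tokens import Doc


class CustomFormatLoader:
    """Loads documents annotated in a simple, made-up format:

    example.txt     raw document text
    example.chains  one chain per line, e.g. "3,4 17 25,26"
                    (whitespace-separated mentions, each mention a
                    comma-separated list of token indices)
    """

    def load(self, directory: str, nlp: Language) -> List[Doc]:
        docs = []
        for filename in sorted(os.listdir(directory)):
            if not filename.endswith(".txt"):
                continue
            stem = filename[: -len(".txt")]
            with open(os.path.join(directory, filename), encoding="utf-8") as f:
                doc = nlp(f.read())
            chains = []
            with open(
                os.path.join(directory, stem + ".chains"), encoding="utf-8"
            ) as f:
                for line in f:
                    mentions = [
                        [int(index) for index in mention.split(",")]
                        for mention in line.split()
                    ]
                    if len(mentions) > 1:  # a chain needs at least two mentions
                        chains.append(mentions)
            # The real loaders hand the gold chains to the training code in a
            # specific form; check loaders.py for the exact mechanism. Storing
            # them in Doc.user_data here is just a placeholder.
            doc.user_data["gold_chains"] = chains
            docs.append(doc)
        return docs


# Usage sketch:
#   nlp = spacy.load("en_core_web_sm")
#   docs = CustomFormatLoader().load("path/to/annotated/data", nlp)
```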

Tanmay98 commented 1 year ago

@richardpaulhudson Thanks for your response. I see; unfortunately the custom NER spaCy model I trained was on a small dataset, as I fine-tuned it over the existing "en_core_web_sm" model, so basically I don't have any existing big dataset.

So let's suppose I want to train the coreferee model for my custom entities: how much training data would be enough to get a decent baseline custom coreferee model? Or is it not feasible to retrain a new coreferee model for new NER entities?

Thanks in advance!

richardpaulhudson commented 1 year ago

Please read my answer to https://github.com/explosion/coreferee/issues/13 and get back to me if there are still open questions regarding this issue.