kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

How to add new tags? #1087

Open dlculver opened 8 months ago

dlculver commented 8 months ago

Hello,

I am interested in training my own Grobid model to work on documents from a domain other than scientific papers. For now, I want to train a header model to identify particular parties in my documents, but I am a bit confused about the process. As I understand it, I take some PDFs, use Grobid's batch mode to generate training and evaluation data, annotate that data manually, and then train the model. What I cannot figure out is how to add new tags to the TEI schemas. Where, in particular, do I need to add new tags in order to train a header model?

Thanks!

lfoppiano commented 8 months ago

Dear @dlculver, thanks for your interest in Grobid. Modifying the training data can seem like a complex process at first.

Could you please explain in a bit more detail what you want to do? By "add new tags", do you mean extending the existing tagset, or just using the existing tags for additional objects in the TEI?