kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0
3.61k stars, 461 forks

Re-Training the header model #198

Open cwenge opened 7 years ago

cwenge commented 7 years ago

Hello guys,

I am a student and fairly new to using AI techniques in practice, especially Grobid, so I would like to ask some basic questions. This is my situation: I want to extract information (title, authors, abstract paragraph and the publishing magazine) from PDF files. Grobid already works quite well with your pre-trained model, especially for titles, authors and abstract paragraphs. The image below is an example of a paper I want to extract from. The magazine is marked in yellow, and I would like to extract this information as well. I have quite a lot of papers from Elsevier, always with the same structure.

I want to re-train the header model with these files. So I created the training files using the Grobid batch mode and was about to correct these files, but the yellow marked text isn't present in the .header.tei.xml file. So here are my questions:

1) Your guidelines say 'to keep the stream of text untouched' and that the 'text itself shall not be modified or corrected'. So I can't just add the text with some kind of tag in order to train Grobid to pick up this information as well? (Probably because the .header.tei.xml and .header files won't fit together then?)

2) Should I rather use the fulltext model?

3) To speed up the process of correcting data, I was thinking about just deleting the tags and text in the .header.tei.xml files that I am not interested in. However, the problem here is that I would be modifying the stream of text once again? Furthermore, the learning process is probably better when the whole text is tagged and each part can be related to the others?

4) Basically, I am not even quite sure what to correct in the .header.tei.xml files then. I can only modify tags, and have to check, e.g., that the abstract paragraph is not tagged as introduction?

5) Is there any (more or less easy) way to make Grobid learn to extract (new types of) data that it does not extract so far?

I hope I could describe my problem clearly. Thank you in advance.

image

kermitt2 commented 7 years ago

Hello!

Normally this yellow piece of text should be present in the .header and .header.tei.xml files as a note (not ideal, but at least present):

    <note type="other">B ENVlRONMENTAL<lb/> ELSEVIER<lb/> Applied Catalysis B: Environmental 15 (1998) 5-19<lb/></note>

  1. Regarding your question 1: indeed, you can't add or remove text. You can only move tags, so that the alignment with the .header file keeps working correctly.

So you would need to change the excerpt above to:

    <note type="other">B ENVlRONMENTAL<lb/> ELSEVIER<lb/></note> 
    <reference>Applied Catalysis B: Environmental 15 (1998) 5-19<lb/></reference>

(modifying the end-of-line is OK, because the explicit line break markers <lb/> are used)
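A quick way to catch accidental text changes while moving tags is to compare the character stream before and after editing. The following is only a minimal sketch, not part of GROBID: the helper names `text_stream` and `same_stream` are made up for illustration, and ignoring all whitespace is an assumption based on the remark that end-of-line changes are fine.

```python
import html
import re


def text_stream(tei: str) -> str:
    """Return the text of a TEI fragment with tags and whitespace removed."""
    no_tags = re.sub(r"<[^>]+>", "", tei)   # drop every tag, including <lb/>
    unescaped = html.unescape(no_tags)      # turn entities such as &apos; into characters
    return re.sub(r"\s+", "", unescaped)    # assumption: whitespace changes do not matter


def same_stream(original: str, edited: str) -> bool:
    """True if only tags (and whitespace) differ between the two fragments."""
    return text_stream(original) == text_stream(edited)


# The excerpt from this thread, before and after moving the tags.
before = (
    '<note type="other">B ENVlRONMENTAL<lb/> ELSEVIER<lb/> '
    "Applied Catalysis B: Environmental 15 (1998) 5-19<lb/></note>"
)
after = (
    '<note type="other">B ENVlRONMENTAL<lb/> ELSEVIER<lb/></note>\n'
    "<reference>Applied Catalysis B: Environmental 15 (1998) 5-19<lb/></reference>"
)

print(same_stream(before, after))
```

Correcting the OCR error in the text (for example `ENVlRONMENTAL` to `ENVIRONMENTAL`) would make `same_stream` return `False`, which is exactly the kind of edit to avoid in the training files.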

  2. The fulltext model is not related at all to the header model.

  3. Yes, don't do that :)

  4. The missing piece is the segmentation model. Before the header is processed, it is identified as a general area by the segmentation model. So if you observe that a piece of the header is missing, or that there is too much text in the header files, you first have to add training data to the segmentation model to fix it. So run createTrainingSegmentation, correct the training data for the segmentation model (by moving the header tags accordingly) and retrain the segmentation model.

  5. Yes, you can define new labels for the model you want to extend (under org/grobid/core/engines/label/). Then map these labels to XML tags (by modifying the SAX parsers). Then add these tags to the XML mark-up of the training files (the most time-consuming part). Finally, you have to exploit these new labels to create data structures when reading the CRF results (this is usually in a method called resultExtraction in each parser).

The best is to have a look at the GROBID modules that extract other types of structures than those of the main grobid: grobid-quantities (extracting physical measures), grobid-ner (extracting named entities), grobid-dictionaries (extracting lexical entries in dictionaries), grobid-astro (extracting astronomical entities), ...
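To make the last point above more concrete, here is a language-neutral sketch in Python of what mapping labels to tags and a result-extraction step amount to. It is not GROBID code: the `<journal>` label, the `LABEL_TO_TAG` table and both function names are hypothetical, and the convention that an `I-` prefix marks the first token of a field is an assumption about the label scheme.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from model labels to XML element names.
LABEL_TO_TAG = {
    "<note>": "note",
    "<reference>": "reference",
    "<journal>": "journal",  # hypothetical new label, for illustration only
}


def extract_fields(labelled_tokens):
    """Group consecutive tokens that share a label into (label, text) fields.

    Assumption: a label prefixed with "I-" starts a new field.
    """
    fields = []
    for token, label in labelled_tokens:
        starts_field = label.startswith("I-")
        base = label[2:] if starts_field else label
        if starts_field or not fields or fields[-1][0] != base:
            fields.append((base, [token]))   # open a new field
        else:
            fields[-1][1].append(token)      # continue the current field
    return [(label, " ".join(tokens)) for label, tokens in fields]


def to_xml(fields):
    """Render fields as XML elements; unknown labels fall back to <note>."""
    lines = []
    for label, text in fields:
        tag = LABEL_TO_TAG.get(label, "note")
        lines.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(lines)


# Made-up labelled output for part of the header line discussed in this thread.
tokens = [
    ("ELSEVIER", "I-<note>"),
    ("Applied", "I-<journal>"),
    ("Catalysis", "<journal>"),
    ("B:", "<journal>"),
    ("Environmental", "<journal>"),
]

fields = extract_fields(tokens)
print(to_xml(fields))
```

In GROBID itself the corresponding pieces live in the Java label definitions, SAX parsers and parser classes mentioned above. The sketch only shows the shape of the work, not the real API.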

naimavahab commented 6 years ago

Hi, two more questions related to this problem.

1) Can we only move tags, or can we also edit them? E.g., can <forename>Ji-Eun</forename> be edited to <surname>Cuong</surname>?

2) Can we move tags inside other tags? E.g.:

    <persName>
        <forename>Ji-Eun</forename>
        <surname>Cuong</surname>
    </persName>
    <persName>
        <forename>Mai</forename>
        <surname>Nguyen2&apos;3&apos;&apos;&apos;</surname>,
    </persName>

Here I want Cuong, currently tagged as the surname of the first author, to become a forename of the second author, like this:

    Ji-Eun
    Cuong Mai Nguyen2'3''',

YangaPri commented 6 years ago

Hi,

I have one more question, about incremental training for bibliographic references in GROBID.

Is it possible to add new tag names inside a reference, beyond the already existing ones?

Eg.

DOI: <idno type="doi">10.1016/j.artmed.2010.12.004</idno>, PMID: <idno type="pmid">21232927</idno>

Here I added a 'PMID' number inside the reference, in the same way as the 'DOI' number.

  1. Can we add new element tags to references in the GROBID training data set?
  2. After training on that data set, will the model correctly identify those elements with my newly introduced tag name?
  3. Will it produce the tei.xml with the same tag name I used to train the model?

Thank you in advance.