kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Handling Jr., Sr. in names (Affiliation and Citation) #196

Open dominic-sps opened 7 years ago

dominic-sps commented 7 years ago

I am using the latest version, 0.4.2, and reproduced the following issues on Windows 7 as well as CentOS 7.

Reference citation sample checked:

Clow GD, McKay CP, Simmons Jr. GM, and Wharton RA, Jr. 1988. Climatological observations and predicted sublimation rates at Lake Hoare, Antarctica. Journal of Climate 1:715-728.

Issue 1. It changes the forename "GD" to "Gd", "CP" to "Cp", etc.

Issue 2. It captures "Jr." as a surname and tags "GM" as a separate surname without a forename.

<author>
    <persName>
        <forename type="first">Cp</forename>
        <surname>Mckay</surname>
    </persName>
</author>    
<author>     
    <persName>
        <forename type="first">Simmons</forename>
        <surname>Jr</surname>
    </persName>
</author>    
<author>     
    <persName>
        <surname>Gm</surname>
    </persName>
</author>
  1. How can I retain the forename (initials) as it is, without it being converted to title case ("Gd")?
  2. There are enough examples with Jr. (for example) in "name\header\corpus", so I don't know why GROBID is not capturing it in the suffix tag. This happens in the header part as well as in the citation part.

For the suffix issue, I have attached a PDF from NCBI that corresponds to grobid-trainer/resources/dataset/name/header/corpus/1468-6708-3-10.authors.tei.xml: 1468-6708-3-10.pdf

dominic-sps commented 7 years ago

I am trying GROBID with one of your test files, grobid-example\src\test\resources\Wang_paperAVE2008.pdf, which I copied to \test\in for the following test. I changed the author name to "Rui Wang Jr".

In Windows 7 (64-bit), with createTrainingHeader:

java -Xmx1024m -jar \grobid\grobid-core\target\grobid-core-0.4.2-SNAPSHOT.one-jar.jar -gH \grobid\grobid-home -gP \grobid\grobid-home\config\grobid.properties -dIn \test\in -dOut \test\out -exe createTrainingHeader

I am looking at the generated Wang_paperAVE2008.authors.tei.xml file. Here the results are more accurate and meet my requirements. Most of the content is also present in the output.

Then I run processHeader:

java -Xmx1024m -jar \grobid\grobid-core\target\grobid-core-0.4.2-SNAPSHOT.one-jar.jar -gH \grobid\grobid-home -gP \grobid\grobid-home\config\grobid.properties -dIn \test\in -dOut \test\out -exe processHeader

I am looking only at the author area of the Wang_paperAVE2008.tei.xml file. Here the element identification is mostly wrong. Both commands load the same model files, except that the first one also uses segmentation\model.wapiti.

I am looking at the content of the XML and am not concerned about its structure. The difference I see is that createTrainingHeader gives more accurate results.

Wang_paperAVE2008.pdf
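For anyone repeating this comparison, the two runs can be scripted so that only the -exe value changes. The following is a minimal sketch, not part of GROBID: the helper name grobid_batch_command is hypothetical, and it uses only the flags and paths that appear in the two commands above.

```python
def grobid_batch_command(exe, jar, grobid_home, properties, in_dir, out_dir, heap="1024m"):
    """Build the java argument list for one GROBID batch run.

    Hypothetical helper: it only reuses flags that appear in the commands above.
    """
    return [
        "java", "-Xmx" + heap, "-jar", jar,
        "-gH", grobid_home,
        "-gP", properties,
        "-dIn", in_dir,
        "-dOut", out_dir,
        "-exe", exe,
    ]


# Paths copied from the commands above; adjust them to the local installation.
paths = {
    "jar": r"\grobid\grobid-core\target\grobid-core-0.4.2-SNAPSHOT.one-jar.jar",
    "grobid_home": r"\grobid\grobid-home",
    "properties": r"\grobid\grobid-home\config\grobid.properties",
    "in_dir": r"\test\in",
    "out_dir": r"\test\out",
}

training_cmd = grobid_batch_command("createTrainingHeader", **paths)
header_cmd = grobid_batch_command("processHeader", **paths)

# The two runs should differ only in the value passed to -exe.
differing = [(a, b) for a, b in zip(training_cmd, header_cmd) if a != b]
print(differing)
```

Each list can then be passed to subprocess.run(..., check=True) to execute the run, which keeps the two invocations identical apart from the mode being compared.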

kermitt2 commented 7 years ago

Thanks @dominic-sps for reporting these issues!

For your second post, see issue #200 for explanations and for how to get the same result from processHeader as from createTrainingHeader.

For the first one, these are indeed two separate issues:

  1. this kind of forename ("GD") is not well post-processed, but that should be straightforward to fix

  2. the missing Jr is due to the training data: there are too few examples of Jr and Sr in the name-citation training data. We could add more :)

dominic-sps commented 7 years ago

"missing Jr is due to the training data, there are too few examples of Jr and Sr in the name-citation training data. We could add more :)"

I have access to header XML files from major STM publishers (SpringerNature, Elsevier, Wiley and TnF), but not to the PDFs. To create the training datasets, do I have to use the createTrainingHeader option to process PDFs, or can I write an XSLT for the available header XML files, generate the *.affiliation.tei.xml and *.authors.tei.xml files directly, and train my GROBID engine with them?

kermitt2 commented 7 years ago

In GROBID, author names in reference citations (your first post in this issue) are structured with a different model than author names in the header.

For reference citations, the name-citation model is used. You need to run createTrainingFullText on a PDF. Among the different files created, you will have those corresponding to the reference citations and those corresponding to names in the reference citations, the latter being the ones of interest for you.

You can also simply add examples based on the string of authors in the reference string, see the examples under grobid-trainer/resources/dataset/name/citation/corpus/author.tei.xml for instance.

dominic-sps commented 7 years ago

Thank you, I have noted the citation-related training. I would also like to know more about training the header part.

I am currently trying to structure raw manuscripts (new, unpublished) in Word format into usable XML. I automatically clean the Word document and convert it to PDF, then use GROBID to process the PDF file. At first we are targeting only the header part, not the body or references.

Now, to create the training datasets for processing my header part:

  1. do I have to use the createTrainingHeader option to process PDFs and create the required training XML files, or
  2. can I write an XSLT for the available valid header XML files, directly generate the *.affiliation.tei.xml and *.authors.tei.xml files, and train my GROBID engine? However, I may not be able to generate the grobid-trainer/resources/dataset/header/corpus/headers/*.header file.

I would appreciate your suggestions on the above.

kermitt2 commented 7 years ago

Normally, you have to use createTrainingHeader so that you obtain the *.header files necessary for training, so option 1. You can then correct the header annotations, retrain, and then focus on the secondary models (the models applied in cascade to the result of the header model), that is, the affiliation and header author name models and their training files.

kermitt2 commented 7 years ago

I am reopening this because I will work on both better post-processing of initials and adding more suffix examples to the training data for authors in the header.

dominic-sps commented 7 years ago

Thank you for reopening this request. Earlier I fixed it temporarily in

grobid-core/src/main/java/org/grobid/core/data/Person.java: I commented out line 40 and added the line firstName = f; This worked well for my issue 1 above (the forename "GD" becoming "Gd", "CP" becoming "Cp", etc.). However, the installation test failed, stating "expected Karl-Heinz got KARL-HEINZ". I am fine with this, so I skipped the test and went ahead.
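The tension between the two cases (keep "GD" as is, but still turn "KARL-HEINZ" into "Karl-Heinz") can be illustrated with a small sketch. This is not the code in Person.java; normalize_forename and its max_initials threshold are hypothetical, and the rule (a short all-uppercase token is treated as initials) is an assumption made for illustration.

```python
import re


def normalize_forename(raw, max_initials=2):
    """Normalise the capitalisation of a forename token.

    Hypothetical rule, for illustration only: an all-uppercase token of at
    most `max_initials` letters is assumed to be a block of initials and is
    kept unchanged; anything else is capitalised one letter run at a time,
    so hyphenated names keep a capital after the hyphen.
    """
    token = raw.strip()
    if token.isalpha() and token.isupper() and len(token) <= max_initials:
        return token
    # [^\W\d_]+ matches a run of letters (word characters minus digits and "_").
    return re.sub(r"[^\W\d_]+", lambda m: m.group().capitalize(), token)


for name in ["GD", "CP", "KARL-HEINZ", "Karl-Heinz", "rui"]:
    print(name, "->", normalize_forename(name))
```

A length-based rule like this will misread genuinely short names written in capitals (for example "AL"), so it should be seen as a starting point rather than a complete solution.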

I am not sure how to tag the names below in my training corpus:

Clow GD, McKay CP, Simmons Jr. GM, and Wharton RA, Jr.

<author><surname>Wharton</surname> <forename>RA</forename>, <suffix>Jr</suffix>.</author>

or

<author><surname>Wharton</surname> <forename>R</forename><middlename>A</middlename>, <suffix>Jr</suffix>.</author>

Let me know, and I can arrange a few more training examples for Jr. and Sr. names.

kermitt2 commented 7 years ago

Hello!

I think it is correct to have KARL-HEINZ normalized into Karl-Heinz (and the test is there to ensure that). I worked a bit more on the two issues. Your example now gives the following (since commit 677f594f59c9788f28bb516b3e79238989b17589):

<biblStruct >
    <analytic>
        <title level="a" type="main">Climatological observations and predicted sublimation rates at Lake Hoare</title>
        <author>
            <persName
                xmlns="http://www.tei-c.org/ns/1.0">
                <forename type="first">G</forename>
                <forename type="middle">D</forename>
                <surname>Clow</surname>
            </persName>
        </author>
        <author>
            <persName
                xmlns="http://www.tei-c.org/ns/1.0">
                <forename type="first">C</forename>
                <forename type="middle">P</forename>
                <surname>Mckay</surname>
            </persName>
        </author>
        <author>
            <persName
                xmlns="http://www.tei-c.org/ns/1.0">
                <forename type="first">G</forename>
                <forename type="middle">M</forename>
                <surname>Simmons</surname>
                <genName>Jr</genName>
            </persName>
        </author>
        <author>
            <persName
                xmlns="http://www.tei-c.org/ns/1.0">
                <forename type="first">R</forename>
                <forename type="middle">A</forename>
                <surname>Wharton</surname>
                <genName>Jr</genName>
            </persName>
        </author>
    </analytic>
    <monogr>
        <title level="j">Antarctica. Journal of Climate</title>
        <imprint>
            <biblScope unit="volume">1</biblScope>
            <biblScope unit="page" from="715" to="728" />
            <date type="published" when="1988" />
        </imprint>
    </monogr>
</biblStruct>

I think this corresponds to the expected result and formatting.
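To check such output programmatically, the name parts can be read back from the TEI. The sketch below uses Python's standard xml.etree.ElementTree on one author element copied from the result above; read_person is a hypothetical helper name, and the only assumption is that persName carries the TEI namespace as shown.

```python
import xml.etree.ElementTree as ET

# Namespace declared on <persName> in the output above.
TEI = "{http://www.tei-c.org/ns/1.0}"

# One <author> element copied from the result above.
author_xml = """<author>
    <persName xmlns="http://www.tei-c.org/ns/1.0">
        <forename type="first">G</forename>
        <forename type="middle">M</forename>
        <surname>Simmons</surname>
        <genName>Jr</genName>
    </persName>
</author>"""


def read_person(xml_text):
    """Return the name parts of a single <author> element as a dict."""
    author = ET.fromstring(xml_text)
    pers = author.find(TEI + "persName")
    # Index the forename elements by their type attribute ("first", "middle").
    forenames = {f.get("type"): f.text for f in pers.findall(TEI + "forename")}
    return {
        "first": forenames.get("first"),
        "middle": forenames.get("middle"),
        "surname": pers.findtext(TEI + "surname"),
        "suffix": pers.findtext(TEI + "genName"),  # None when there is no suffix
    }


person = read_person(author_xml)
print(person)
```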

More training data for suffixes like Sr. and Jr. would be very welcome; there are almost no examples right now. In the training data, I have annotated the sequence Clow GD, McKay CP, Simmons Jr. GM, and Wharton RA, Jr. as

<author><lastname>Clow</lastname> <forename>GD</forename>, <lastname>McKay</lastname> <forename>CP</forename>, <lastname>Simmons</lastname> <suffix>Jr</suffix>. <forename>GM</forename>, and <lastname>Wharton</lastname> <forename>RA</forename>, <suffix>Jr</suffix>. </author>

So the block of initials is annotated as <forename>, and post-processing takes care of recognizing the initials (2 letters in upper case) and distributing them as forename and middlename.
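The post-processing rule described here can be sketched in a few lines. This is an illustration of the stated rule only (two upper-case letters are split into first and middle initial), not GROBID's actual implementation; split_initials is a hypothetical name.

```python
def split_initials(forename):
    """Split a two-letter upper-case block into (first, middle) initials.

    Any other token is returned unchanged as the first name, with no middle name.
    """
    token = forename.replace(".", "").strip()
    if len(token) == 2 and token.isalpha() and token.isupper():
        return token[0], token[1]
    return token, None


for block in ["GD", "CP", "GM", "RA", "Rui"]:
    print(block, "->", split_initials(block))
```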

As this sequence of names is now present in the training data, the above result is not a surprise; it is a way of checking that a correctly tagged sequence gets well structured and normalised. For new names in a different order, with Jr., Sr. and other suffixes, I think getting similarly good results in a robust manner will require a few more relevant cases in the training data, but only a few!
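When preparing additional examples of this kind, one simple sanity check is that removing the tags from an annotated line gives back the original author string. The sketch below applies this to the annotated sequence quoted above; plain_text and labelled_tokens are hypothetical helper names, and the check itself is a suggestion, not part of the GROBID training tools.

```python
import xml.etree.ElementTree as ET

raw = "Clow GD, McKay CP, Simmons Jr. GM, and Wharton RA, Jr."

# Annotated training line as quoted above.
annotated = (
    "<author><lastname>Clow</lastname> <forename>GD</forename>, "
    "<lastname>McKay</lastname> <forename>CP</forename>, "
    "<lastname>Simmons</lastname> <suffix>Jr</suffix>. <forename>GM</forename>, "
    "and <lastname>Wharton</lastname> <forename>RA</forename>, "
    "<suffix>Jr</suffix>. </author>"
)


def plain_text(annotated_xml):
    """Concatenate all text in the annotated line, dropping the tags."""
    return "".join(ET.fromstring(annotated_xml).itertext()).strip()


def labelled_tokens(annotated_xml):
    """List (label, text) pairs in document order."""
    return [(el.tag, el.text) for el in ET.fromstring(annotated_xml)]


print(plain_text(annotated) == raw)
print(labelled_tokens(annotated))
```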

dominic-sps commented 7 years ago

Great, thank you. I'll check this out.

Regarding suffix samples, there is already a lot of training data available in grobid-trainer/resources/dataset/name/citation/corpus/standalone.names.tei.xml.exclude

I am not sure about the purpose of the "exclude" file. If you want full references with different prefixes and suffixes, I'll arrange them.

kermitt2 commented 7 years ago

I assembled this file with suffixes and unusual examples of names for this purpose, but using it resulted in a 2-4% loss of accuracy for author name recognition, so I excluded it from the training.

I suppose the problem is that it contains only names in isolation, not sequences of names as found in academic papers. It might also create an over-representation of this kind of unusual name in the trained model. So, lesson learned: it is best to use actual data as found in academic papers, and not artificially compiled material like this file ;)
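One way to keep an eye on over-representation when adding examples is to measure how often suffixes occur relative to the number of names in the annotated lines. The sketch below is only an illustration: suffix_share is a hypothetical helper, it assumes the <lastname>/<suffix> tagging shown earlier in this thread, and it is run on the single annotated sequence quoted above rather than on a real corpus.

```python
import xml.etree.ElementTree as ET


def suffix_share(annotated_lines):
    """Return the number of <suffix> tags divided by the number of <lastname> tags."""
    names = 0
    suffixes = 0
    for line in annotated_lines:
        root = ET.fromstring(line)
        names += len(root.findall("lastname"))
        suffixes += len(root.findall("suffix"))
    return suffixes / names if names else 0.0


# The annotated sequence quoted earlier in this thread.
line = (
    "<author><lastname>Clow</lastname> <forename>GD</forename>, "
    "<lastname>McKay</lastname> <forename>CP</forename>, "
    "<lastname>Simmons</lastname> <suffix>Jr</suffix>. <forename>GM</forename>, "
    "and <lastname>Wharton</lastname> <forename>RA</forename>, "
    "<suffix>Jr</suffix>. </author>"
)

share = suffix_share([line])
print(share)
```

Comparing this ratio before and after adding new examples gives a rough indication of whether suffix-bearing names are starting to dominate the training data.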