Open dominic-sps opened 7 years ago
I am trying GROBID with one of your sample files, grobid-example\src\test\resources\Wang_paperAVE2008.pdf, which I copied into \test\in for the following test. I changed the author name to "Rui Wang Jr".
In Windows 7, 64bit with createTrainingHeader:
java -Xmx1024m -jar \grobid\grobid-core\target\grobid-core-0.4.2-SNAPSHOT.one-jar.jar -gH \grobid\grobid-home -gP \grobid\grobid-home\config\grobid.properties -dIn \test\in -dOut \test\out -exe createTrainingHeader
I am looking at the generated Wang_paperAVE2008.authors.tei.xml file. Here the results are more accurate and meet my requirements. Most of the content is also present in the output.
Then I am running with processHeader:
java -Xmx1024m -jar \grobid\grobid-core\target\grobid-core-0.4.2-SNAPSHOT.one-jar.jar -gH \grobid\grobid-home -gP \grobid\grobid-home\config\grobid.properties -dIn \test\in -dOut \test\out -exe processHeader
I am looking at the Wang_paperAVE2008.tei.xml file, at the author area only. Here the element identification is mostly wrong. Both commands load the same model files, except that the first one also uses segmentation\model.wapiti.
I am looking at the content of the XML and am not worried about the structure. I see a difference: createTrainingHeader gives more accurate results. Attached: Wang_paperAVE2008.pdf
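The two runs above differ only in the value passed to -exe. A minimal sketch of building both command lines from the same settings, so the outputs can be compared side by side; the helper name grobid_batch_command is hypothetical (it is not part of GROBID), and the paths are the ones quoted in the post:

```python
def grobid_batch_command(exe, jar, grobid_home, properties, in_dir, out_dir, heap="1024m"):
    """Build the argument list for one GROBID batch run (hypothetical helper)."""
    return [
        "java", f"-Xmx{heap}", "-jar", jar,
        "-gH", grobid_home,      # GROBID home directory
        "-gP", properties,       # path to grobid.properties
        "-dIn", in_dir,          # directory holding the input PDFs
        "-dOut", out_dir,        # directory receiving the generated files
        "-exe", exe,             # batch method to run
    ]


# Paths as quoted in the post; adjust them to the local installation.
SETTINGS = dict(
    jar=r"\grobid\grobid-core\target\grobid-core-0.4.2-SNAPSHOT.one-jar.jar",
    grobid_home=r"\grobid\grobid-home",
    properties=r"\grobid\grobid-home\config\grobid.properties",
    in_dir=r"\test\in",
    out_dir=r"\test\out",
)

for method in ("createTrainingHeader", "processHeader"):
    print(" ".join(grobid_batch_command(method, **SETTINGS)))
```

Each list can be handed to subprocess.run(cmd, check=True) to actually launch the run; the sketch only prints the commands.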
Thanks @dominic-sps for reporting these issues!
For your second post, see issue #200 for an explanation and for how to get the same result via processHeader as via createTrainingHeader.
For the first one, these are indeed two separate issues:
forenames of the kind "GD" are not well post-processed, but that should be straightforward to fix
the missing Jr is due to the training data: there are too few examples of Jr and Sr in the name-citation training data. We could add more :)
I have access to major STM publishers' (SpringerNature, Elsevier, Wiley & TnF) header XML files but not the PDFs.
To create the training datasets, do I have to use the createTrainingHeader option to process PDFs, or can I write an XSLT for the available header XML files, directly generate the *.affiliation.tei.xml and *.authors.tei.xml files, and train my GROBID engine?
In GROBID, author names in reference citations (your first post in this issue) are structured with a different model than author names in the header.
For reference citations, the name-citation model is used. You need to run createTrainingFullText on a PDF. Among the files created, you will have those corresponding to the reference citations and those corresponding to the names in the reference citations, the latter being the ones of interest to you.
You can also simply add examples based on the string of authors in the reference string; see the examples under grobid-trainer/resources/dataset/name/citation/corpus/author.tei.xml, for instance.
Thank you, I have noted the citation-related training. I would like to know more about training the header part.
I am currently trying to structure the raw manuscript (new unpublished) in Word format into usable XML format. I am automatically cleaning the Word document and converting into PDF format. Then I am using GROBID to process the PDF file. At first we are targeting only the header part and not body or references.
Now, to create the training datasets for my header part, should I (1) use the createTrainingHeader option to process PDFs and create the required training XML files, or (2) directly generate the *.affiliation.tei.xml and *.authors.tei.xml files and train my GROBID engine? However, with option 2 I may not be able to generate the grobid-trainer/resources/dataset/header/corpus/headers/*.header file. I would appreciate your suggestion on the above.
Normally, you have to use createTrainingHeader so that you can obtain the *.header files necessary for training, so option 1. You can then correct the header annotations, retrain, and focus on the secondary models (the models applied in cascade to the result of the header model): the affiliation and header author-name models and their training files.
I am reopening this issue because I will work on both better post-processing of initials and adding more suffix examples to the training data for authors in the header.
Thank you for reopening this request. Earlier I fixed it temporarily in grobid-core/src/main/java/org/grobid/core/data/Person.java: I commented out line 40 and added the line
firstName = f;
This worked well for my issue 1 above (the forename "GD" rendered as "Gd", "CP" as "Cp", etc.). However, the installation test then failed, stating "expected Karl-Heinz got KARL-HEINZ". I am fine with this, so I skipped the test and went ahead.
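The two behaviours in tension here (keeping "GD" as initials while still turning "KARL-HEINZ" into "Karl-Heinz") can be illustrated with a small sketch. This is not GROBID's Person.java logic; normalize_forename and its max_initials threshold are assumptions made for illustration, and a genuine two-letter forename written in capitals would be misread as initials by this rule.

```python
def normalize_forename(token: str, max_initials: int = 2) -> str:
    """Normalize the case of a forename token (illustrative rule only)."""
    # A short, purely alphabetic, all-caps block such as "GD" is treated
    # as a run of initials and left untouched.
    if token.isalpha() and token.isupper() and len(token) <= max_initials:
        return token
    # Anything else is capitalized part by part around hyphens,
    # so "KARL-HEINZ" becomes "Karl-Heinz".
    return "-".join(part.capitalize() for part in token.split("-"))


for raw in ("GD", "CP", "KARL-HEINZ"):
    print(raw, "->", normalize_forename(raw))
```

Simply assigning the raw string, as in the temporary fix, preserves the initials but skips the capitalization branch, which is consistent with the failing Karl-Heinz test.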
I am not sure how to tag the names below in my training corpus:
Clow GD, McKay CP, Simmons Jr. GM, and Wharton RA, Jr.
<author><surname>Wharton</surname> <forename>RA</forename>, <suffix>Jr</suffix>.</author>
or
<author><surname>Wharton</surname> <forename>R</forename><middlename>A</middlename>, <suffix>Jr</suffix>.</author>
Let me know, and I can arrange a few more training examples for Jr. and Sr. names.
Hello!
I think it is correct to have KARL-HEINZ normalized into Karl-Heinz (and the test is there to ensure that). I worked a bit more on the two issues. Your example now gives the following (since commit 677f594f59c9788f28bb516b3e79238989b17589):
<biblStruct >
<analytic>
<title level="a" type="main">Climatological observations and predicted sublimation rates at Lake Hoare</title>
<author>
<persName
xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">G</forename>
<forename type="middle">D</forename>
<surname>Clow</surname>
</persName>
</author>
<author>
<persName
xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">C</forename>
<forename type="middle">P</forename>
<surname>Mckay</surname>
</persName>
</author>
<author>
<persName
xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">G</forename>
<forename type="middle">M</forename>
<surname>Simmons</surname>
<genName>Jr</genName>
</persName>
</author>
<author>
<persName
xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">R</forename>
<forename type="middle">A</forename>
<surname>Wharton</surname>
<genName>Jr</genName>
</persName>
</author>
</analytic>
<monogr>
<title level="j">Antarctica. Journal of Climate</title>
<imprint>
<biblScope unit="volume">1</biblScope>
<biblScope unit="page" from="715" to="728" />
<date type="published" when="1988" />
</imprint>
</monogr>
</biblStruct>
I think this corresponds to the expected result and formatting.
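To check output like the above against an expected list of names, the author fields can be read back out of the TEI. A minimal sketch using Python's standard xml.etree.ElementTree; the function name authors_from_biblstruct and the shortened SAMPLE are illustrative, and the code assumes, as in the output shown, that persName carries the TEI namespace while biblStruct and author do not:

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def authors_from_biblstruct(xml_text):
    """Return one dict per persName with first, middle, surname and suffix."""
    root = ET.fromstring(xml_text)
    authors = []
    for pers in root.iter(TEI + "persName"):
        # Index the forename elements by their type attribute.
        forenames = {f.get("type"): (f.text or "") for f in pers.findall(TEI + "forename")}
        authors.append({
            "first": forenames.get("first", ""),
            "middle": forenames.get("middle", ""),
            "surname": pers.findtext(TEI + "surname", default=""),
            "suffix": pers.findtext(TEI + "genName", default=""),
        })
    return authors


# Shortened copy of the last author from the output in this thread.
SAMPLE = """<biblStruct><analytic><author>
<persName xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">R</forename>
<forename type="middle">A</forename>
<surname>Wharton</surname>
<genName>Jr</genName>
</persName>
</author></analytic></biblStruct>"""

for author in authors_from_biblstruct(SAMPLE):
    print(author)
```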
More training data for suffixes like Sr. and Jr. would be very welcome; there are almost no examples right now. In the training data, I have annotated the sequence Clow GD, McKay CP, Simmons Jr. GM, and Wharton RA, Jr. as
<author><lastname>Clow</lastname> <forename>GD</forename>, <lastname>McKay</lastname> <forename>CP</forename>, <lastname>Simmons</lastname> <suffix>Jr</suffix>. <forename>GM</forename>, and <lastname>Wharton</lastname> <forename>RA</forename>, <suffix>Jr</suffix>. </author>
So the block of initials is annotated as <forename>, and post-processing takes care of recognizing the initials (2 letters in upper case) and distributing them as forename and middlename.
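The post-processing step described here can be sketched as follows. This is an illustration of the stated rule (exactly two upper-case letters are split into first and middle initial), not the code from the commit; split_initials is a hypothetical name.

```python
def split_initials(forename: str):
    """Return (first, middle) for a forename field from the name model."""
    # Exactly two upper-case letters, e.g. "GD", are read as two initials.
    if len(forename) == 2 and forename.isalpha() and forename.isupper():
        return forename[0], forename[1]
    # Anything else is kept whole as the first name, with no middle name.
    return forename, ""


for field in ("GD", "RA", "Rui"):
    print(field, "->", split_initials(field))
```

With this rule, the annotated <forename>RA</forename> yields first "R" and middle "A", matching the forename elements in the output above.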
As this sequence of names is now present in the training data, the above result is not a surprise; it is a way of checking that a correctly tagged sequence gets well structured and normalised. I think that for new names in a different order, with Jr., Sr. and other suffixes, getting similarly good results in a robust manner will require a few more relevant cases in the training data, but only a few!
Great, thank you. I'll check this out.
Regarding suffix samples, there is a lot of training data already available in grobid-trainer/resources/dataset/name/citation/corpus/standalone.names.tei.xml.exclude. I am not sure about the purpose of the "exclude" file. If you want full references with different prefixes and suffixes, I'll arrange them.
I assembled this file with suffixes and unusual examples of names for this purpose, but using it resulted in a loss of accuracy of 2-4% for author name recognition, so I excluded it from the training.
I suppose the problem is that it contains only names in isolation, not sequences of names as found in academic papers. It might also create an over-representation of this kind of unusual name in the trained model. So, lesson learned: the best is to use actual data as found in academic papers, and not artificially compiled stuff like this file ;)
I am using the latest version, 0.4.2, and checked the following issues on Windows 7 as well as CentOS 7.
Reference citation sample checked: Clow GD, McKay CP, Simmons Jr. GM, and Wharton RA, Jr. 1988. Climatological observations and predicted sublimation rates at Lake Hoare, Antarctica. Journal of Climate 1:715-728.
Issue 1. It changes the forename "GD" to "Gd", "CP" to "Cp", etc.
Issue 2. It captures Jr. as a surname and tags "GM" as a separate surname without a forename.
For the suffix issue, I have attached a PDF from NCBI related to grobid-trainer/resources/dataset/name/header/corpus/1468-6708-3-10.authors.tei.xml: 1468-6708-3-10.pdf