Open bananaoomarang opened 5 years ago
Hi @bananaoomarang! Thanks, this is an interesting case, related to #373 and #136.
The problems are all due to the PDF stream order. You can see the issue simply by trying to select the text of the PDF in the web browser's PDF viewer.
Using the reading order solves the problem. It is implemented but not activated by default, because I first need to retrain the header model, and for that I need to put the training data of the header model in reading order rather than the current PDF stream order...
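To make the difference between the two orders concrete, here is a minimal sketch. It is not Grobid's actual algorithm: the `Block` type, the coordinates and the `column_split` parameter are all hypothetical, and the page layout is invented for illustration. It only shows how text blocks that arrive in an arbitrary stream order can be re-sorted by position.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x: float  # left edge of the block (hypothetical page units)
    y: float  # top edge of the block, with y increasing down the page


def reading_order(blocks, column_split):
    """Sort blocks column by column, then top to bottom within a column.

    Blocks whose left edge is at or beyond column_split are treated as
    belonging to the right-hand column. This is an illustrative
    heuristic only, not what Grobid implements.
    """
    return sorted(blocks, key=lambda b: (b.x >= column_split, b.y, b.x))


# Hypothetical stream order: the abstract is emitted before the title.
stream = [
    Block("Background We undertook an epidemiological study ...", 40, 300),
    Block("ARTICLES", 40, 20),
    Block("Methods ...", 320, 300),
    Block("Brent Taylor, Elizabeth Miller, ...", 40, 120),
    Block("Autism and measles, mumps, and rubella vaccine ...", 40, 60),
]

ordered = [b.text for b in reading_order(stream, column_split=300)]
```

With this input, `ordered` starts with the section label, then the title, the authors and the abstract, and ends with the right-hand column block.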
The reading order can be enabled now by un-commenting this line: https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/document/DocumentSource.java#L97. With it, we then get:
...
<biblStruct>
<analytic>
<author>
<persName
xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">Brent</forename>
<surname>Taylor</surname>
</persName>
</author>
<author>
<persName
xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">Elizabeth</forename>
<surname>Miller</surname>
</persName>
</author>
<author>
<persName
xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">Paddy</forename>
<surname>Farrington</surname>
</persName>
</author>
<author>
<persName
xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">Maria-Christina</forename>
<surname>Petropoulos</surname>
</persName>
</author>
<author>
<persName
xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">Isabelle</forename>
<surname>Favot-Mayaud</surname>
</persName>
</author>
<author>
<persName
xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">Jun</forename>
<surname>Li</surname>
</persName>
</author>
<author>
<persName
xmlns="http://www.tei-c.org/ns/1.0">
<forename type="first">Pauline</forename>
<forename type="middle">A</forename>
<surname>Waight</surname>
</persName>
</author>
<title level="a" type="main">Autism and measles, mumps, and rubella vaccine: no epidemiological evidence for a causal association</title>
</analytic>
<monogr>
<imprint>
<date/>
</imprint>
</monogr>
<note>ARTICLES</note>
</biblStruct>
</sourceDesc>
</fileDesc>
...
<profileDesc>
<abstract>
<div
xmlns="http://www.tei-c.org/ns/1.0">
<p>Background We undertook an epidemiological study to investigate whether measles, mumps, and rubella (MMR) vaccine may be causally associated with autism.</p>
</div>
</abstract>
</profileDesc>
which is much better, and header consolidation will also work.
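For anyone who wants to check a result like this automatically, here is a small sketch that pulls the main title and the author names out of a `biblStruct` fragment. The function name `header_summary` is hypothetical, and the sample is trimmed from the output above. Tags are matched by local name, so the code does not depend on where the TEI namespace is declared.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def header_summary(bibl_xml):
    """Return (main title or None, list of author names) for a fragment."""
    root = ET.fromstring(bibl_xml)
    title = None
    authors = []
    for el in root.iter():
        name = local(el.tag)
        if name == "title" and el.get("type") == "main":
            # An empty or whitespace-only title is reported as None.
            title = (el.text or "").strip() or None
        elif name == "persName":
            parts = [
                (child.text or "").strip()
                for child in el
                if local(child.tag) in ("forename", "surname")
            ]
            authors.append(" ".join(p for p in parts if p))
    return title, authors


# Sample trimmed from the output above (two of the seven authors).
sample = """<biblStruct>
  <analytic>
    <author>
      <persName xmlns="http://www.tei-c.org/ns/1.0">
        <forename type="first">Brent</forename>
        <surname>Taylor</surname>
      </persName>
    </author>
    <author>
      <persName xmlns="http://www.tei-c.org/ns/1.0">
        <forename type="first">Pauline</forename>
        <forename type="middle">A</forename>
        <surname>Waight</surname>
      </persName>
    </author>
    <title level="a" type="main">Autism and measles, mumps, and rubella vaccine: no epidemiological evidence for a causal association</title>
  </analytic>
</biblStruct>"""

title, authors = header_summary(sample)
```

A `None` title or an empty author list would flag the kind of header failure reported in this issue.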
Putting the training data of the header model in reading order should then fix the rest (abstract and affiliations). So Grobid is on track to fix this whole range of reading order issues, and it's mostly a matter of development time to get it finalised.
I understand that Grobid will not extract perfect metadata in every case, but I think this one is interesting nonetheless.
For 10.1016/S0140-6736(99)01239-8, neither the title nor the authors are present in the teiHeader section of the document when I run it through Grobid. Title is blank and authors look like:
But the document itself seems fairly clear to the human eye, so may be a good test case!