kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Missing content - Processing citation #304

Open dhanayanshankar opened 6 years ago

dhanayanshankar commented 6 years ago

Hi Team,

We are using the tool to identify entities. Please see the scenario below.

When we call the process citation service with this sample:

Wanberg, R., Welsh, E. T. & Hezlett, S.A. (2003) Mentoring research: A review and dynamic process model, Research in Personnel and Human Resources Management, 22, 39–124.

The result is:

`

Mentoring research: A review and dynamic process model, Research in Personnel and Human Resources Management R Wanberg E T Welsh S A Hezlett 22

`

We get the expected result quickly, and we are amazed by its speed and accuracy. :+1: :+1:
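For readers who want to reproduce the call described above, here is a minimal sketch in Python. It assumes a GROBID service running locally on port 8070 and the `/api/processCitation` endpoint with a form field named `citations`, as described in the GROBID service documentation; the host, port, and helper function names are assumptions for illustration and may differ in your deployment or GROBID version.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: a local GROBID service on port 8070; adjust to your setup.
GROBID_URL = "http://localhost:8070/api/processCitation"


def build_citation_request(citation: str, base_url: str = GROBID_URL) -> Request:
    """Build (but do not send) a form-encoded POST for one raw citation string."""
    body = urlencode({"citations": citation}).encode("utf-8")
    return Request(
        base_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def process_citation(citation: str, base_url: str = GROBID_URL) -> str:
    """Send the request and return the response body (TEI XML) as text."""
    with urlopen(build_citation_request(citation, base_url)) as response:
        return response.read().decode("utf-8")
```

Calling `process_citation(...)` with the sample citation requires a running service; `build_citation_request(...)` can be inspected offline to check what would be sent.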

But can we get the resulting TEI XML with the input text kept in its original order, and with tags inserted around entities such as forename, surname, title, and issue, in the format shown below?

The result below retains all punctuation, such as ,.-: and the spaces.

<surname>Wanberg</surname>, <forename type="first">R</forename>., <surname>Welsh</surname>, <forename type="first">E</forename>. <forename type="middle">T</forename>. & <surname>Hezlett</surname>, <forename type="first">S</forename>.<forename type="middle">A</forename>. (2003) <title level="m" type="main">Mentoring research: A review and dynamic process model, Research in Personnel and Human Resources Management</title>, <biblScope unit="volume">22</biblScope>, <biblScope unit="page">39–124</biblScope> .

In the above case we would not lose any commas, dots, semicolons, etc. [,.;:"], nor any content that the machine did not identify.

This would help us check whether any content is missing from the TEI XML after parsing.
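One way to approximate this check without changing the TEI output is to compare the tokens of the input citation with the tokens found anywhere in the returned XML. The sketch below is only an illustration under stated assumptions: the `tokenize`, `tei_tokens`, and `missing_tokens` helpers are hypothetical names, not part of GROBID, and attribute values are included in the comparison because some fields (dates, page ranges) may be stored as attributes rather than element text.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def tokenize(text: str) -> list:
    """Split into word tokens and single punctuation marks, dropping whitespace."""
    return re.findall(r"\w+|[^\w\s]", text)


def tei_tokens(tei_xml: str) -> Counter:
    """Count tokens found in element text, tails, and attribute values."""
    parts = []
    for element in ET.fromstring(tei_xml).iter():
        parts.extend(element.attrib.values())
        if element.text:
            parts.append(element.text)
        if element.tail:
            parts.append(element.tail)
    return Counter(tokenize(" ".join(parts)))


def missing_tokens(raw_citation: str, tei_xml: str) -> list:
    """Return input tokens, in input order, with no counterpart left in the TEI."""
    remaining = tei_tokens(tei_xml)
    missing = []
    for token in tokenize(raw_citation):
        if remaining[token] > 0:
            remaining[token] -= 1  # consume one matching occurrence
        else:
            missing.append(token)
    return missing
```

Because attribute values such as `volume` or `page` are also counted, an input word that happens to equal an attribute value could be masked; a stricter version would whitelist the attributes it reads.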

Thanks, Dhanayan Shankar

kermitt2 commented 6 years ago

Hello @dhanayanshankar !

This was already the subject of issue #221.

Basically, no, it is not possible to get that in the resulting TEI output. The TEI output is a normalized version representing the logical structure of the content. Representing both logical structure and presentation structure in the same XML document is too complicated, and usually impossible for many models (overlapping elements, non-connected elements, etc.).

To get the position information, you can generate another XML/TEI document capturing presentation only, which is what the generation of training data for a given model does. For instance, if you call the batch for creating training data from the PDF containing this citation, you will get the serialized version with the tags, with the following consequences:

Another way to answer the question is that GROBID has been designed for logical structure extraction, not for republishing an existing presentation. If some content is missing from the structured result, it means that it has been interpreted as "dirt" (like the [,.;:"()\n ... ]) or as functional words. The way to fix this is to generate training data for this example (with the dirt) and update the model by retraining.
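To decide whether a missing piece is harmless "dirt" or a sign that retraining is needed, a rough triage can help. The sketch below assumes you already have a list of tokens that appear in the input citation but not in the TEI result; the `split_dirt` helper and the `FUNCTION_WORDS` stop list are hypothetical illustrations, and GROBID's own treatment of functional words is learned by its models rather than defined by a list like this.

```python
# Assumption: a small illustrative stop list, not anything GROBID itself uses.
FUNCTION_WORDS = {"and", "in", "of", "the", "et", "al", "pp", "vol", "no"}


def split_dirt(tokens):
    """Separate punctuation and function words from tokens that look like content."""
    dirt, content = [], []
    for token in tokens:
        # A token with no letter or digit at all is treated as punctuation.
        is_punctuation = not any(ch.isalnum() for ch in token)
        if is_punctuation or token.lower() in FUNCTION_WORDS:
            dirt.append(token)
        else:
            content.append(token)
    return dirt, content
```

Tokens that end up in the second list (names, numbers, title words) are the ones worth turning into corrected training examples; those in the first list are expected to be dropped from the normalized output.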