Missing content - Processing citation

kermitt2 / grobid

A machine learning software for extracting information from scholarly documents

Apache License 2.0

Hi Team,

We are using the tool to identify entities. Please see the scenario below.

We call the process citation service with this sample:

Wanberg, R., Welsh, E. T. & Hezlett, S.A. (2003) Mentoring research: A review and dynamic process model, Research in Personnel and Human Resources Management, 22, 39–124.

The result is:

Mentoring research: A review and dynamic process model, Research in Personnel and Human Resources Management R Wanberg E T Welsh S A Hezlett 22
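For readers who want to reproduce the call, here is a minimal sketch of such a request in Python. It assumes a GROBID service running locally on port 8070 and the `citations` form parameter of the `processCitation` endpoint. The helper names `build_citation_request` and `process_citation` are hypothetical and only serve this illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: a GROBID service is reachable locally on port 8070.
GROBID_URL = "http://localhost:8070/api/processCitation"


def build_citation_request(citation: str, url: str = GROBID_URL) -> Request:
    """Build (but do not send) a form-encoded POST for one raw citation string."""
    body = urlencode({"citations": citation}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def process_citation(citation: str, url: str = GROBID_URL) -> str:
    """Send the request and return the response body (TEI XML) as text."""
    with urlopen(build_citation_request(citation, url), timeout=30) as response:
        return response.read().decode("utf-8")
```

Calling `process_citation(...)` with the sample string above should return the structured TEI for that citation, provided the service is up.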

We get the expected result quickly, and we are amazed by its speed and accuracy :+1::+1:

But can we get the resulting TEI XML with the original text kept in its input position, and the tags (forename, surname, title, issue, etc.) inserted around the entities, as in the format below?

The result below contains all the punctuation (such as ,.-:) and the spaces:

<surname>Wanberg</surname>, <forename type="first">R</forename>., <surname>Welsh</surname>, <forename type="first">E</forename>. <forename type="middle">T</forename>. & <surname>Hezlett</surname>, <forename type="first">S</forename>.<forename type="middle">A</forename>. (2003) <title level="m" type="main">Mentoring research: A review and dynamic process model, Research in Personnel and Human Resources Management</title>, <biblScope unit="volume">22</biblScope>, <biblScope unit="page">39–124</biblScope> .

In the above case we would not lose any commas, dots, semicolons, etc. [,.;:"], nor any content that is not identified by the machine.

This would help us identify whether any content is missing from the TEI XML after parsing.

Thanks, Dhanayan Shankar

Hello @dhanayanshankar !

This was already the subject of issue #221.

Basically, no, it is not possible to get that in the resulting TEI output. The TEI output is a normalized version representing the logical structure of the content. Representing both the logical structure and the presentation structure in the same XML document is too complicated, and usually impossible for many models (overlapping elements, non-connected elements, etc.).

To get the position information, you can generate another XML/TEI file capturing presentation only, which is what the generation of training data for a given model does. For instance, if you call the batch for creating training data on the PDF containing this citation, you will get the serialized version with the tags, with the following consequences:

all the syntactic sugar is present (so not only [,.;:"], but also parentheses, EOL, all the dirt)
only one model is applied, so you don't get the sub-segmentation of the author sequence, for instance, or of the date (these are generated in other "training data" files)

Another way to answer the question is that GROBID was not designed for republishing an existing presentation, but for logical structure extraction. If some content is missing in the structured result, it means that it has been interpreted as "dirt" (like the [,.;:"()\n ... ]) or as function words, and the way to fix this is to generate training data for this example (with the dirt) and update the model by retraining.
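To see which parts of an input citation were treated as "dirt", one option is to compare the raw string with the text carried by the structured result. The following is a rough sketch of that comparison, not part of GROBID. The function names are hypothetical, and reading values from the `when`, `from` and `to` attributes is an assumption about where normalized dates and page ranges may end up.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from typing import List

# Assumption: these attributes may carry normalized values (dates, page ranges).
VALUE_ATTRS = ("when", "from", "to")


def tokenize(text: str) -> List[str]:
    """Split into word tokens and single punctuation characters."""
    return re.findall(r"\w+|[^\w\s]", text)


def tei_tokens(tei_xml: str) -> List[str]:
    """Collect tokens from element text, tails, and the selected attributes."""
    parts = []
    for element in ET.fromstring(tei_xml).iter():
        parts.extend(p for p in (element.text, element.tail) if p)
        parts.extend(element.attrib[a] for a in VALUE_ATTRS if a in element.attrib)
    return tokenize(" ".join(parts))


def missing_tokens(raw_citation: str, tei_xml: str) -> List[str]:
    """Return the tokens of the raw citation that the TEI does not account for."""
    available = Counter(tei_tokens(tei_xml))
    missing = []
    for token in tokenize(raw_citation):
        if available[token] > 0:
            available[token] -= 1  # consume one occurrence per match
        else:
            missing.append(token)
    return missing
```

If the returned list contains only punctuation, nothing meaningful was dropped. If it contains words or numbers, that example is a candidate for new training data as described above.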

Missing content - Processing citation #304