kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

[Header Consolidation problem] Crossref provided bad search result, GROBID consumed it #680

Closed andrei-volkau closed 3 years ago

andrei-volkau commented 3 years ago

Consider the following document as an example: How to Read A Legal Opinion.pdf. GROBID recognizes the title and the author correctly without the "Header consolidation" option.

<title level="a" type="main">HOW TO READ A LEGAL OPINION A GUIDE FOR NEW LAW STUDENTS</title>
<persName>
    <forename type="first">Orin</forename>
    <forename type="middle">S</forename>
    <surname>Kerr</surname>
</persName>

The "Header consolidation" option lets GROBID query the Crossref REST API, which generally makes header parsing more accurate. That was not the case for the document above, however. With consolidation enabled, GROBID made the following request.

INFO  [2020-11-29 07:55:22,296] org.grobid.core.utilities.crossref.CrossrefRequestTask:  (,query.title=HOW TO READ A LEGAL OPINION A GUIDE FOR NEW LAW STUDENTS,rows=1,query.author=Kerr): New request in the pool
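To make the request easier to reproduce outside GROBID, here is a minimal Python sketch that builds an equivalent query URL from the parameters shown in the log line. The helper name `build_crossref_query` is hypothetical, and the sketch assumes the parameters map one-to-one onto the public `https://api.crossref.org/works` endpoint; it is not GROBID's actual client code. Crossref may also treat `query.title` as deprecated in favor of `query.bibliographic`, so check the current API documentation before relying on it.

```python
from urllib.parse import urlencode

# Public Crossref REST API endpoint for works.
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_crossref_query(title: str, author_surname: str, rows: int = 1) -> str:
    """Build a works query URL from the fields shown in the GROBID log line.

    Hypothetical helper for illustration; parameter names are copied from the log.
    """
    params = {
        "query.title": title,
        "query.author": author_surname,
        "rows": rows,  # rows=1 means only the single top-ranked hit comes back
    }
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


url = build_crossref_query(
    "HOW TO READ A LEGAL OPINION A GUIDE FOR NEW LAW STUDENTS", "Kerr"
)
print(url)
```

Because `rows=1` is used, only the top-ranked hit is returned, so a wrong top hit leaves no alternative candidate to fall back on.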

With consolidation enabled, GROBID returned the following result instead.

<title level="a" type="main">How to end mass imprisonment: The legal and cultural strategies of Bryan StevensonReview Essay of Just Mercy, Bryan Stevenson (Random House, 2014) ISBN 978-0-8129</title>
<idno type="DOI">10.3138/utlj.2016r3</idno>

So Crossref did not find the original document, but it did find a document with a partially similar title and an author with the same surname: the author of the original paper has the surname Kerr, and so does an author of the returned paper.

Question: Is it possible to prevent GROBID from making such incorrect decisions? I understand that the problem seems to be on the Crossref side, but is it possible to check the validity of the response coming from Crossref in order to prevent such errors? Thank you in advance for any thoughts on that!

kermitt2 commented 3 years ago

Hi @andrei-volkau !

Indeed, this issue is hard to fix when using CrossRef. Basically, the best record obtained from CrossRef (via a "citation string" search) is compared in a simple manner with the metadata extracted by GROBID, to decide whether or not to discard the consolidation.

For something more accurate, you can use the "in house" alternative, https://github.com/kermitt2/biblio-glutton (see https://github.com/kermitt2/grobid/issues/616).

It's heavy to install (it requires a full CrossRef metadata dump, a long indexing step, and quite a lot of free SSD space), but then it scales without issue. Because biblio-glutton performs both "blocking" (the search) and "matching" (a pairwise soft match of metadata fields), it is more accurate at filtering out spurious consolidations.
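To illustrate what a "matching" step does with the candidate returned by the search, here is a minimal Python sketch of a pairwise soft match on title and first-author surname. It is not biblio-glutton's actual matching logic: the function names, the use of `difflib.SequenceMatcher`, and the 0.8 threshold are all assumptions chosen for illustration. The two records are the ones from this issue.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so case and layout differences are ignored."""
    return " ".join(text.lower().split())


def title_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b), autojunk=False).ratio()


def accept_consolidation(extracted: dict, candidate: dict, min_title_sim: float = 0.8) -> bool:
    """Keep the candidate only if the surname matches AND the titles are close.

    The 0.8 threshold is an arbitrary assumption, not a value used by GROBID.
    """
    same_surname = normalize(extracted["surname"]) == normalize(candidate["surname"])
    close_title = title_similarity(extracted["title"], candidate["title"]) >= min_title_sim
    return same_surname and close_title


# Metadata extracted by GROBID without consolidation.
extracted = {
    "title": "HOW TO READ A LEGAL OPINION A GUIDE FOR NEW LAW STUDENTS",
    "surname": "Kerr",
}
# Top hit returned by the search in this issue (same surname, different paper).
crossref_hit = {
    "title": (
        "How to end mass imprisonment: The legal and cultural strategies of "
        "Bryan StevensonReview Essay of Just Mercy, Bryan Stevenson "
        "(Random House, 2014) ISBN 978-0-8129"
    ),
    "surname": "Kerr",
}

print(accept_consolidation(extracted, crossref_hit))
```

The surname alone matches here, which is why the check has to combine several fields: the title comparison is what rejects the spurious candidate.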

In your example, biblio-glutton consolidation indeed avoids the error.

INFO  [2020-11-29 16:32:04,199] org.grobid.core.utilities.glutton.GluttonClient:  (,parseReference=false,query.title=HOW TO READ A LEGAL OPINION A GUIDE FOR NEW LAW STUDENTS,query.author=Kerr): .. executing
INFO  [2020-11-29 16:32:05,260] org.grobid.core.utilities.Consolidation: Consolidation service returns error (404) : Not Found

If you want to test biblio-glutton, you can use the public demo with the settings below. Note, however, that it runs on an old CrossRef dump and is only a demo:

grobid-home/config/grobid.properties

#-------------------- consolidation --------------------
# Define the bibliographical data consolidation service to be used, either "crossref" for CrossRef REST API or "glutton" for https://github.com/kermitt2/biblio-glutton
#grobid.consolidation.service=crossref
grobid.consolidation.service=glutton
org.grobid.glutton.host=cloud.science-miner.com/glutton
org.grobid.glutton.port=0

andrei-volkau commented 3 years ago

Hi @kermitt2!

Many thanks for the details regarding biblio-glutton. I did not understand how it works before. I am closing the question.