andrei-volkau closed this issue 3 years ago
Hi @andrei-volkau !
Indeed, this issue is hard to fix when using CrossRef. Basically, the best record obtained from CrossRef (via a "citation string" search) is compared in a simple manner with the metadata extracted by Grobid, to decide whether to keep or discard the consolidation.
For something more accurate, you can use the "in house" alternative https://github.com/kermitt2/biblio-glutton (see https://github.com/kermitt2/grobid/issues/616).
It's heavy to install (it requires a full CrossRef metadata dump, a long indexing step, and quite a lot of free SSD space), but then it scales without issue. Because biblio-glutton performs both "blocking" (the search) and "matching" (a pairwise soft match of metadata fields), it is more accurate at filtering out spurious consolidations.
In your example, biblio-glutton consolidation indeed avoids the error.
INFO [2020-11-29 16:32:04,199] org.grobid.core.utilities.glutton.GluttonClient: (,parseReference=false,query.title=HOW TO READ A LEGAL OPINION A GUIDE FOR NEW LAW STUDENTS,query.author=Kerr): .. executing
INFO [2020-11-29 16:32:05,260] org.grobid.core.utilities.Consolidation: Consolidation service returns error (404) : Not Found
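To illustrate the "matching" idea mentioned above (a pairwise soft match of metadata fields), here is a minimal sketch of a post-consolidation check. It is not the actual Grobid or biblio-glutton implementation: the function names, the dictionary keys, the thresholds, and the spurious candidate record are all hypothetical and chosen only for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def similarity(a, b):
    """Soft string similarity in [0, 1] on the normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def accept_consolidation(extracted, candidate,
                         title_threshold=0.9, author_threshold=0.8):
    """Keep the candidate only if BOTH title and first author match softly.

    The thresholds are arbitrary assumptions, not values used by Grobid.
    """
    title_score = similarity(extracted["title"], candidate["title"])
    author_score = similarity(extracted["first_author"],
                              candidate["first_author"])
    return title_score >= title_threshold and author_score >= author_threshold


# Metadata as extracted from the PDF (taken from the log line above).
extracted = {
    "title": "HOW TO READ A LEGAL OPINION A GUIDE FOR NEW LAW STUDENTS",
    "first_author": "Kerr",
}

# Hypothetical candidates: one genuine match, one spurious record that
# shares only the surname and the beginning of the title.
genuine = {
    "title": "How to Read a Legal Opinion: A Guide for New Law Students",
    "first_author": "Kerr",
}
spurious = {"title": "How to Read a Paper", "first_author": "Kerr"}

print(accept_consolidation(extracted, genuine))
print(accept_consolidation(extracted, spurious))
```

Requiring both fields to pass is what rejects a record that shares only the author surname; a matching surname alone is weak evidence.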
If you want to test biblio-glutton, you can use the public demo. Note, however, that it runs on an old CrossRef dump and is only a demo:
grobid-home/config/grobid.properties
#-------------------- consolidation --------------------
# Define the bibliographical data consolidation service to be used, either "crossref" for CrossRef REST API or "glutton" for https://github.com/kermitt2/biblio-glutton
#grobid.consolidation.service=crossref
grobid.consolidation.service=glutton
org.grobid.glutton.host=cloud.science-miner.com/glutton
org.grobid.glutton.port=0
Hi @kermitt2!
Many thanks for the details regarding biblio-glutton. I did not understand how it works before. I am closing the question.
Let me consider the following document as an example: How to Read A Legal Opinion.pdf. GROBID is able to recognize the title and the author correctly without using the "Header consolidation" option.
The "Header consolidation" option makes GROBID query the Crossref REST API, which generally makes header parsing more accurate. That was not the case for the above document, however. Let me clarify. GROBID made the following request.
The new result returned by GROBID is the following one.
So we can see that Crossref was not able to find the original document, but it did find a document with a similar title and author: the author of the original paper has the surname Kerr, and an author of the returned paper also has the surname Kerr.
Question: Is it possible to prevent GROBID from making such incorrect decisions? I understand that the problem seems to be on the Crossref side, but is it possible to check the validity of the response coming from Crossref in order to prevent such errors? Thank you in advance for any thoughts on that!