anafandon closed this issue 2 years ago
@anafandon thank you for the interesting case.
I did not manage to test it myself; however, the cause seems to be related to the consolidation service.
The grobid client is requesting header consolidation (could you please confirm that?), and grobid uses Crossref,
which, in this case, seems to return the wrong result.
In fact, if I search Crossref with the title + the first author's last name, I obtain the same wrong result you are pointing out: https://search.crossref.org/?from_ui=&q=Understanding+the+Behaviors+of+BERT+in+Ranking+Qiao
You could test by calling grobid via cURL using the consolidation service (ref):
curl -v --form consolidateHeader=1 --form input=@./understanding_bert.pdf localhost:8070/api/processHeaderDocument
and see whether you obtain the same wrong result.
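To make the comparison repeatable, here is a minimal Python sketch that runs the same curl call with consolidateHeader set to 0 and to 1 and prints the title found in each response. The helper names (build_curl_args, extract_title, compare_titles) are hypothetical. It assumes curl is on the PATH, that the server answers with TEI XML, and that the title sits under titleStmt in the TEI namespace.

```python
import subprocess
import xml.etree.ElementTree as ET

# Assumption: GROBID returns TEI XML using the standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
GROBID_URL = "localhost:8070/api/processHeaderDocument"


def build_curl_args(pdf_path, consolidate_header):
    """Build the argument list for the curl call shown above (hypothetical helper)."""
    return [
        "curl", "-s",
        "--form", f"consolidateHeader={consolidate_header}",
        "--form", f"input=@{pdf_path}",
        GROBID_URL,
    ]


def extract_title(tei_xml):
    """Return the first title under titleStmt, or None if there is none."""
    root = ET.fromstring(tei_xml)
    node = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    if node is None:
        return None
    text = "".join(node.itertext()).strip()
    return text or None


def compare_titles(pdf_path):
    """Call GROBID without (0) and with (1) header consolidation; map flag -> title."""
    titles = {}
    for flag in (0, 1):
        result = subprocess.run(
            build_curl_args(pdf_path, flag),
            capture_output=True, text=True, check=True,
        )
        titles[flag] = extract_title(result.stdout)
    return titles


# Usage, with a GROBID server running locally:
#   print(compare_titles("./understanding_bert.pdf"))
```

If the two titles differ, the consolidation step is the one rewriting the header metadata.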
In general, if you are processing large amounts of PDF documents, we recommend using biblio-glutton instead of Crossref: it offers slightly better results (ref) and fewer traffic limitations than the overloaded Crossref service.
Hi, @lfoppiano , thanks a lot for the instant response.
I ran the curl command you suggested, and I now get the wrong title with curl as well. Setting consolidateHeader to 0 in my Python API call gives me the correct result I want.
In fact, it seems this is exactly what I needed: I never wanted to use CrossRef. I just needed grobid to give me the results without any consolidation at all, since I have my own system to do that.
I consider my issue closed.
Thanks a lot, and again, congrats on all the hard work you've put into grobid so far!
Cheers
Hi @anafandon
To complement Luca's answer:
You can have a look at the documentation on consolidation, which explains the pros and cons, and why it is activated by default in the Python client.
CrossRef is not very satisfactory, but it is easily available. biblio-glutton as a consolidation service will actually give you a valid matching/consolidation for your document, avoiding this error for instance (you can test your document with the online demo to see a biblio-glutton consolidation - if the demo is not down :).
Drawback of biblio-glutton: it's very heavy to install because it indexes the whole CrossRef metadata.
Thanks a lot for your response as well @kermitt2 :)
I actually read the documentation quite a few times, since it is well written. The problem is that if you are a newbie, it takes time to "digest" the concept of consolidation and set your expectations accordingly. The info is there, but your brain takes time to comprehend it!
Regarding biblio-glutton, I actually already have a huge elasticsearch cluster with millions of metadata records, so now I can do the consolidation myself. What would be helpful, though, is if you could point me to the file in biblio-glutton's github where you actually perform the elasticsearch query for finding the proper reference. I tried to find it but couldn't, because I am mostly a Python person.
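For a home-grown consolidation step like this, one simple safeguard against the mismatch reported in this thread is to reject a candidate record whose title is too far from the title grobid extracted. The following is a minimal sketch of that idea using Python's standard difflib. It is not how biblio-glutton or grobid perform matching; the function names and the 0.9 threshold are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(cleaned.split())


def titles_match(extracted, candidate, threshold=0.9):
    """Accept a candidate only if its title is close to the extracted one.

    The threshold is an illustrative assumption, not a tuned value.
    """
    ratio = SequenceMatcher(
        None, normalize_title(extracted), normalize_title(candidate)
    ).ratio()
    return ratio >= threshold


extracted = "Understanding the Behaviors of BERT in Ranking"
wrong_hit = "Understanding Energy Absorption Behaviors of Nanoporous Materials"

# The wrong record returned in this thread is rejected by the check.
print(titles_match(extracted, wrong_hit))
# A candidate that differs only in case and punctuation is accepted.
print(titles_match(extracted, "Understanding the behaviors of BERT in ranking."))
```

A check like this would normally be combined with other signals (first author, year, DOI when available), since a title comparison alone can still reject valid matches with heavily reformatted titles.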
Dear grobid team,
I hope you are good and healthy. I'll jump straight to the problem.
INFO
version_used: docker image grobid/grobid:0.7.0
PROBLEM
For several PDFs, the Python grobid client gives incorrect results (e.g. it returns the title "Understanding Energy Absorption Behaviors of Nanoporous Materials" instead of "Understanding the Behaviors of BERT in Ranking"), while when I test with the curl request
curl -v --form input=@./understanding_bert.pdf localhost:8070/api/processHeaderDocument
I get the proper title "Understanding the Behaviors of BERT in Ranking".
LOGS
The weird things are the following two: (1) I am getting a warning about an invalid header cookie, and I don't know whether this is causing the problem. (2) Whenever I make a request with the Python client and then check the logs, I can see that the title found is actually the proper one!! :
However, the TEI file returned has the wrong title:
Whereas whenever I do a curl request, the logs look normal:
and the TEI file returned has the correct title:
YAML FILE
Also, in the YAML file I am using the consolidation service "crossref" (I mention this since I think it might have something to do with it).
QUESTION
What do you think is happening? :)