kermitt2 / biblio-glutton

A high performance bibliographic information service: https://biblio-glutton.readthedocs.io

GROBID call too async :) #22

Closed kermitt2 closed 5 years ago

kermitt2 commented 5 years ago

When the query is biblio-only (no author given), the call to GROBID is async, and its response arrives after the non-validated result has already been returned, so there is no post-validation in this case.

https://github.com/kermitt2/biblio-glutton/blob/fb46c956265a02dabb7045988c3a8b25f2b9446c/lookup/src/main/java/com/scienceminer/lookup/storage/LookupEngine.java#L281

On this line the call to GROBID is sent, but the response only arrives after we have reached this part:

https://github.com/kermitt2/biblio-glutton/blob/fb46c956265a02dabb7045988c3a8b25f2b9446c/lookup/src/main/java/com/scienceminer/lookup/storage/LookupEngine.java#L300

and the non-validated DOI match is returned first, whatever GROBID returns as author (or no author).

example query: ?biblio=Reporting Hospital Quality Data for Annual Payment Update. Avail- able at: http://www.cms.gov/Medicare/Quality-Initiatives-Patient- Assessment-Instruments/HospitalQualityInits/Downloads/Hospital- RHQDAPU200808. Accessed December 18, 2013

-> match DOI 10.1037/e556322006-027 (wrong one)

-> no author in this DOI, no author found by GROBID, but glutton returns this DOI record as result

-> the call to GROBID should be sync, or we need a way to wait for its answer before going on with injectIdsByDoi(), or GROBID should be integrated as a library.
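A minimal sketch of the second option (waiting for the answer), not the actual LookupEngine code. The class `PostValidationSketch` and the method `lookup` are hypothetical, the GROBID call is stood in for by a plain `CompletableFuture`, and rejecting the match when no author is available is an assumption about the desired policy:

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class PostValidationSketch {

    // Hypothetical lookup step: returns the candidate DOI only once the author
    // parsed by GROBID has been compared with the author of the matched record.
    static Optional<String> lookup(String candidateDoi,
                                   Optional<String> recordAuthor,
                                   CompletableFuture<Optional<String>> grobidAuthor) {
        Optional<String> parsedAuthor;
        try {
            // Block (with a timeout) until GROBID answers, instead of returning early.
            parsedAuthor = grobidAuthor.get(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } catch (ExecutionException | TimeoutException e) {
            // Assumption: no answer means no validation, so the match is rejected.
            return Optional.empty();
        }
        // Assumption: a match is valid only if both authors exist and agree.
        boolean validated = parsedAuthor.isPresent()
                && recordAuthor.isPresent()
                && parsedAuthor.get().equalsIgnoreCase(recordAuthor.get());
        return validated ? Optional.of(candidateDoi) : Optional.empty();
    }

    public static void main(String[] args) {
        // Simulated async GROBID call that finds no author, as in the example query.
        CompletableFuture<Optional<String>> noAuthor =
                CompletableFuture.supplyAsync(() -> Optional.<String>empty());
        Optional<String> result =
                lookup("10.1037/e556322006-027", Optional.empty(), noAuthor);
        System.out.println(result.isPresent() ? "match returned" : "match rejected");
    }
}
```

With this shape the wrong DOI from the example query is no longer returned, because the comparison happens only after the future has completed.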

lfoppiano commented 5 years ago

It was a nasty bug :-)

It should be fine now. However, I think some of the problems at #21 might be solved; a quick test gave me "Best bibliographical record did not passed the post-validation".

kermitt2 commented 5 years ago

In fact, the StAX parser for the GROBID response was taking the last author's last name instead of the first author's last name, so post-validation via the additional GROBID call was always failing. It was not related to #21 (fixed with 763d43b54c98e47e2a7a6e0744c3a97f119714cf).

It looks good now after a few tests! cool !!

lfoppiano commented 5 years ago

OK, actually it was overriding the author and only the last one was left... I misunderstood the 'type=first' attribute... I thought it was "first = first author", not "first = first name"...
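For illustration, a minimal StAX sketch of the corrected behaviour, not the code from the fix commit. The class `FirstAuthorSketch`, the method `firstAuthorSurname`, and the sample TEI fragment are hypothetical; the sketch assumes authors appear as `<author><persName><forename type="first">…</forename><surname>…</surname></persName></author>` and that the first `<surname>` in document order belongs to the first author:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class FirstAuthorSketch {

    // Returns the text of the first <surname> element in document order,
    // or null when the document contains no surname.
    static String firstAuthorSurname(String tei) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(tei));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "surname".equals(reader.getLocalName())) {
                        // Return immediately so later authors cannot overwrite the value.
                        return reader.getElementText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse the TEI fragment", e);
        }
    }

    public static void main(String[] args) {
        // type="first" marks the first name (forename), not the first author.
        String tei = "<biblStruct><analytic>"
                + "<author><persName><forename type=\"first\">Ada</forename>"
                + "<surname>Lovelace</surname></persName></author>"
                + "<author><persName><forename type=\"first\">Alan</forename>"
                + "<surname>Turing</surname></persName></author>"
                + "</analytic></biblStruct>";
        System.out.println(firstAuthorSurname(tei)); // prints "Lovelace"
    }
}
```

Returning on the first hit is what prevents the overwrite described above: a parser that assigns the surname on every match ends up holding the last author's name.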