Consolidate Headers causing completely wrong paper headers to show up

DavidBegert commented 5 years ago

I am using the web service, and testing this pdf doc: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6571572/pdf/jcm-08-00590.pdf

When I run "Process Fulltext Document" with the 'consolidate headers' option on, the title that gets extracted is "High-dose versus standard-dose daunorubicin in induction therapy for young patients withde novoacute myeloid leukaemia: a meta-analysis of randomised trials". I am not sure where it is getting this title from, as neither it nor its DOI is mentioned anywhere in the uploaded PDF.

Any idea what is going on? I saw the same thing with another pdf file as well. Should I just run without consolidating headers?

Thanks!

kermitt2 commented 5 years ago

Hello !

Interesting error, thank you David. Actually it's working more or less okay when the consolidation service is CrossRef.

Here is the reason for this weird error:

GROBID extracts almost everything well, including the DOI and the authors. Only the title is a bit incorrect: "Clinical Medicine" is concatenated to the rest of the correct title.
With CrossRef as consolidation service, the DOI matches and provides a corrected title. However, the author names on CrossRef do not have forenames, so it's not very good either.
With biblio-glutton as consolidation service (as with the web service demo), the DOI is not matched because the article is too recent: biblio-glutton is built with a snapshot of CrossRef from September 2018, and the article is from April 29, 2019.
As a fallback, biblio-glutton tries to match a DOI record based on the extracted title and first author, and the best it finds is an article with this completely wrong title, authors and so on.
Unfortunately, the post-validation step, which checks that the DOI record found by soft matching is not too far from the extracted title and first author name, was not activated. So this bad result is not filtered out, and we get this completely wrong consolidation.
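
To illustrate the idea of post-validation described above, here is a minimal Python sketch. It is not GROBID's or biblio-glutton's actual code: the function names, the use of `difflib` for string similarity, and the two thresholds are all assumptions chosen for the example. It only shows the principle of rejecting a soft-matched record that is too far from the extracted title and first author.

```python
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial differences are ignored.
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means the normalized strings are identical.
    return SequenceMatcher(None, _normalize(a), _normalize(b)).ratio()


def accept_soft_match(
    extracted_title: str,
    extracted_first_author: str,
    candidate_title: str,
    candidate_first_author: str,
    title_threshold: float = 0.7,   # hypothetical value, to be tuned
    author_threshold: float = 0.8,  # hypothetical value, to be tuned
) -> bool:
    # Keep the candidate record only if BOTH the title and the first
    # author surname are close enough to what was extracted from the PDF.
    title_ok = similarity(extracted_title, candidate_title) >= title_threshold
    author_ok = (
        similarity(extracted_first_author, candidate_first_author)
        >= author_threshold
    )
    return title_ok and author_ok
```

With a check like this, a candidate whose title only differs by a prefixed journal name would still pass, while a record with an unrelated title and a different first author would be discarded, leaving the extracted metadata untouched.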

So one way to fix this issue is to force a post-validation, which would prevent such a completely wrong consolidation from appearing. I am going to fix that, but overall I would need to improve the consolidation of headers (better merging of what is extracted and what the consolidation service provides).

Another aspect would be to try to be more up-to-date with CrossRef data in biblio-glutton (at least in the online web service). We could also use CrossRef as a fallback when glutton fails or for recent publication dates (the publication date is correctly extracted, so we would know that it's too recent for biblio-glutton), or limit the usage of CrossRef to header metadata and keep glutton for bibliographical references (biblio-glutton is faster, more scalable and more reliable than CrossRef for matching).
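
The date-based fallback idea could look roughly like the following Python sketch. This is only an illustration, not something implemented in GROBID: the function name, the service labels, and the exact snapshot day (only "September 2018" is known, so the first of the month is assumed) are hypothetical.

```python
from datetime import date
from typing import Optional

# Assumption: the snapshot is only known to be from September 2018,
# so the first day of that month is used as a conservative cut-off.
GLUTTON_SNAPSHOT_DATE = date(2018, 9, 1)


def pick_header_service(
    publication_date: Optional[date],
    snapshot_date: date = GLUTTON_SNAPSHOT_DATE,
) -> str:
    # If the extracted publication date is later than the snapshot,
    # the local index cannot contain the record, so ask CrossRef.
    if publication_date is not None and publication_date > snapshot_date:
        return "crossref"
    # Otherwise (older paper, or no date extracted) use the faster
    # local service.
    return "glutton"
```

For the article in this issue, `pick_header_service(date(2019, 4, 29))` would select CrossRef, since the publication date is after the assumed snapshot date.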

Sometimes freshness is more important than speed and accuracy :)

kermitt2 commented 5 years ago

I pushed a fix with 9a203381985cc9d9759ae98efd7b5c1cebea7afc. Post-validation is now applied as expected, which prevents the wrong paper metadata from being used. For testing, the web demo has been updated too.

DavidBegert commented 5 years ago

Great - Thanks a lot for your speedy reply and resolution! :)

kermitt2 / grobid

Consolidate Headers causing completely wrong paper headers to show up #461