kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Consolidate Headers causing completely wrong paper headers to show up #461

Closed DavidBegert closed 5 years ago

DavidBegert commented 5 years ago

I am using the web service, and testing this pdf doc: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6571572/pdf/jcm-08-00590.pdf

When I run "Process Fulltext Document" with the 'consolidate headers' option on, the title that gets extracted is "High-dose versus standard-dose daunorubicin in induction therapy for young patients withde novoacute myeloid leukaemia: a meta-analysis of randomised trials". I am not sure where it is getting this title from, as neither the title nor its DOI is mentioned anywhere in the uploaded PDF.

Any idea what is going on? I saw the same thing with another pdf file as well. Should I just run without consolidating headers?

Thanks!
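
To compare the two settings outside the web console, the same request can be sent to the REST API directly. This is a minimal sketch, assuming a GROBID server running locally on port 8070 and the third-party `requests` package; the helper names are made up for illustration.

```python
from typing import Dict, Tuple

# Assumption: a GROBID server reachable at this address.
GROBID_URL = "http://localhost:8070"


def build_fulltext_request(consolidate_header: bool) -> Tuple[str, Dict[str, str]]:
    """Return the endpoint URL and form fields for a full-text request."""
    url = f"{GROBID_URL}/api/processFulltextDocument"
    # consolidateHeader is "1" to consolidate header metadata, "0" to skip it.
    data = {"consolidateHeader": "1" if consolidate_header else "0"}
    return url, data


def process_fulltext(pdf_path: str, consolidate_header: bool) -> str:
    """Send the PDF to GROBID and return the TEI XML response as text."""
    import requests  # third-party; imported lazily so the helper above stays dependency-free

    url, data = build_fulltext_request(consolidate_header)
    with open(pdf_path, "rb") as pdf:
        # The PDF is uploaded as the multipart field named "input".
        response = requests.post(url, files={"input": pdf}, data=data, timeout=120)
    response.raise_for_status()
    return response.text
```

Calling `process_fulltext` once with `True` and once with `False` on the same file makes it easy to see whether the unexpected title comes from the extraction itself or from the consolidation step.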

kermitt2 commented 5 years ago

Hello !

Interesting error, thank you David. Actually it's working more or less okay when the consolidation service is CrossRef.

Here is the reason for this weird error:

So one way to fix this issue is to force a post-validation, which would prevent such a completely wrong consolidation from appearing. I am going to fix that, but overall I would need to improve the consolidation of headers (better merging of what is extracted and what the consolidation service provides).
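
The idea of post-validation can be sketched as follows: accept the record returned by the consolidation service only if its title is close enough to the title extracted from the PDF. This is an illustration, not GROBID's actual implementation; the dictionary layout, the use of `difflib`, and the 0.8 threshold are all assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase and collapse every run of non-ASCII-alphanumeric characters to one space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def title_similarity(extracted: str, consolidated: str) -> float:
    """Similarity ratio in [0, 1] between the two normalized titles."""
    return SequenceMatcher(None, normalize(extracted), normalize(consolidated)).ratio()


def post_validate(extracted: dict, consolidated: dict, threshold: float = 0.8) -> dict:
    """Return the consolidated record only if its title agrees with the extracted one.

    The threshold is a hypothetical value chosen for the example.
    """
    if not extracted.get("title") or not consolidated.get("title"):
        # Nothing to compare against, so keep what was extracted.
        return extracted
    if title_similarity(extracted["title"], consolidated["title"]) >= threshold:
        return consolidated
    return extracted
```

With a check like this, a consolidated record whose title has nothing in common with the extracted one is discarded and the extracted header is kept unchanged.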

Another aspect would be to try to be more up-to-date with CrossRef data in biblio-glutton (at least in the online web service). We could also use CrossRef as a fallback when glutton fails or for recent publication dates (the publication date is correctly extracted, so we would know that it's too recent for biblio-glutton), or limit the usage of CrossRef to header metadata and keep glutton for bibliographical references (biblio-glutton is faster, more scalable and more reliable than CrossRef for matching).
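
The first of these options (CrossRef as a fallback, or for publications newer than the glutton data) could look roughly like this. It is only a sketch: the lookup callables, the `published` field and the `glutton_snapshot` date are hypothetical names introduced for the example, not part of GROBID or biblio-glutton.

```python
from datetime import date
from typing import Callable, Optional

# A lookup takes the extracted record and returns a matched record, or None on failure.
Lookup = Callable[[dict], Optional[dict]]


def consolidate(
    record: dict,
    glutton_lookup: Lookup,
    crossref_lookup: Lookup,
    glutton_snapshot: date,
) -> Optional[dict]:
    """Pick the consolidation service based on the extracted publication date."""
    published = record.get("published")
    if published is not None and published > glutton_snapshot:
        # Too recent to be in the glutton data: go straight to CrossRef.
        return crossref_lookup(record)
    match = glutton_lookup(record)
    if match is not None:
        return match
    # Glutton found nothing: fall back to CrossRef.
    return crossref_lookup(record)
```

Older records keep the speed of glutton, and only recent or unmatched ones pay the cost of a CrossRef query.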

Sometimes freshness is more important than speed and accuracy :)

kermitt2 commented 5 years ago

I pushed a fix with 9a203381985cc9d9759ae98efd7b5c1cebea7afc. Post-validation is now applied as expected, preventing the wrong paper metadata from being used. For testing, the web demo has been updated too.

DavidBegert commented 5 years ago

Great - Thanks a lot for your speedy reply and resolution! :)