kermitt2 / biblio-glutton

A high performance bibliographic information service: https://biblio-glutton.readthedocs.io

Input data preprocessing to remove noise #55

Open lfoppiano opened 3 years ago

lfoppiano commented 3 years ago

I just found the following problem. Since the data is extracted from a PDF, I'm not sure this is the right place to fix it.

The following DOI: 10.1063/1.1905789͔ comes out with a nasty ...

Crossref finds the record: https://search.crossref.org/?from_ui=&q=10.1063%2F1.1905789%CD%94

The data is extracted from the publisher's version of the manuscript: https://aip.scitation.org/doi/pdf/10.1063/1.1905789

Although I don't think this is glutton lookup's responsibility, a small pre-processing step that strips this kind of noise would be nice to have anyway.

Update: I've checked, and since we look up by DOI directly in LMDB, the matching is rather strict (we already lowercase).
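
A minimal sketch of what such a pre-processing step could look like, written in Python for illustration only. It is not glutton's code. The helper name `clean_doi` and the exact set of characters to drop are assumptions. The stray character in the example above is a combining mark (U+0354, which is what `%CD%94` in the Crossref URL decodes to), so removing combining marks, control characters and whitespace before the exact-match lookup would be enough for this case.

```python
import unicodedata


def clean_doi(raw: str) -> str:
    """Hypothetical helper: strip extraction noise from a DOI before an exact-match lookup."""
    # Decompose first, so marks fused into precomposed characters become separate code points.
    decomposed = unicodedata.normalize("NFKD", raw)
    # Drop combining marks (M*), control/format characters (C*) and separators/whitespace (Z*).
    kept = "".join(
        ch for ch in decomposed
        if unicodedata.category(ch)[0] not in ("M", "C", "Z")
    )
    # Remove trailing punctuation that often sticks to a DOI in running text,
    # then lowercase to match the existing lookup behaviour.
    return kept.rstrip(".,;:").lower()


if __name__ == "__main__":
    # The DOI from this issue, with the trailing combining character U+0354.
    print(clean_doi("10.1063/1.1905789\u0354"))
```

This is deliberately aggressive: a DOI that legitimately contains non-ASCII characters would be altered too, so it may be safer to try the raw string first and fall back to the cleaned one only when the lookup fails.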