kermitt2 / biblio-glutton

A high performance bibliographic information service:
Incremental update; blocking + pairwise matching double step; support all known crossref dumps #66

Closed by kermitt2 2 years ago

kermitt2 commented 2 years ago

To be done:

Loading with CrossRef Torrent Academic dump (January 7, 2021, 120M records):

Indexing with CrossRef Torrent Academic dump (January 7, 2021, 120M records):

Current evaluation against 17015 raw references in the PMC 1943 sample set (a CRF model parses the raw references prior to matching):

17015 bibliographical references processed in 1145.593 seconds, 0.06732841610343815 seconds per bibliographical reference.
Found 16699 DOI

precision:      0.9732918138810708
recall:         0.9552159858947987
f-score:        0.9641691878744736

With the BiLSTM-CRF_FEATURES model instead of CRF for parsing the raw references prior to matching:

Found 16752 DOI

precision:      0.9733763132760267
recall:         0.9583308845136644
f-score:        0.9657950069594575

The previous evaluation, with a smaller index (2019, so in principle easier), was:

17015 bibliographical references processed in 2363.978 seconds, 0.13893493975903615 seconds per bibliographical reference.
Found 16462 DOI

precision:      0.9699307496051512
recall:         0.9384072876873347
f-score:        0.953908653702542
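
For readers who want to cross-check the figures above, here is a minimal sketch that recomputes the derived values from the reported ones. It assumes the f-score is the usual harmonic mean of precision and recall (F1) and that the per-reference time is simply total time divided by the number of references; the class and method names are illustrative and not taken from biblio-glutton.

```java
// Sketch only: recomputes derived evaluation figures from the reported numbers.
// MatchingMetrics is a hypothetical helper, not part of biblio-glutton.
final class MatchingMetrics {

    /** Harmonic mean of precision and recall (F1); returns 0 when both are 0. */
    static double fScore(double precision, double recall) {
        double sum = precision + recall;
        if (sum == 0.0) {
            return 0.0;
        }
        return 2.0 * precision * recall / sum;
    }

    /** Average processing time per bibliographical reference, in seconds. */
    static double secondsPerReference(double totalSeconds, int references) {
        if (references <= 0) {
            throw new IllegalArgumentException("references must be positive");
        }
        return totalSeconds / references;
    }

    public static void main(String[] args) {
        // Values copied from the evaluation output in this thread.
        System.out.println("CRF, 2021 index:  " + fScore(0.9732918138810708, 0.9552159858947987));
        System.out.println("BiLSTM-CRF:       " + fScore(0.9733763132760267, 0.9583308845136644));
        System.out.println("CRF, 2019 index:  " + fScore(0.9699307496051512, 0.9384072876873347));
        System.out.println("s/ref, 2021 run:  " + secondsPerReference(1145.593, 17015));
        System.out.println("s/ref, 2019 run:  " + secondsPerReference(2363.978, 17015));
    }
}
```

Under these assumptions the recomputed values should agree with the reported f-scores and per-reference timings up to floating-point rounding.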

lfoppiano commented 2 years ago

I tested the loading and indexing from scratch and it looks very good.

karatekaneen commented 2 years ago

Is there anything I can do here to help move this along? This seems like a really nice update.

karatekaneen commented 2 years ago

Tried out this branch in a Docker container and ran into some issues.


For the incremental update, it tried to run ../indexing, but the path was incorrect: the folder structure in the container now seems to be one level deeper.
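
One way to make such a lookup tolerant of a changed folder depth is to search upwards for the directory instead of relying on a fixed relative path. The following is only a sketch under that assumption; `IndexingDirLocator` and `findUpwards` are hypothetical names, and nothing here reflects how biblio-glutton actually resolves its indexing folder.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Sketch only: hypothetical helper, not part of biblio-glutton.
final class IndexingDirLocator {

    /**
     * Looks for a sub-directory called dirName in start, then in each parent of
     * start, going up at most maxLevels levels.
     */
    static Optional<Path> findUpwards(Path start, String dirName, int maxLevels) {
        Path current = start.toAbsolutePath().normalize();
        for (int level = 0; level <= maxLevels && current != null; level++) {
            Path candidate = current.resolve(dirName);
            if (Files.isDirectory(candidate)) {
                return Optional.of(candidate);
            }
            current = current.getParent(); // null once the filesystem root is passed
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumption: the directory of interest is named "indexing".
        Optional<Path> dir = findUpwards(Paths.get("."), "indexing", 3);
        System.out.println(dir.map(Path::toString).orElse("indexing directory not found"));
    }
}
```

An explicit configuration entry for the path would be an equally reasonable design; the upward search only avoids hard-coding the number of `..` segments.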

Indexing error

When running the incremental update, I got a BUNCH (88m log file) of errors similar to the one below. I'm not sure whether any documents were indexed, because I forgot to check the size before the update ran, but at least the file says it was modified at ~09:33.

ERROR [2022-02-15 09:33:12,403] Cannot store the entry 10.1553/ita-ms-20-02, {"institution":[{"name":"Institut fuer Technologiefolgenabschaetzung der OEAW","acronym":["ITA"],"place":["Vienna, Austria"]}],"publisher-location":"Vienna","reference-count":0,"publisher":"self","content-domain":{"domain":[],"crossmark-restriction":false},"DOI":"10.1553/ita-ms-20-02","type":"report","created":{"date-parts":[[2021,11,15]],"date-time":"2021-11-15T21:52:18Z","timestamp":1637013138000},"source":"Crossref","is-referenced-by-count":0,"title":["COVID-19 - Voices from Academia (ITA-manu:script 21-02)"],"prefix":"10.1553","member":"418","published-online":{"date-parts":[[2021]]},"deposited":{"date-parts":[[2022,2,15]],"date-time":"2022-02-15T04:57:55Z","timestamp":1644901075000},"score":0.0,"editor":[{"given":"Alexander","family":"Reich","sequence":"first","affiliation":[]}],"issued":{"date-parts":[[2021]]},"references-count":0,"URL":"","published":{"date-parts":[[2021]]}}
! org.lmdbjava.Txn$BadException: Transaction must abort, has a child, or is invalid (-30782)
! at org.lmdbjava.ResultCodeMapper.checkRc(
! at org.lmdbjava.Dbi.put(
! at
! at$loadFromFile$0(
! at com.scienceminer.lookup.reader.CrossrefJsonlReader.lambda$load$0(
! at java.util.Iterator.forEachRemaining(
! at java.util.Spliterators$IteratorSpliterator.forEachRemaining(
! at$Head.forEach(
! at com.scienceminer.lookup.reader.CrossrefJsonlReader.load(
! at
! at com.scienceminer.lookup.utils.crossrefclient.IncrementalLoaderTask$
! at java.util.concurrent.Executors$
! at
! at java.util.concurrent.ThreadPoolExecutor.runWorker(
! at java.util.concurrent.ThreadPoolExecutor$
! at
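
One plausible reading of this trace (an assumption, not confirmed in the thread) is that a single long-lived write transaction hit an error at some point and every later `put` on it then failed with the same "Transaction must abort" code. A common way to contain that kind of failure is to write in small batches, abort a failed transaction, and retry its entries one at a time to isolate the bad record. The sketch below illustrates only that pattern: `WriteTxn` is a hypothetical interface that stands in for a storage transaction and is not the lmdbjava API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Sketch only: hypothetical helper showing batch writes with abort-and-isolate.
final class BatchedWriter {

    /** Hypothetical minimal write transaction; NOT the lmdbjava Txn class. */
    interface WriteTxn extends AutoCloseable {
        void put(String key, String value) throws Exception;
        void commit() throws Exception;
        void abort();
        @Override
        void close();
    }

    /** Stores entries in batches and returns the keys that could not be stored. */
    static List<String> store(List<Map.Entry<String, String>> entries,
                              Supplier<WriteTxn> txnFactory, int batchSize) {
        List<String> failedKeys = new ArrayList<>();
        for (int from = 0; from < entries.size(); from += batchSize) {
            int to = Math.min(from + batchSize, entries.size());
            List<Map.Entry<String, String>> batch = entries.subList(from, to);
            if (writeAll(batch, txnFactory)) {
                continue;
            }
            // The batch transaction was aborted: retry entry by entry, each in a
            // fresh transaction, so one bad record cannot poison the others.
            for (Map.Entry<String, String> entry : batch) {
                if (!writeAll(Collections.singletonList(entry), txnFactory)) {
                    failedKeys.add(entry.getKey());
                }
            }
        }
        return failedKeys;
    }

    /** Writes all entries in one transaction; aborts and returns false on any error. */
    private static boolean writeAll(List<Map.Entry<String, String>> batch,
                                    Supplier<WriteTxn> txnFactory) {
        WriteTxn txn = txnFactory.get();
        try {
            for (Map.Entry<String, String> entry : batch) {
                txn.put(entry.getKey(), entry.getValue());
            }
            txn.commit();
            return true;
        } catch (Exception e) {
            txn.abort(); // never reuse a transaction after a failed operation
            return false;
        } finally {
            txn.close();
        }
    }
}
```

With this structure a failing record costs at most one extra pass over its batch, and the list of failed keys can be logged once instead of producing one stack trace per entry.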

Crossref timeout

This issue might be because I didn't enter an API key and got rate-limited, but I'm not sure. In any case, the last entry in the log file was the following:

ERROR [2022-02-15 09:33:44,208] com.scienceminer.lookup.utils.crossrefclient.IncrementalLoaderTask: Crossref update call failed
! java.lang.Exception: The request to Crossref REST API failed: thrown during request execution :  (,cursor=DnF1ZXJ5VGhlbkZldGNoBgAAAAAFfMc7Fmxkam1HbkpnUWxxbWlCMkxwREpMSFEAAAAABNql-xZxd3ptbnlDVlFrS2ltN0l2dW1uWlJBAAAAAALwHO8WTzVoX2ZEVS1SWnE4ZHBtX2VLZ2NNZwAAAAACv5qxFi14RFJYanphVGUyczg3YnAzem5lTXcAAAAAAsqlvxY1N1JUNFlBQVR3eVZLRWYwZnFvMkRRAAAAAAV8xzwWbGRqbUduSmdRbHFtaUIyTHBESkxIUQ==,filter=from-update-date:2022-02-14,rows=1000)
! Read timed out
! at
! at java.util.concurrent.Executors$
! at java.util.concurrent.FutureTask.runAndReset(
! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(
! at java.util.concurrent.ScheduledThreadPoolExecutor$
! at java.util.concurrent.ThreadPoolExecutor.runWorker(
! at java.util.concurrent.ThreadPoolExecutor$
! at
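
Whatever the root cause of the read timeout, a transient failure like this is usually handled by retrying the request with an increasing delay rather than failing the whole update. The following is a minimal, library-agnostic sketch of that idea; `RetryWithBackoff` is a hypothetical helper, and the request itself is passed in as a `Callable`, so nothing here represents the actual Crossref client code in biblio-glutton.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Sketch only: hypothetical retry helper with exponential backoff.
final class RetryWithBackoff {

    /**
     * Runs the request, retrying on IOException (which includes
     * java.net.SocketTimeoutException, the usual "Read timed out").
     * The delay doubles after each failed attempt.
     */
    static <T> T call(Callable<T> request, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay); // wait before the next attempt
                    delay *= 2;
                }
            }
        }
        throw last; // all attempts failed; surface the last I/O error
    }
}
```

A caller would wrap one page request in the `Callable`, for example `RetryWithBackoff.call(() -> fetchPage(cursor), 5, 1000)`, where `fetchPage` is a placeholder for whatever performs the HTTP call. Non-I/O exceptions are deliberately not retried, since they more likely indicate a programming or configuration error.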