kermitt2 / biblio-glutton

A high performance bibliographic information service: https://biblio-glutton.readthedocs.io

Incremental update; blocking + pairwise matching double step; support all known crossref dumps #66

Closed: kermitt2 closed this 2 years ago

kermitt2 commented 2 years ago

To be done:

Loading with CrossRef Torrent Academic dump (January 7, 2021, 120M records):

Indexing with CrossRef Torrent Academic dump (January 7, 2021, 120M records):

Current evaluation against 17015 raw references in the PMC 1943 sample set (using the CRF model to parse the raw references prior to matching):

17015 bibliographical references processed in 1145.593 seconds, 0.06732841610343815 seconds per bibliographical reference.
Found 16699 DOI

======= GLUTTON API ======= 

precision:      0.9732918138810708
recall: 0.9552159858947987
f-score:        0.9641691878744736

With the BiLSTM-CRF_FEATURES model instead of CRF for parsing the raw references prior to matching:

Found 16752 DOI

======= GLUTTON API ======= 

precision:      0.9733763132760267
recall: 0.9583308845136644
f-score:        0.9657950069594575

The previous evaluation, with a smaller index (2019, so in principle easier), gave:

======= GLUTTON API ======= 

17015 bibliographical references processed in 2363.978 seconds, 0.13893493975903615 seconds per bibliographical reference.
Found 16462 DOI

precision:      0.9699307496051512
recall: 0.9384072876873347
f-score:        0.953908653702542
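
As a sanity check on the figures above, here is a minimal sketch of how the reported numbers relate to each other. It is not the project's evaluation script: it assumes the f-score is the usual harmonic mean of precision and recall, and the function names are made up for illustration. The number of correct DOIs is not given in the thread; the sketch infers it from precision times the number of DOIs found.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)


def inferred_correct(precision: float, found: int) -> int:
    """Correct matches implied by precision = correct / found (an inference)."""
    return round(precision * found)


TOTAL_REFERENCES = 17015  # size of the PMC sample set quoted above

# (label, DOIs found, reported precision, reported recall)
runs = [
    ("CRF", 16699, 0.9732918138810708, 0.9552159858947987),
    ("BiLSTM-CRF_FEATURES", 16752, 0.9733763132760267, 0.9583308845136644),
    ("previous (2019 index)", 16462, 0.9699307496051512, 0.9384072876873347),
]

for label, found, precision, recall in runs:
    correct = inferred_correct(precision, found)
    print(
        f"{label}: f-score={f_score(precision, recall):.6f}, "
        f"inferred correct={correct}, "
        f"recall from counts={correct / TOTAL_REFERENCES:.6f}"
    )

# Throughput of the current run: seconds per reference.
print(1145.593 / TOTAL_REFERENCES)
```

Recomputing recall from the inferred count is a quick way to confirm that precision, recall, and the "Found ... DOI" line of a run belong together.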

lfoppiano commented 2 years ago

I tested the loading and indexing from scratch, and it looks very good.

karatekaneen commented 2 years ago

Anything I can do here to help move this along? This seems like a really nice update.

karatekaneen commented 2 years ago

Tried out this branch in a Docker container and ran into some issues.

Path

For the incremental update, it tried to run ../indexing, but that path was not correct: the folder structure in the container seems to be one level deeper now.

Indexing error

When running the incremental update I got a bunch of errors (an 88M log file) similar to the one below. I'm not sure whether any documents were indexed, because I forgot to check the size before the update ran, but at least the file says it was modified at ~09:33.

ERROR [2022-02-15 09:33:12,403] com.scienceminer.lookup.storage.lookup.MetadataLookup: Cannot store the entry 10.1553/ita-ms-20-02, {"institution":[{"name":"Institut fuer Technologiefolgenabschaetzung der OEAW","acronym":["ITA"],"place":["Vienna, Austria"]}],"publisher-location":"Vienna","reference-count":0,"publisher":"self","content-domain":{"domain":[],"crossmark-restriction":false},"DOI":"10.1553/ita-ms-20-02","type":"report","created":{"date-parts":[[2021,11,15]],"date-time":"2021-11-15T21:52:18Z","timestamp":1637013138000},"source":"Crossref","is-referenced-by-count":0,"title":["COVID-19 - Voices from Academia (ITA-manu:script 21-02)"],"prefix":"10.1553","member":"418","published-online":{"date-parts":[[2021]]},"deposited":{"date-parts":[[2022,2,15]],"date-time":"2022-02-15T04:57:55Z","timestamp":1644901075000},"score":0.0,"editor":[{"given":"Alexander","family":"Reich","sequence":"first","affiliation":[]}],"issued":{"date-parts":[[2021]]},"references-count":0,"URL":"http://dx.doi.org/10.1553/ita-ms-20-02","published":{"date-parts":[[2021]]}}
! org.lmdbjava.Txn$BadException: Transaction must abort, has a child, or is invalid (-30782)
! at org.lmdbjava.ResultCodeMapper.checkRc(ResultCodeMapper.java:70)
! at org.lmdbjava.Dbi.put(Dbi.java:411)
! at com.scienceminer.lookup.storage.lookup.MetadataLookup.store(MetadataLookup.java:110)
! at com.scienceminer.lookup.storage.lookup.MetadataLookup.lambda$loadFromFile$0(MetadataLookup.java:95)
! at com.scienceminer.lookup.reader.CrossrefJsonlReader.lambda$load$0(CrossrefJsonlReader.java:39)
! at java.util.Iterator.forEachRemaining(Iterator.java:116)
! at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
! at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
! at com.scienceminer.lookup.reader.CrossrefJsonlReader.load(CrossrefJsonlReader.java:33)
! at com.scienceminer.lookup.storage.lookup.MetadataLookup.loadFromFile(MetadataLookup.java:86)
! at com.scienceminer.lookup.utils.crossrefclient.IncrementalLoaderTask$LoadCrossrefFile.run(IncrementalLoaderTask.java:215)
! at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
! at java.util.concurrent.FutureTask.run(FutureTask.java:266)
! at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
! at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
! at java.lang.Thread.run(Thread.java:748)

Crossref timeout

This might be because I didn't enter an API key and got rate-limited, but I'm not sure. The last entry in the log file was the following:

ERROR [2022-02-15 09:33:44,208] com.scienceminer.lookup.utils.crossrefclient.IncrementalLoaderTask: Crossref update call failed
! java.lang.Exception: The request to Crossref REST API failed: java.net.SocketTimeoutException thrown during request execution :  (,cursor=DnF1ZXJ5VGhlbkZldGNoBgAAAAAFfMc7Fmxkam1HbkpnUWxxbWlCMkxwREpMSFEAAAAABNql-xZxd3ptbnlDVlFrS2ltN0l2dW1uWlJBAAAAAALwHO8WTzVoX2ZEVS1SWnE4ZHBtX2VLZ2NNZwAAAAACv5qxFi14RFJYanphVGUyczg3YnAzem5lTXcAAAAAAsqlvxY1N1JUNFlBQVR3eVZLRWYwZnFvMkRRAAAAAAV8xzwWbGRqbUduSmdRbHFtaUIyTHBESkxIUQ==,filter=from-update-date:2022-02-14,rows=1000)
! Read timed out
! at com.scienceminer.lookup.utils.crossrefclient.IncrementalLoaderTask.run(IncrementalLoaderTask.java:128)
! at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
! at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
! at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
! at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
! at java.lang.Thread.run(Thread.java:748)
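
For illustration only, here is a minimal sketch of how a cursor-based update fetch could tolerate a read timeout like the one above by retrying the same cursor with exponential backoff. This is not how `IncrementalLoaderTask` is implemented, and whether the timeout was caused by rate limiting is not established in this thread. The helper names (`fetch_with_retry`, `iter_updates`, `make_crossref_fetcher`) and the retry parameters are hypothetical. The query parameters mirror the ones in the log (`filter=from-update-date:…`, `rows=1000`, `cursor`), and the sketch assumes the response shape of the Crossref REST API, with `message.items` and `message.next-cursor`.

```python
import json
import socket
import time
import urllib.parse
import urllib.request
from typing import Callable, Iterator


def fetch_with_retry(
    fetch_page: Callable[[str], dict],
    cursor: str,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_page(cursor); on a timeout, wait and retry the same cursor."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_page(cursor)
        except (socket.timeout, TimeoutError):
            if attempt == max_attempts:
                raise  # give up and let the caller log the failure
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    raise AssertionError("unreachable")


def iter_updates(
    fetch_page: Callable[[str], dict], start_cursor: str = "*", **retry_kwargs
) -> Iterator[dict]:
    """Yield records page by page until a page comes back empty."""
    cursor = start_cursor
    while True:
        message = fetch_with_retry(fetch_page, cursor, **retry_kwargs)["message"]
        items = message.get("items", [])
        if not items:
            return
        yield from items
        cursor = message.get("next-cursor")
        if cursor is None:
            return


def make_crossref_fetcher(
    from_update_date: str, rows: int = 1000, timeout: float = 60.0
) -> Callable[[str], dict]:
    """Build a fetch_page function for the Crossref /works endpoint (hypothetical helper)."""

    def fetch_page(cursor: str) -> dict:
        query = urllib.parse.urlencode(
            {
                "filter": f"from-update-date:{from_update_date}",
                "rows": rows,
                "cursor": cursor,
            }
        )
        url = f"https://api.crossref.org/works?{query}"
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)

    return fetch_page
```

A caller would use it roughly as `for record in iter_updates(make_crossref_fetcher("2022-02-14")): ...`. Passing the fetch function in as a parameter keeps the retry logic testable without network access.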