Incremental update; two-step matching (blocking + pairwise); support for all known Crossref dumps

rename the node.js module "matching" to "indexing", because it is indexing and not matching anything :)
migrate indexer and lookup to ElasticSearch 7
support CrossRef dumps as a single .xz file (GreeneLab), gzipped or uncompressed; as a directory of gzip files (Academic Torrents); or as uncompressed JSON files in a tar.gz archive (Metadata Plus). Files can be JSON Lines or a JSON array (indexer and lookup) (#39)
fix potential ES index refresh bug
update to gradle 7
add a nicer ES indexing progress "bar" with colors :)
as a blocking step, we retrieve the n best records via ElasticSearch (a list of MatchingDocument objects; n is currently 4 and is set in the configuration file) (#13)
as a pairwise matching step, we score and rank these n records according to a pairwise distance between the expected and candidate fields (#21)
the best-ranked candidate is returned if its matching score is above a threshold
heavy refactoring/simplification of the search cases, improving overall runtime
make all components (lookup, indexer and pubmed-glutton) use a single YAML config file (config/glutton.yml)
update pubmed-glutton (ES7, gradle 7)
remove withValidation option (we always validate against matching score)
remove some useless endpoints
add a Crossref client
"gap" incremental update using CrossRef REST API (using cursors) (#38, #49)
daily CrossRef update launched by the server at an hour indicated in the config
year can now be passed as additional metadata and is used in the matching distance
mixed matching mode (#12) updated but no longer used, because the full raw reference matching mode is now almost as fast and still much more accurate (96.26 F1-score versus 94.75 for mixed matching; 0.067 s per request versus 0.048 s for mixed matching)
update readme
parsing and conversion of MEDLINE/PubMed records to the Crossref format (slightly extended to accommodate extra information, like MeSH), generating a dump similar to a Crossref snapshot that can be loaded by biblio-glutton
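
The two-step matching described above (blocking, then pairwise ranking with a score threshold) can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names (`PairwiseRankingSketch`, `Candidate`, `bestMatch`), the token-overlap similarity, the year penalty and the threshold value are all hypothetical and are not the actual biblio-glutton implementation. The candidate list stands in for the n records returned by the ElasticSearch blocking query.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these names come from biblio-glutton itself.
class PairwiseRankingSketch {

    /** Minimal stand-in for one record returned by the blocking step. */
    static class Candidate {
        final String doi;
        final String title;
        final String year;

        Candidate(String doi, String title, String year) {
            this.doi = doi;
            this.title = title;
            this.year = year;
        }
    }

    /** Lower-cased word tokens of a string, without empty tokens. */
    static Set<String> tokens(String s) {
        Set<String> result = new HashSet<>(Arrays.asList(s.toLowerCase().split("\\W+")));
        result.remove("");
        return result;
    }

    /** Jaccard token overlap in [0, 1]; a placeholder for the real pairwise distance. */
    static double tokenSimilarity(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() || tb.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }

    /** Score one candidate; the optional year halves the score on mismatch (assumed penalty). */
    static double score(String expectedTitle, String expectedYear, Candidate c) {
        double s = tokenSimilarity(expectedTitle, c.title);
        if (expectedYear != null && c.year != null && !expectedYear.equals(c.year)) {
            s *= 0.5;
        }
        return s;
    }

    /** Returns the best-scoring candidate if it reaches the threshold, otherwise null. */
    static Candidate bestMatch(String expectedTitle, String expectedYear,
                               List<Candidate> candidates, double threshold) {
        Candidate best = null;
        double bestScore = -1.0;
        for (Candidate c : candidates) {
            double s = score(expectedTitle, expectedYear, c);
            if (s > bestScore) {
                bestScore = s;
                best = c;
            }
        }
        return bestScore >= threshold ? best : null;
    }
}
```

In this sketch, a query whose best candidate scores below the threshold yields no match at all, which mirrors the "always validate against the matching score" behaviour listed above.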

To be done:

document the incremental update (move doc to readthedocs)
unpaywall incremental file update (via unpaywall subscription)
review existing matching (which is super basic at the moment)

Loading the CrossRef Academic Torrents dump (January 7, 2021, 120M records):

115,972,356 indexed records
around 4 hours
232GB LMDB index volume

Indexing the CrossRef Academic Torrents dump (January 7, 2021, 120M records):

115,972,356 indexed records
around 6h30 to index (while also working on the same computer), 4797 records/s
25.94GB index volume

Current evaluation against 17015 raw references in the PMC 1943 sample set (CRF model used to parse the raw references prior to matching):

17015 bibliographical references processed in 1145.593 seconds, 0.06732841610343815 seconds per bibliographical reference.
Found 16699 DOI

======= GLUTTON API ======= 

precision:      0.9732918138810708
recall: 0.9552159858947987
f-score:        0.9641691878744736

With BiLSTM-CRF_FEATURES model instead of CRF for parsing the raw references prior to matching:

Found 16752 DOI

======= GLUTTON API ======= 

precision:      0.9733763132760267
recall: 0.9583308845136644
f-score:        0.9657950069594575

The previous evaluation, with a smaller index (2019, so in principle easier), was:

======= GLUTTON API ======= 

17015 bibliographical references processed in 2363.978 seconds, 0.13893493975903615 seconds per bibliographical reference.
Found 16462 DOI

precision:      0.9699307496051512
recall: 0.9384072876873347
f-score:        0.953908653702542

Tried out this branch in a Docker container and ran into some issues.

Path

For the incremental update, it tried to run ../indexing, but the path was not correct: it seems the folder structure in the container is now one level deeper.

Indexing error

When running the incremental update I got a BUNCH (88m log file) of errors similar to the one below. I'm not sure whether any documents were indexed, because I forgot to check the size before the update ran, but at least the file says it was modified at ~09:33.

ERROR [2022-02-15 09:33:12,403] com.scienceminer.lookup.storage.lookup.MetadataLookup: Cannot store the entry 10.1553/ita-ms-20-02, {"institution":[{"name":"Institut fuer Technologiefolgenabschaetzung der OEAW","acronym":["ITA"],"place":["Vienna, Austria"]}],"publisher-location":"Vienna","reference-count":0,"publisher":"self","content-domain":{"domain":[],"crossmark-restriction":false},"DOI":"10.1553/ita-ms-20-02","type":"report","created":{"date-parts":[[2021,11,15]],"date-time":"2021-11-15T21:52:18Z","timestamp":1637013138000},"source":"Crossref","is-referenced-by-count":0,"title":["COVID-19 - Voices from Academia (ITA-manu:script 21-02)"],"prefix":"10.1553","member":"418","published-online":{"date-parts":[[2021]]},"deposited":{"date-parts":[[2022,2,15]],"date-time":"2022-02-15T04:57:55Z","timestamp":1644901075000},"score":0.0,"editor":[{"given":"Alexander","family":"Reich","sequence":"first","affiliation":[]}],"issued":{"date-parts":[[2021]]},"references-count":0,"URL":"http://dx.doi.org/10.1553/ita-ms-20-02","published":{"date-parts":[[2021]]}}
! org.lmdbjava.Txn$BadException: Transaction must abort, has a child, or is invalid (-30782)
! at org.lmdbjava.ResultCodeMapper.checkRc(ResultCodeMapper.java:70)
! at org.lmdbjava.Dbi.put(Dbi.java:411)
! at com.scienceminer.lookup.storage.lookup.MetadataLookup.store(MetadataLookup.java:110)
! at com.scienceminer.lookup.storage.lookup.MetadataLookup.lambda$loadFromFile$0(MetadataLookup.java:95)
! at com.scienceminer.lookup.reader.CrossrefJsonlReader.lambda$load$0(CrossrefJsonlReader.java:39)
! at java.util.Iterator.forEachRemaining(Iterator.java:116)
! at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
! at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
! at com.scienceminer.lookup.reader.CrossrefJsonlReader.load(CrossrefJsonlReader.java:33)
! at com.scienceminer.lookup.storage.lookup.MetadataLookup.loadFromFile(MetadataLookup.java:86)
! at com.scienceminer.lookup.utils.crossrefclient.IncrementalLoaderTask$LoadCrossrefFile.run(IncrementalLoaderTask.java:215)
! at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
! at java.util.concurrent.FutureTask.run(FutureTask.java:266)
! at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
! at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
! at java.lang.Thread.run(Thread.java:748)

Crossref timeout

This issue might be because I didn't enter any API key and got rate-limited, but I'm not sure. The last entry in the log file was the following:

ERROR [2022-02-15 09:33:44,208] com.scienceminer.lookup.utils.crossrefclient.IncrementalLoaderTask: Crossref update call failed
! java.lang.Exception: The request to Crossref REST API failed: java.net.SocketTimeoutException thrown during request execution :  (,cursor=DnF1ZXJ5VGhlbkZldGNoBgAAAAAFfMc7Fmxkam1HbkpnUWxxbWlCMkxwREpMSFEAAAAABNql-xZxd3ptbnlDVlFrS2ltN0l2dW1uWlJBAAAAAALwHO8WTzVoX2ZEVS1SWnE4ZHBtX2VLZ2NNZwAAAAACv5qxFi14RFJYanphVGUyczg3YnAzem5lTXcAAAAAAsqlvxY1N1JUNFlBQVR3eVZLRWYwZnFvMkRRAAAAAAV8xzwWbGRqbUduSmdRbHFtaUIyTHBESkxIUQ==,filter=from-update-date:2022-02-14,rows=1000)
! Read timed out
! at com.scienceminer.lookup.utils.crossrefclient.IncrementalLoaderTask.run(IncrementalLoaderTask.java:128)
! at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
! at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
! at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
! at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
! at java.lang.Thread.run(Thread.java:748)
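
For context, the failing call in the log above is a cursor-paged query against the Crossref REST API `/works` route with a `from-update-date` filter and `rows=1000`. The sketch below only shows how such a URL might be assembled, including the optional `mailto` parameter that Crossref documents for its "polite" pool. The class and method names are hypothetical and this is not the biblio-glutton client code; it also performs no network call, so timeouts and retries are outside its scope.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of biblio-glutton: builds one page URL of a
// cursor-based "gap" update query against the Crossref REST API.
class CrossrefCursorUrlSketch {

    static final String WORKS_ENDPOINT = "https://api.crossref.org/works";

    /**
     * @param fromUpdateDate lower bound for the update date, e.g. "2022-02-14"
     * @param rows           page size (the log above uses 1000)
     * @param cursor         "*" for the first page, then the "next-cursor" value
     *                       returned in the previous response
     * @param mailto         optional contact address for the polite pool, may be null
     */
    static String buildUrl(String fromUpdateDate, int rows, String cursor, String mailto) {
        try {
            StringBuilder url = new StringBuilder(WORKS_ENDPOINT);
            url.append("?filter=from-update-date:").append(fromUpdateDate);
            url.append("&rows=").append(rows);
            // Cursors are opaque strings that may contain '=' or '+', so they are encoded.
            url.append("&cursor=").append(URLEncoder.encode(cursor, "UTF-8"));
            if (mailto != null && !mailto.isEmpty()) {
                url.append("&mailto=").append(URLEncoder.encode(mailto, "UTF-8"));
            }
            return url.toString();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM; rethrow defensively.
            throw new IllegalStateException(e);
        }
    }
}
```

A first page would be requested with `buildUrl("2022-02-14", 1000, "*", null)`, and each following page with the cursor taken from the previous response.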
