Closed by kermitt2 2 years ago
I tested the loading and indexing from scratch and it looks very good.
Anything I can do here to help move this along? Seems like a really nice update.
I tried out this branch in a Docker container and ran into some issues.
For the incremental update, it tried to run ../indexing, but the path was not correct: the folder structure in the container seems to be one level deeper now.
When running the incremental update I got a large number of errors (an 88M log file) similar to the one below. I'm not sure whether any documents were indexed, because I forgot to check the size before the update ran, but at least the file says it was modified at ~09:33.
ERROR [2022-02-15 09:33:12,403] com.scienceminer.lookup.storage.lookup.MetadataLookup: Cannot store the entry 10.1553/ita-ms-20-02, {"institution":[{"name":"Institut fuer Technologiefolgenabschaetzung der OEAW","acronym":["ITA"],"place":["Vienna, Austria"]}],"publisher-location":"Vienna","reference-count":0,"publisher":"self","content-domain":{"domain":[],"crossmark-restriction":false},"DOI":"10.1553/ita-ms-20-02","type":"report","created":{"date-parts":[[2021,11,15]],"date-time":"2021-11-15T21:52:18Z","timestamp":1637013138000},"source":"Crossref","is-referenced-by-count":0,"title":["COVID-19 - Voices from Academia (ITA-manu:script 21-02)"],"prefix":"10.1553","member":"418","published-online":{"date-parts":[[2021]]},"deposited":{"date-parts":[[2022,2,15]],"date-time":"2022-02-15T04:57:55Z","timestamp":1644901075000},"score":0.0,"editor":[{"given":"Alexander","family":"Reich","sequence":"first","affiliation":[]}],"issued":{"date-parts":[[2021]]},"references-count":0,"URL":"http://dx.doi.org/10.1553/ita-ms-20-02","published":{"date-parts":[[2021]]}}
! org.lmdbjava.Txn$BadException: Transaction must abort, has a child, or is invalid (-30782)
! at org.lmdbjava.ResultCodeMapper.checkRc(ResultCodeMapper.java:70)
! at org.lmdbjava.Dbi.put(Dbi.java:411)
! at com.scienceminer.lookup.storage.lookup.MetadataLookup.store(MetadataLookup.java:110)
! at com.scienceminer.lookup.storage.lookup.MetadataLookup.lambda$loadFromFile$0(MetadataLookup.java:95)
! at com.scienceminer.lookup.reader.CrossrefJsonlReader.lambda$load$0(CrossrefJsonlReader.java:39)
! at java.util.Iterator.forEachRemaining(Iterator.java:116)
! at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
! at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
! at com.scienceminer.lookup.reader.CrossrefJsonlReader.load(CrossrefJsonlReader.java:33)
! at com.scienceminer.lookup.storage.lookup.MetadataLookup.loadFromFile(MetadataLookup.java:86)
! at com.scienceminer.lookup.utils.crossrefclient.IncrementalLoaderTask$LoadCrossrefFile.run(IncrementalLoaderTask.java:215)
! at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
! at java.util.concurrent.FutureTask.run(FutureTask.java:266)
! at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
! at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
! at java.lang.Thread.run(Thread.java:748)
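To get a rough idea of how many entries were affected, the log can be scanned for the "Cannot store the entry" message. The following is only a minimal sketch, assuming every failure is logged in the same format as the line above; the class name `FailedEntryScanner` and the log path argument are hypothetical and not part of biblio-glutton.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical helper, not part of biblio-glutton.
public class FailedEntryScanner {
    // Assumes the message format seen in the log above:
    // "... MetadataLookup: Cannot store the entry <DOI>, {json...}"
    private static final Pattern FAILED =
            Pattern.compile("Cannot store the entry (\\S+), \\{");

    // Returns the distinct DOIs that appear in "Cannot store the entry" lines,
    // in order of first appearance.
    public static Set<String> failedDois(Iterable<String> lines) {
        Set<String> dois = new LinkedHashSet<>();
        for (String line : lines) {
            Matcher m = FAILED.matcher(line);
            if (m.find()) {
                dois.add(m.group(1));
            }
        }
        return dois;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to the service log file (an assumption).
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            Set<String> dois = failedDois(lines::iterator);
            System.out.println(dois.size() + " distinct DOIs failed to store");
        }
    }
}
```

This only counts failures reported in the log; it does not tell whether the entries were stored by a later attempt.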
This might be because I didn't enter any API key and got rate-limited, but I'm not sure. The last entry in the log file was the following:
ERROR [2022-02-15 09:33:44,208] com.scienceminer.lookup.utils.crossrefclient.IncrementalLoaderTask: Crossref update call failed
! java.lang.Exception: The request to Crossref REST API failed: java.net.SocketTimeoutException thrown during request execution : (,cursor=DnF1ZXJ5VGhlbkZldGNoBgAAAAAFfMc7Fmxkam1HbkpnUWxxbWlCMkxwREpMSFEAAAAABNql-xZxd3ptbnlDVlFrS2ltN0l2dW1uWlJBAAAAAALwHO8WTzVoX2ZEVS1SWnE4ZHBtX2VLZ2NNZwAAAAACv5qxFi14RFJYanphVGUyczg3YnAzem5lTXcAAAAAAsqlvxY1N1JUNFlBQVR3eVZLRWYwZnFvMkRRAAAAAAV8xzwWbGRqbUduSmdRbHFtaUIyTHBESkxIUQ==,filter=from-update-date:2022-02-14,rows=1000)
! Read timed out
! at com.scienceminer.lookup.utils.crossrefclient.IncrementalLoaderTask.run(IncrementalLoaderTask.java:128)
! at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
! at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
! at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
! at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
! at java.lang.Thread.run(Thread.java:748)
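A read timeout like the one above is often transient, so one option is to retry the request with an increasing delay instead of failing the whole update run. The sketch below is a generic illustration, not how `IncrementalLoaderTask` is implemented: the class `RetryWithBackoff` and its parameters are hypothetical, and it assumes the request throws `SocketTimeoutException` directly (in the log above the client wraps it in a plain `Exception`, so the catch clause would need adapting).

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of biblio-glutton.
public class RetryWithBackoff {
    // Calls the request until it succeeds or maxAttempts is reached.
    // After each timeout the delay doubles (exponential backoff).
    public static <T> T call(Callable<T> request, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return request.call();
            } catch (SocketTimeoutException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the last timeout
                }
                Thread.sleep(delayMs);
                delayMs *= 2;
            }
        }
    }
}
```

Only timeouts are retried here; any other exception is propagated immediately, so a real rate-limit or authentication error would still surface on the first attempt.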
[...] (n=4 currently, and it is defined in the configuration file) (#13)
[...] config/glutton.yml)
[...] withValidation option (we always validate against matching score)
year can now be passed as additional metadata and is used in the matching distance
To be done:
Loading with CrossRef Torrent Academic dump (January 7, 2021, 120M records):
Indexing with CrossRef Torrent Academic dump (January 7, 2021, 120M records):
Current evaluation against 17015 raw references in PMC 1943 sample set (CRF model to parse the raw references prior to matching):
With BiLSTM-CRF_FEATURES model instead of CRF for parsing the raw references prior to matching:
Previous one with smaller index (2019, so in principle easier) was: