Open bnewbold opened 5 years ago
I am currently facing the same issue, although on a less powerful machine. When I started the import into LMDB a couple of days ago it was doing a couple of thousand records per second, which was going to take a long time, but I'm not in that big of a rush and let it run over the weekend.
Now that I've checked on the progress it's doing 25 per second, so it probably won't finish this side of Christmas.
I'm running everything with defaults, in a Docker container on GCP. I started the import via nohup so it would keep running after I disconnected from the shell; I don't know if that affects performance somehow.
Also worth mentioning: I hit the same issue when importing the DOI<->PMID mapping dataset, where the rate also decreased quite rapidly, but since that dataset is smaller it finished fairly quickly anyway.
When experimenting with an import of the fatcat release metadata corpus (about 97 million records; similar in size and scope to the crossref corpus with abstracts and references removed), I found the Java LMDB import slower than expected:
I cut it off soon after, when the rate had dropped further to ~900/second.
This isn't crazy slow (97 million records at ~900/second works out to roughly 30 hours, so it would finish in another day or two), but, for instance, the node/elasticsearch ingest of the same corpus completed pretty quickly on the same machine:
This is a ~30-thread machine with 50 GByte of RAM and a consumer-grade Samsung 2 TByte SSD. I don't seem to have any lmdb libraries installed, so I assume they are vendored in. In my config I have (truncated to the relevant bits):
Wondering what kind of performance others are seeing by the "end" of a full crossref corpus import, and if there is other tuning I should do.
For my particular use-case (fatcat matching) it is tempting to redirect to an HTTP REST API that can handle at least hundreds of requests/sec at a couple ms of latency; that would keep the returned data "fresh" without needing a pipeline to rebuild the LMDB snapshots periodically or continuously. Probably not worth it for most users and most cases. I do think there is a "universal bias" towards the most recently published works though: most people read and process new papers, and new papers tend to cite recent (or forthcoming) papers, so having the matching corpus even a month or two out of date could be sub-optimal. The same "freshness" issue would exist with elasticsearch anyway, though.
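For what it's worth, the client side of such a lookup service is trivial; the sketch below just illustrates the shape of the thing, and the `/match` endpoint, parameter names, and base URL are entirely made up (not a real fatcat or matcher API):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

MATCH_BASE = "https://example.org"  # hypothetical matching service

def match_request_url(title, first_author=None, base=MATCH_BASE):
    # '/match' and these query parameters are invented for illustration.
    params = {"title": title}
    if first_author:
        params["first_author"] = first_author
    return f"{base}/match?{urlencode(params)}"

def match(title, **kw):
    # At hundreds of requests/sec and couple-ms latency you would want a
    # pooled HTTP client; plain urlopen is just the simplest sketch.
    with urlopen(match_request_url(title, **kw)) as resp:
        return json.load(resp)
```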