kermitt2 / biblio-glutton

A high performance bibliographic information service: https://biblio-glutton.readthedocs.io

Slower-than-expected LMDB import; tuning? #36

Open bnewbold opened 5 years ago

bnewbold commented 5 years ago

When experimenting with an import of the fatcat release metadata corpus (about 97 million records, similar in size/scope to the crossref corpus with abstracts and references removed), I found the Java LMDB import slower than expected:

    java -jar lookup/build/libs/lookup-service-1.0-SNAPSHOT-onejar.jar fatcat --input /srv/biblio-glutton/datasets/release_export_expanded.json.gz /srv/biblio-glutton/config/biblio-glutton.yaml

    [...]
    -- Meters ----------------------------------------------------------------------
    fatcatLookup
                 count = 1146817
             mean rate = 8529.29 events/second
         1-minute rate = 8165.53 events/second
         5-minute rate = 6900.17 events/second
        15-minute rate = 6358.60 events/second

    [...] RAN OVER NIGHT

    6/26/19 4:32:11 PM =============================================================

    -- Meters ----------------------------------------------------------------------
    fatcatLookup
                 count = 37252787
             mean rate = 1474.81 events/second
         1-minute rate = 1022.73 events/second
         5-minute rate = 1022.72 events/second
        15-minute rate = 1005.36 events/second
    [...]

I cut it off soon after, when the rate dropped further to ~900/second.

This isn't crazy slow (it would finish in another day or two), but, for instance, the node/elasticsearch ingest of the same corpus completed pretty quickly on the same machine:


    [...]
    Loaded 2131000 records in 296.938 s (8547.008547008547 record/s)
    Loaded 2132000 records in 297.213 s (7142.857142857142 record/s)
    Loaded 2133000 records in 297.219 s (3816.793893129771 record/s)
    Loaded 2134000 records in 297.265 s (12987.012987012988 record/s)
    Loaded 2135000 records in 297.364 s (13513.513513513513 record/s)
    [...]
    Loaded 98076000 records in 22536.231 s (9433.962264150943 record/s)
    Loaded 98077000 records in 22536.495 s (9090.90909090909 record/s)

This is a ~30-thread machine with 50 GB of RAM and a consumer-grade Samsung 2 TB SSD. I don't seem to have any lmdb libraries installed, so I guess they are vendored in. My config has (truncated to the relevant bits):

    storage: /srv/biblio-glutton/data/db
    batchSize: 10000
    maxAcceptedRequests: -1

    server:
      type: custom
      applicationConnectors:
      - type: http
        port: 8080
      adminConnectors:
      - type: http
        port: 8081
      registerDefaultExceptionMappers: false
      maxThreads: 2048
      maxQueuedRequests: 2048
      acceptQueueSize: 2048

Wondering what kind of performance others are seeing by the "end" of a full crossref corpus import, and if there is other tuning I should do.
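
In case it helps narrow things down, here is a rough standalone write microbenchmark sketch using lmdbjava (which I believe is what biblio-glutton uses internally, though I haven't checked the exact key/value encoding it uses). The class name, map size, batch size, and record shape are all placeholders rather than the real import path; the point is just to separate raw batched LMDB write throughput on this disk from JSON parsing and the rest of the pipeline:

    import java.io.File;
    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.util.UUID;

    import org.lmdbjava.Dbi;
    import org.lmdbjava.DbiFlags;
    import org.lmdbjava.Env;
    import org.lmdbjava.Txn;

    // Standalone LMDB write microbenchmark (not biblio-glutton code).
    // Random UUID keys roughly mimic DOI/identifier inserts, which hit the
    // B-tree in random order; sequential keys would look much faster.
    public class LmdbWriteBench {

        public static void main(String[] args) {
            File dir = new File("/tmp/lmdb-bench");
            dir.mkdirs();

            int total = 5_000_000;           // placeholder corpus size
            int batchSize = 10_000;          // records per write transaction
            byte[] payload = new byte[1024]; // ~1 KB dummy record

            Env<ByteBuffer> env = Env.create()
                    .setMapSize(20L * 1024 * 1024 * 1024) // 20 GB map, adjust as needed
                    .setMaxDbs(1)
                    .open(dir);
            Dbi<ByteBuffer> db = env.openDbi("bench", DbiFlags.MDB_CREATE);

            long start = System.currentTimeMillis();
            int written = 0;
            while (written < total) {
                // one LMDB transaction per batch, mirroring a batched import loop
                try (Txn<ByteBuffer> txn = env.txnWrite()) {
                    for (int i = 0; i < batchSize && written < total; i++, written++) {
                        byte[] k = UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8);
                        ByteBuffer key = ByteBuffer.allocateDirect(k.length);
                        key.put(k).flip();
                        ByteBuffer val = ByteBuffer.allocateDirect(payload.length);
                        val.put(payload).flip();
                        db.put(txn, key, val);
                    }
                    txn.commit();
                }
                if (written % 1_000_000 == 0) {
                    double secs = (System.currentTimeMillis() - start) / 1000.0;
                    System.out.printf("%,d records in %.1f s (%.0f rec/s)%n",
                            written, secs, written / secs);
                }
            }
            env.close();
        }
    }

My guess would be that once the B-tree outgrows the page cache, random-ordered key inserts start touching disk on most writes, which would be consistent with the rate decaying as the import progresses rather than staying flat.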

For my particular use-case (fatcat matching) it is tempting to redirect to an HTTP REST API that can handle at least hundreds of requests/sec at a couple of ms latency; this would keep the returned data "fresh" without needing a pipeline to rebuild the LMDB snapshots periodically or continuously. Probably not worth it for most users and most cases. I do think there is a "universal bias" towards recently published works though: most people read and process new papers, and new papers tend to cite recent (or forthcoming) papers, so having the matching corpus even a month or two out of date could be sub-optimal. The same "freshness" issue would exist with elasticsearch anyway, though.
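
For what it's worth, a minimal sketch of that fallback idea, assuming a hypothetical remote lookup endpoint (the URL below is a placeholder, not a real biblio-glutton or fatcat route):

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.time.Duration;
    import java.util.Optional;

    // Sketch only: local-snapshot-first lookup with a live HTTP fallback,
    // so results stay fresh without rebuilding the snapshot.
    public class FreshLookup {

        private final HttpClient http = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();

        public Optional<String> lookupDoi(String doi) throws Exception {
            Optional<String> local = lookupLocal(doi);
            if (local.isPresent()) {
                return local; // snapshot hit: no network round-trip
            }
            String encoded = URLEncoder.encode(doi, StandardCharsets.UTF_8);
            HttpRequest req = HttpRequest.newBuilder(
                    URI.create("https://metadata.example.org/lookup?doi=" + encoded)) // placeholder
                    .timeout(Duration.ofMillis(500))
                    .GET()
                    .build();
            HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
            return resp.statusCode() == 200 ? Optional.of(resp.body()) : Optional.empty();
        }

        private Optional<String> lookupLocal(String doi) {
            // stand-in for the local LMDB snapshot lookup
            return Optional.empty();
        }
    }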

karatekaneen commented 4 years ago

I am currently facing the same issue, although on a less powerful machine. When I started the import into LMDB a couple of days ago it was doing a couple of thousand records per second, which was going to take a long time, but I'm not in that big of a rush and let it run over the weekend.

Now when I check on the progress it's doing 25 per second, so it probably won't finish this side of Christmas. I'm running everything with the defaults in a Docker container on GCP. I ran the import via nohup so it could keep running after I disconnected from the shell; I don't know if that affects performance somehow.

Also worth mentioning: I saw the same issue when importing the DOI<->PMID mapping dataset, where the rate decreased quite rapidly, but due to the smaller size of that dataset it still finished fairly quickly.