kriszyp / lmdb-js

Simple, efficient, ultra-fast, scalable data store wrapper for LMDB

Feature Request: multiGet/multiRead #113

Closed gogoout closed 2 years ago

gogoout commented 2 years ago

Hi, would it be possible to implement a multiGet API in lmdb? The motivation behind this is to minimize the performance hit of crossing the JS/C++ boundary. This feature has recently been implemented in LevelUp, and my tests show it is significantly faster (30%~100%, depending on the size of the keys) than multiple single gets. Some discussion around this can be seen here: https://github.com/Level/leveldown/issues/13
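To illustrate, here is roughly the pattern we'd like to replace (a sketch: getAll and the commented-out multiGet call are hypothetical, just to show the shape of the request):

    const { open } = require('lmdb'); // the lmdb-js package

    const db = open({ path: 'data.mdb' });

    // Today: one synchronous JS/C++ boundary crossing per key.
    function getAll(keys) {
      return keys.map((key) => db.get(key));
    }

    // The request: a single call that crosses the boundary once for the whole
    // batch (hypothetical signature at this point in the discussion):
    // const values = db.multiGet(keys);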

kriszyp commented 2 years ago

Yes, that is something worth considering. However, some context and considerations: there are actually several parts to the JS/C++ boundary which can be (and often are) expensive, and which might be of interest to enumerate.

First, I'd be curious what kind of performance Levelup/down actually achieves with this type of operation. At least in my tests of Leveldown, the predominant cost of a get/read (assuming it hits cached data) is enqueuing/executing an async event/callback for the response, which is about 10,000ns (these are all rough estimates, in nanoseconds for consistency). This doesn't exist at all in lmdb-js because it runs synchronously, and (cached) reads already seem to run at least 10x faster because of this.

With that out of the way, the other most significant expense that can be incurred is instantiation/allocation of a buffer (Buffer + ArrayBuffer, where you would naturally put the read result), which I believe is often roughly 500-1,000ns. lmdb-js actually goes to great lengths to avoid instantiating new buffers during reads, and the most significant performance optimizations in lmdb-js have to do with reusing buffers (a static buffer is used to hold the keys, and a static buffer is used to hold the read result, which is only recreated and expanded if a value is larger than the current static buffer size). My biggest concern with implementing a multi-get is that it could complicate the techniques for reusing buffers, and anything that triggers new buffer instantiations would probably be slower.

If you are looking for faster gets (and minimizing the JS/C++ boundary), I think there are actually bigger gains to be had by using the V8 fast-call API builds of lmdb-js (described in the docs; I think Node v16 might now have a stable enough fast-call API that I could start prebuilding it). Anyway, here is a list of rough estimates of some of the costs in a get call, off the top of my head, based on my experience:

Anyway, I will look into this, and indeed this may be useful for some performance gains, but just putting this into context: in my experience, with all the potential costs involved in real-world apps, the actual JS/C++ calls do not tend to be a significant expense, and probably the best way to reduce that expense is using fast-call APIs.
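To make the buffer-reuse point concrete, here is a rough sketch using the binary accessors (the comments just restate the estimates above; they aren't new measurements):

    const { open } = require('lmdb');
    const db = open({ path: 'data.mdb' });

    // getBinaryFast returns a shared, reused buffer: no new allocation, but the
    // contents are only valid until the next read, so copy it if you keep it.
    const shared = db.getBinaryFast('some-key');
    const retained = Buffer.from(shared); // explicit copy when retention is needed

    // getBinary returns a freshly allocated buffer: safe to hold onto, but it
    // pays the roughly 500-1,000ns allocation cost mentioned above on every read.
    const owned = db.getBinary('some-key');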

gogoout commented 2 years ago

Hi, thanks for the detailed response. I'd like to share some use cases for this API. We currently open multiple LMDB databases at the same time; some of the tables are really large (we have one with >2000M rows, and several others sit at several hundred million), and some values are also very large. Those factors combined make even LMDB struggle. Here are some log lines I found in production; this happens a lot:

dataloader loads 10211 entities takes 4408ms, avg: 0.43169131328958965ms
dataloader loads 6707 entities takes 1308ms, avg: 0.19502012822424333ms
dataloader loads 39266 entities takes 10011ms, avg: 0.2549533947944787ms

However, when running on a smaller set, this usually can't be observed. I'm not yet 100% sure whether this is purely LMDB performance or whether GC is involved, but the duration increases linearly with the size of the data (both in row count and value size). With this kind of operation, synchronous gets block the main thread and prevent it from doing other work.
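For context, the loader behind those log lines is shaped roughly like this (a simplified sketch, not our exact code):

    const DataLoader = require('dataloader');
    const { open } = require('lmdb');

    const db = open({ path: 'data.mdb' });

    // Each batch still performs one synchronous db.get per key on the main
    // thread, so a 10,000-key batch is 10,000 reads, plus whatever page faults
    // those reads trigger.
    const loader = new DataLoader(async (keys) => keys.map((key) => db.get(key)));

    // usage: const entity = await loader.load(someId);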

Also, I think with a multiGet API, if the user passes in a key array that is long enough, the cost of queuing/executing the event becomes negligible, since it only happens once.

I think I'll run some benchmarks on our dataset so we can get more reliable numbers on the performance.

Anyway, thanks for looking into this.

kriszyp commented 2 years ago

My bet would be that your reads are primarily slowed by (hard) page faults. You could probably verify whether this is the main issue by checking if it is accompanied by low CPU usage and high disk usage. Also, are you running on an SSD or a mechanical HD?

I have actually been running some similar types of tests to measure performance with a large number of expected hard faults. In my tests, I have been running a 12GB DB on a machine with 4GB of RAM and a mechanical HD (yeah, a pretty pathetic, old computer, but good for assessing performance in a constrained situation). In this case, where the DB is much larger than available RAM, random reads naturally trigger about one page fault per read, and as expected, reads are similar to the HD seek time (about 10ms, so about 80-100 reads per second). This is, of course, vastly slower (by over 1000x) than performance on a small DB that can be completely cached. And also as expected, there is very little CPU usage in these situations, and with a mechanical HD, I can hear the drive working hard. So with DBs larger than RAM, with highly random/unpredictable access, reads really are dominated by disk I/O and page faults, and the JS/C++ calls are probably inconsequential.

So what to do about this? First, this pattern is a pretty standard/common bottleneck for any database system: if reads can't be cached, disk I/O starts becoming a heavy constraint. However, there is more we can do. Simply doing sequential reads and synchronously blocking the main thread when there is a lot of hard-page-fault-inducing disk I/O is not optimal. This is actually where asynchronicity could be very helpful, like you said (I have already mentioned the high overhead of event queuing, but you can see from the estimates above that page faults are still much more expensive). With asynchronous reads, you can avoid blocking the main thread, and also potentially have multiple reads enqueued at once, which can facilitate better queueing/load on the HD.

And having seen this type of pattern in my testing, I have actually already started on a mechanism for asynchronously fetching data. What I am working on is slightly different from a straight-up asynchronous multi-get: instead, it does a prefetch of multiple keys and then goes back to the main thread to get and deserialize them. The advantage of this approach is that it avoids any mallocs/memory allocations and will hopefully be a bit more efficient that way.

Anyway, if you think this may be beneficial for your situation (and based on what I see, I think it might be), I will try to get the prefetch-for-multiple-gets into the 2.1 release, since it is already half completed.
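From the application side, the flow would look roughly like this (a sketch; loadMany is just an illustrative name, and this assumes prefetch accepts an array of keys and returns a promise):

    // Prefetch touches the pages for all keys off the main thread, without
    // allocating buffers for the values; the synchronous gets afterward should
    // then hit cached pages and keep reusing the static read buffer.
    async function loadMany(db, keys) {
      await db.prefetch(keys);
      return keys.map((key) => db.get(key));
    }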

gogoout commented 2 years ago

Thanks for your explanation, it makes sense. Our environment is in the cloud, using GCP's SSD persistent disks (not local disks), and the disk throughput is limited by both the disk size and the CPU the node claims (which ends up at about ~100MB/s). And wow, the prefetch API should work nicely; I think as long as the main thread doesn't have to wait for disk I/O, it should be a great improvement for our situation. Thanks again. Looking forward to the 2.1 release.

kriszyp commented 2 years ago

One other bit of trivia with this: one of the reasons I was actually doing some of this testing was that I was curious whether changing the default page size and/or the read-ahead flag had any impact on performance in large-DB, disk-I/O-bound situations. My conclusion was that for general-purpose use, where data/values are unpredictable in size and most access is random, the default 4KB page size is best, as it provides the most efficient in-memory representation of the B-trees and the best performance. However, in situations where entries are fairly similar in size and there is a lot of sequential access (through getRange), a larger page size can be beneficial, and there is a clear reason for this: if you are accessing a range of six 1KB entries, those would be spread across two pages in a 4KB-page DB, but would fit in a single page in an 8KB-page DB, requiring only a single disk access rather than potentially two disk reads. And I did indeed see improved getRange performance (fewer page faults) with larger page sizes in that situation. However, be aware that you can't change the page size of an existing database; it must be created from scratch with the page size that it will use. And in most situations 4KB seemed best.

Another option to look at is the noReadAhead flag. If your access is primarily random (mostly gets, few getRanges), and most of your entries are less than 4KB in size (so they can fit in single pages), this can potentially improve performance, since read-ahead mostly loads wasted data in these situations. This can be changed without changes to the database itself, so it is easier to test or transition. Also note that this setting shouldn't be used in conjunction with a larger page size (since it defeats the purpose of loading larger pages in a single read).
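For reference, both settings are open() options (a minimal sketch; the database paths are placeholders, and pageSize only applies to a database created from scratch):

    const { open } = require('lmdb');

    const rangeHeavyDb = open({
      path: 'range-heavy.mdb',
      pageSize: 8192, // larger pages: fewer disk reads for getRange-heavy access
    });

    const randomAccessDb = open({
      path: 'random-access.mdb',
      noReadAhead: true, // skip OS read-ahead when access is mostly random gets
      // (per the note above, don't combine this with a larger pageSize)
    });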

kriszyp commented 2 years ago

The prefetch method is available in beta 3 to try out.

gogoout commented 2 years ago

Hi, thanks for the insight. I think I'll run some benchmarks to find out how those flags affect performance. I'll give beta 3 a try and let you know.

gogoout commented 2 years ago

Regarding the prefetch API, does it work with dupSort as well?

kriszyp commented 2 years ago

Hmm, that's a good question. Right now, it does a get with the provided key, which retrieves the first value for the given key in a dupSort database. Often all the values for a key will be in the same page, but that is certainly not guaranteed, especially if a key has a lot of values (or they are bigger). For a dupSort database, it may be preferable for a prefetch to access all the values for a key.

I did also intend to eventually add support for prefetching a range (as a precursor to a getRange call).
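For a dupSort database, the usage I have in mind is roughly this (a hypothetical sketch; note the caveat above that the current prefetch only guarantees the page holding the first value per key is loaded):

    const { open } = require('lmdb');

    const db = open({ path: 'dup.mdb', dupSort: true });

    // Prefetch by key, then read all duplicate values synchronously.
    async function loadAllValues(keys) {
      await db.prefetch(keys);
      return keys.map((key) => Array.from(db.getValues(key)));
    }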

gogoout commented 2 years ago

Yeah, that's what I'm guessing as well.

gogoout commented 2 years ago

I saw some errors in the log after using the getMany API, but I'm not sure whether this is caused by the new API or I've somehow got bad data in the DB. I haven't encountered this before:

Error: Failed to decompress data
    at LMDBStore.getBinaryFast (/app/node_modules/lmdb/dist/index.cjs:1277:26)
    at LMDBStore.get (/app/node_modules/lmdb/dist/index.cjs:1321:22)
    at /app/node_modules/lmdb/dist/index.cjs:1613:23
    at /app/node_modules/lmdb/dist/index.cjs:1647:6

This seems to be reproducible: the app crashes at the same place, restarts, and crashes again. I think this might have been caused by a shutdown or something.

kriszyp commented 2 years ago

I am not sure how the prefetch API would cause this; it seems more likely that it is bad data in the DB. Is it reproducible in that accessing the same entry (with possibly corrupted compressed data) consistently returns this error? Or is there a reproducible way to put data into the database that leads to this? Possibly with an abrupt shutdown? (Is it possibly related to overlappingSync?)

gogoout commented 2 years ago

It is reproducible in that accessing the same entry (with possibly corrupted compressed data) consistently returns this error. We're using overlappingSync, so it's possible that's causing it. For now it's not a big deal; I just wrap the get call in a try/catch and return null instead.
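In other words, something like this (a sketch of the workaround; safeGet is just our wrapper name):

    // Treat an entry that fails to decompress as missing instead of crashing.
    function safeGet(db, key) {
      try {
        return db.get(key);
      } catch (err) {
        // e.g. "Error: Failed to decompress data" on a corrupted compressed entry
        return null;
      }
    }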

kriszyp commented 2 years ago

If you ever get any more clues about steps/setup that lead to a corrupted compressed entry, certainly let me know. We use compression heavily in our application, but I don't ever see this issue. If it can occur, though, I'd certainly like to know how.

gogoout commented 2 years ago

OK, I think this is probably due to an unexpected shutdown of the server. I've now set a graceful shutdown period for the server and will let you know if it reproduces.

kriszyp commented 2 years ago

Corruption due to an unexpected shutdown would still point to a bug; the whole shadow-paging sync sequence is specifically designed to prevent any bad data from being saved during an unexpected shutdown.

gogoout commented 2 years ago

I think this sort of bug can be caught with property-based testing (https://www.infoq.com/presentations/testing-techniques-case-study/). It seems pretty hard to reproduce. For now I'm just going to open a new issue to track this: https://github.com/DoctorEvidence/lmdb-js/issues/116