kriszyp / lmdb-js

Simple, efficient, ultra-fast, scalable data store wrapper for LMDB

Question on write performance to one table #65

Closed kylebernhardy closed 3 years ago

kylebernhardy commented 3 years ago

Hi,

In our stress testing we typically write to just one table to see the maximum performance we can get per second. Our test writes to 11 dbis in the table, and all of the puts are fairly small. We are running our code on a fairly large machine with 64 cores and 256 GB of RAM on NVMe drives, and we are fully utilizing the cores. When running our stress test we max out at 10k txns/sec (1 txn = 11 puts) with around 4k concurrent clients. We are seeing almost no load on the cores, typically around 6%. If we lower the number of processes we get higher transaction rates: going from 64 processes to 30 gives us ~12k txns/s, and going all the way down to 6 processes gives us ~14k txns/s. Performance also goes up with fewer puts per txn (10 processes with 5 puts/txn gives 24k txns/sec, as opposed to 13k txns/s with 11 puts). We also see a dramatic improvement in performance if we randomly write to more tables: 10 tables gives us ~65k txns/sec (this is with 64 processes). This led me to think that maybe the mutex locking is causing the degradation in performance, so we increased the commitDelay, but saw no change in performance.

Each txn is executed via transactionAsync. We are using Node 14 & lmdb-store v1.5.5.
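To make this concrete, each transaction in the test is shaped roughly like this (a simplified sketch with made-up dbi names, not our actual code):

```js
const { open } = require('lmdb-store');

// Simplified sketch of our setup: one env ("table") with a versioned primary
// dbi and several dupsort secondary-index dbis. Names are illustrative.
const env = open({ path: './data/stress-table', maxDbs: 16 });
const primary = env.openDB('primary', { useVersions: true });
const indexNames = ['attr1', 'attr2', 'attr3']; // simplified; the real test touches 11 dbis per txn
const indices = indexNames.map(name => env.openDB(name, { dupSort: true }));

// One "txn" in the test: a handful of small puts committed atomically
async function writeRecord(id, record, version) {
  await env.transactionAsync(() => {
    primary.put(id, record, version);
    indices.forEach((index, i) => index.put(record[indexNames[i]], id));
  });
}
```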

I was wondering if you have any insight as to how we could optimize to get better utilization.

I can share our server code if that would be helpful.

Thanks so much for all of your hard work,

Kyle

kriszyp commented 3 years ago

With LMDB, the write mutex is exclusive: there can only be one writer to a table/env at once, so that definitely limits concurrency scaling. However, the general idea of lmdb-store is that when the mutex is delaying transactions, that should cause more writes to be batched per transaction, resulting in higher write efficiency.

So in your stress testing, are you (a)waiting for a transaction to complete before starting the next transaction? However, if these transactions are all being independently triggered from HTTP clients, then maybe that isn't the case.
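Just to illustrate the distinction I'm asking about, here is a schematic example (not your code, just a sketch against the lmdb-store API):

```js
const { open } = require('lmdb-store');
const db = open({ path: './bench-db' });

// Serial: each commit is awaited before the next transaction starts,
// so there is never more than one pending txn block available to batch
async function writeSerially(records) {
  for (const [key, value] of records) {
    await db.transactionAsync(() => db.put(key, value));
  }
}

// Concurrent: many transactionAsync calls are outstanding at once, so their
// callbacks can be coalesced into far fewer (and more efficient) LMDB commits
async function writeConcurrently(records) {
  await Promise.all(records.map(([key, value]) =>
    db.transactionAsync(() => db.put(key, value))
  ));
}
```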

Anyway, I'd be happy to take a look at the code and see if I spot any issues.

kriszyp commented 3 years ago

I took another look at this with a profiler on lmdb-store's benchmarks, and I think that perhaps one of the biggest issues is that doing writes using the asynchronous transaction approach is fundamentally much less performant, especially in this type of situation, than individual asynchronous operations. The reason is that the transaction callback code actually executes inside the write transaction (which is the point of the async txns, and again, only a single writer can be executing at once). That means any transaction callback code is a bottleneck: while it runs, no other transaction callbacks for the same table can execute. With my benchmarks, the async transaction profiling looks something like:

And, as you probably noticed, writing to separate tables means separate write locks, so different tables can all do their write operations in parallel (which definitely can influence database design).

That being said, it looks like there is some room for improvement in how lmdb-store handles the put operations within an async transaction (I see some inefficiencies with key handling and such).

And I would still be curious what your transactions look like and what kind of keys and values you are writing.

kylebernhardy commented 3 years ago

Hi Kris,

I am awaiting each async transaction block. When I remove the await, the transaction rate goes to 119k/s; however, with that setup I cannot honestly ack back a success/fail for each request. I could implement my own internal timer to batch my async transactions. Since I do have a centralized write module it would be straightforward, but any guidance you could provide would be helpful.
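Something along these lines is what I had in mind for the write module (a rough sketch with illustrative names, not what we actually run):

```js
const { open } = require('lmdb-store');
const db = open({ path: './data' });

// Queue incoming writes and flush them in a single transactionAsync every few
// milliseconds, resolving each caller once the shared commit completes.
let queue = [];
let flushTimer = null;

function queueWrite(key, value) {
  return new Promise((resolve, reject) => {
    queue.push({ key, value, resolve, reject });
    if (!flushTimer) flushTimer = setTimeout(flush, 5); // ~5ms batch window
  });
}

async function flush() {
  const batch = queue;
  queue = [];
  flushTimer = null;
  try {
    await db.transactionAsync(() => {
      for (const { key, value } of batch) db.put(key, value);
    });
    batch.forEach(entry => entry.resolve(true));   // ack success per request
  } catch (error) {
    batch.forEach(entry => entry.reject(error));   // ack failure per request
  }
}
```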

I am putting together a simple server example that demonstrates our transactions without the many layers we have overlaid on the write path. At a high level, we have an HTTP handler that processes each individual request, and we ack back with a response indicating success or failure.

Our stores all use the default storage encoding of msgpack. Our primary key store has versioning enabled and the secondary indices all have dupsort enabled. This will be demonstrated in the sample code I will follow up with shortly.

kylebernhardy commented 3 years ago

Here is the sample server code: https://gist.github.com/kylebernhardy/c8ca325452a9fb47114f5db2725036eb

kriszyp commented 3 years ago

I am awaiting each async transaction block. When I remove the await

No, you don't want to remove the await; you are definitely doing it correctly. I was just curious whether waiting for the transaction to complete was blocking the next transactionAsync, and at least from the server perspective, this setup should work fine for allowing multiple async txn blocks to batch. I presume each "client" is waiting for its last request to finish before issuing the next request, but with 4k clients you should be getting at least a decent amount of concurrency...

When I remove the await the transaction rate goes to 119k/s

I assume this is from the client perspective, or is it measured on the server? If from the server, it seems like this means you can get a higher transaction rate by pressuring it with more requests. Can you also get a higher transaction rate (and still keep the await) by increasing the number of concurrent clients (and thereby increasing the number of blocks that get batched)? I am curious if 40k concurrent clients do better?

Here is the sample server code: https://gist.github.com/kylebernhardy/c8ca325452a9fb47114f5db2725036eb

So in your example, I would assume you would get much higher performance (due to more streamlined write transactions) by using plain async put operations rather than putting them in a transactionAsync. If you did that, you would still be guaranteed atomicity of all the sequential puts being executed in the same transaction (by virtue of the fact that they are in the same event turn). However, I am guessing that the reason you are not doing that is that, in real life, HarperDB supports user-crafted transactions that may have arbitrarily complex writes based on reads in the same transaction (which is indeed the main use case for transactionAsync)?
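In rough code (using made-up dbi names, not your gist), the difference would be something like:

```js
const { open } = require('lmdb-store');
const env = open({ path: './data', maxDbs: 16 });
const primary = env.openDB('primary', { useVersions: true });
const byName = env.openDB('by-name', { dupSort: true });

// transactionAsync version: the callback runs inside the exclusive write txn
async function writeWithTxnBlock(id, record, version) {
  await env.transactionAsync(() => {
    primary.put(id, record, version);
    byName.put(record.name, id);
  });
}

// plain-put version: both puts are queued in the same event turn, so they land
// in the same batched write transaction (still atomic together), and no user
// code has to run while the write lock is held
async function writeWithPlainPuts(id, record, version) {
  await Promise.all([
    primary.put(id, record, version),
    byName.put(record.name, id),
  ]);
}
```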

Anyway, I'm just looking for some easy opportunities for optimization. I think some of the work I've been doing on optimizing native calls will help too, although I wouldn't expect massive gains from that; there are still some fundamental bottlenecks involved in running arbitrary JS during the write txn lock.

kylebernhardy commented 3 years ago

For the load test I'm using k6, and each virtual user (VU) will not fire another request until its current one is responded to.

The 119k/s is a red herring: in further analysis I found I was creating a promise "storm" that took a while to fully resolve in the background, with no awaits acting as a bottleneck for the requests. Adding more VUs beyond the 4000, I see the transaction rate stay roughly the same and latency increase. I also tried increasing the commitDelay to 100 and then throwing more VUs at the server; the transaction rate goes up to around 11k/s with 4000 VUs, and again with more VUs we start seeing the degradation in latency.
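(For reference, that's just the commitDelay option passed to open; the path here is illustrative:)

```js
const { open } = require('lmdb-store');

// opened with a longer commit delay so more write requests can pile into
// each transaction before it commits
const env = open({
  path: './data/stress-table',
  commitDelay: 100, // the value we tested
});
```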

You are correct: if I just do standard puts my transaction rate goes up to around 13k/s, but having transaction blocks is needed for our use case.

It is interesting that we don't see any performance boost by increasing commitDelay and adding more VUs. We are running on Ubuntu 20.04 on an ext4 file system. Would you see any issues with that setup?

Thank you again for all of your insight.

kylebernhardy commented 3 years ago

I'm building a multi-process data loader with my payload simulation to see what the performance is. I'll share my results & code with you after I get it running.

kriszyp commented 3 years ago

Here is the gist (derived from yours) that I was testing with: https://gist.github.com/kriszyp/0344b597acb4a483cfe414e17797fb6b My observation: I was testing on my cheap, relatively slow Windows PC and was getting slower results, about 4k txn/sec. I did some profiling at the C level (in Visual Studio), and it is heavily bound by LMDB operations, specifically LMDB's WriteFile calls, so it actually had very little to do with JS performance; it was almost entirely LMDB and I/O (or at least OS I/O calls). It sounds like you are getting similar speeds with transactionAsync and plain puts, so I am guessing LMDB is the bottleneck for you too. It might be interesting for you to try some different combinations of useWritemap and noSync just to see how much is actual waiting for I/O completion and how much is OS workload.
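For example, something like this (purely a diagnostic configuration, not something I'd suggest for production):

```js
const { open } = require('lmdb-store');

// Same store, but with durability loosened to separate "waiting on I/O
// completion" from "OS write-call overhead"
const env = open({
  path: './data/stress-table',
  useWritemap: true, // write through the OS memory map instead of explicit write calls
  noSync: true,      // skip syncing on commit; the OS flushes whenever it gets around to it
});
```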

I think it is also fairly dependent on the size of the DB; smaller DBs (like I often use for benchmarking) seem to perform better (probably more caching, less scattered writes).

kylebernhardy commented 3 years ago

Thank you Kris,

This is very helpful. TBH, 10k/s on a single table is pretty fast, and when we load across multiple tables the scaling is even better. Overall I wanted to see if I was doing something incorrect or if there is a way to optimize; this is ultimately an issue of hot spotting, which is common. I will continue to trial different settings and configurations and let you know what I find. How dangerous is useWritemap to use in production? It feels like it's bad, but I wanted to get your opinion.

kriszyp commented 3 years ago

I don't like using useWritemap, as I have seen data corruption caused by it (from a pointer), it has a bunch of complications (the latest version doesn't work on Windows, and it doesn't work with child transactions), and it doesn't usually seem that much faster. However, I was just curious if it made a difference in your tests, as it might provide more insight into the bottlenecks and whether this is more an issue of system API calls or truly physical I/O.

kriszyp commented 3 years ago

And yeah, in one respect it might be a good thing if LMDB commit I/O is the bottleneck, since with transactionAsync that all happens off-thread (off the main JS thread), so your JS is free to do other stuff (reads and interacting with other tables).
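i.e., roughly this shape (the table names are just for illustration):

```js
const { open } = require('lmdb-store');
const env = open({ path: './data', maxDbs: 16 });
const orders = env.openDB('orders');
const customers = env.openDB('customers');

async function handleRequest(id, order) {
  // the commit work happens off the main JS thread
  const committed = env.transactionAsync(() => {
    orders.put(id, order);
  });
  // meanwhile the main thread is free to do reads against other tables
  const customer = customers.get(order.customerId);
  await committed; // only ack the client once the write has actually committed
  return { ok: true, customer };
}
```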

kylebernhardy commented 3 years ago

Thanks for all of the help, Kris. The off-main-thread writes have been huge for us. All the testing I have done that I deem safe produces similar results. Knowing that multiple tables allow us to scale will drive data modeling decisions for implementations.

kriszyp commented 3 years ago

For posterity's sake, a couple more data points. I did test this on my closest Linux box with useWritemap and noSync (basically no explicit I/O calls are made, lowest data integrity, data is just written whenever the OS gets around to it) and got about 66k txns/s, which is over 5x as fast. That is pretty much to be expected when you don't have to wait for I/O (and it verifies that this is pretty I/O bound). I also ran the tests, as is, with full default sync options; this same Linux box has an ol' SATA HDD, and I got about 600 txn/s (20x slower, not even in the thousands!). When you are I/O bound, waiting for mechanical platters to write stuff can get pretty slow :).