
`StoreUpdate::commit` can take a long time #10972

Open bowenwang1996 opened 5 months ago

bowenwang1996 commented 5 months ago

See https://github.com/near/nearcore/issues/10971#issuecomment-2041903107 for an example. This is quite interesting because https://github.com/near/nearcore/blob/2c13078e1e48d6c73f9fdba7ffd78d06e84732cf/core/store/src/lib.rs#L622 is literally the only thing it does.

bowenwang1996 commented 5 months ago

@nagisa do you mind looking into this one? cc @akhi3030

nagisa commented 5 months ago

Observation: while this can take a long time, it taking a long time does not look like a particularly frequent occurrence. Over the past 6 hours for the one node that is tracing StoreUpdate::commit (I imagine it's the one that you've spun up), we've had fewer than 15 of these events. The plot below has a 120ms cutoff applied to filter out all the fast commits. It shows a strong bimodal distribution.

A 6-hour point-cloud plot showing the durations of `StoreUpdate::commit`. The graph is quite sparse; even with the 120ms cutoff applied, we only see traces between 840ms and 1.1s, which suggests a strongly bimodal distribution.

If we take a 12-hour view, there are more instances, and the long-commit events cluster around a certain time, which may be an epoch boundary or a node restart? (How do we tell?)

A 12-hour point-cloud plot showing the durations of `StoreUpdate::commit`. The graph is still quite sparse, but around 7:30am there is a visually tighter clustering of observations in the 750ms to 1.1s range.

In that sense I agree that it might lead to a chunk being missed, but for all intents and purposes it may well be that StoreUpdate::commit is simply handed one large update to apply from some source, and it might not be possible to optimize StoreUpdate::commit specifically to go vroom.

bowenwang1996 commented 5 months ago

> If we take a 12-hour view, there are more instances, and the long-commit events cluster around a certain time, which may be an epoch boundary or a node restart? (How do we tell?)

For epoch changes we can overlay it with near_epoch_height to see whether there is a correlation.

> In that sense I agree that it might lead to a chunk being missed, but for all intents and purposes it may well be that StoreUpdate::commit is simply handed one large update to apply from some source, and it might not be possible to optimize StoreUpdate::commit specifically to go vroom.

While 1s is very rare, I think it routinely goes above 50ms. Maybe that is not trivial given the amount of data we write, but I thought RocksDB does the actual writes in the background.

nagisa commented 5 months ago

> While 1s is very rare, I think it routinely goes above 50ms. Maybe that is not trivial given the amount of data we write, but I thought RocksDB does the actual writes in the background.

Looking back at the profiles we did a while back when investigating the performance issues with the contract runtime, immediate writes appear, to my untrained eye, to happen in the initiating thread to some extent. There's a write to the Write-Ahead Log (WAL) which goes all the way down to touching the filesystem through the kernel from the main thread, and then some management of a write batch (which is, I would imagine, what gets flushed into the WAL periodically).

We do indeed have a RocksDB thread, but that is chiefly working on compaction and compression, which as far as I understand is the fancier "write" to at-rest storage: the one that ends up producing the SST files with all the nice properties of sorted data and such.

So if the mental model of RocksDB is roughly that its APIs just shove stuff into a channel for something else to process, that is probably an oversimplification of what it actually does. Could we do something like this ourselves? Probably not, as we wouldn't be guaranteed to have the data in our store be up to date when the operations in the next block/chunk/etc. check on it.
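For illustration, here's a minimal sketch of that write path from the caller's perspective, using the plain `rocksdb` Rust crate rather than nearcore's actual Store wrapper (paths and key names are illustrative): the `write_opt` call only returns once the WAL append and memtable insert have happened on the calling thread, while the background threads only pick up flush/compaction work.

```rust
// Sketch only: plain `rocksdb` crate, not nearcore's Store wrapper.
use rocksdb::{Options, WriteBatch, WriteOptions, DB};

fn main() -> Result<(), rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, "/tmp/example-db")?;

    let mut batch = WriteBatch::default();
    // A tiny write, e.g. a 16-byte BlockMisc-style value (key name borrowed
    // from later comments in this thread, purely illustrative here).
    batch.put(b"BlockMisc:LATEST_KNOWN", b"0123456789abcdef");

    let mut write_opts = WriteOptions::default();
    write_opts.set_sync(false); // no fsync, but the WAL append still happens here

    // This call blocks the *calling* thread until the batch has been appended
    // to the WAL and applied to the memtable; only flush/compaction to SST
    // files happens on RocksDB's background threads.
    db.write_opt(batch, &write_opts)?;
    Ok(())
}
```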

Not all is lost, though. We could implement some sort of pipelining for our StoreUpdates, as described in this message and the ones above. That would also help more generally, but it is a non-trivial amount of work. It would not work for network messages, however, as those AFAICT are not part of the usual chunk-to-chunk processing flow. If we wanted to integrate that... well.


One thing that I'm still interested in checking is the exact makeup of the DBTransactions that are slow vs. not-quite-slow vs. very fast. Is it entirely a function of the number of distinct DBTransactions? Is there some specific type of DBTransaction that makes things slow? But even once we have that data, my hopes of a quick resolution for this particular issue are minimal.

nagisa commented 5 months ago

> For epoch changes we can overlay it with near_epoch_height to see whether there is a correlation.

No strong correlation with epoch changes for the one node we have traced right now, especially not to the degree of seconds. But our sample size is currently 2.

(screenshot: StoreUpdate::commit durations overlaid with near_epoch_height)

nagisa commented 5 months ago

Looks like there is some correlation between neard restarts and 1-second writes, which probably has something to do with catchup? But it doesn't explain everything, as there are hits after a boot as well. I guess those could also be missed chunks that the node needs to download? Let me check whether we have a metric for that.

RocksDB caches being cold could explain why garbage collection takes a long time in those scenarios.

bowenwang1996 commented 5 months ago

> Probably not, as we wouldn't be guaranteed to have the data in our store be up to date when the operations in the next block/chunk/etc. check on it.

Yeah, that would be nice. I don't think there is a need to block the write on the foreground thread as long as we can catch the error (if any) somehow.

> which probably has something to do with catchup? But it doesn't explain everything, as there are hits after a boot as well

Maybe, but it is still block/chunk processing, just at a higher frequency, so I am not sure how that would affect commit directly.

nagisa commented 5 months ago

One piece of instrumentation I added is the number of transaction ops, so that we can see whether there is a relationship between the number of ops and the duration a commit takes to execute. Looking at some of the spans, they take 700ms to execute even though there's only one transaction for them to process!

Then there's one that needs as much time for 17 transactions, and another that handles 12271 transactions (that makes it about 50µs per transaction on average, but my suspicion is that we might have one slow transaction in there and plenty of fast ones...).

I think I'll focus on the 1-transaction commits for now and see if I can glean anything interesting there.
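Conceptually the added instrumentation is along these lines (a hedged sketch with illustrative names, not nearcore's exact code): attach the op count as a field on the commit span so the tracing backend can correlate it with the span's duration.

```rust
// Illustrative names only; nearcore's actual instrumentation differs.
use tracing::debug_span;

struct DBTransaction {
    ops: Vec<DBOp>,
}

enum DBOp {
    Set { key: Vec<u8>, value: Vec<u8> },
    Delete { key: Vec<u8> },
}

fn commit(transaction: &DBTransaction) {
    // Record the op count as a span field so traces can be filtered/plotted
    // by transaction size alongside the span's duration.
    let span = debug_span!("StoreUpdate::commit", num_ops = transaction.ops.len());
    let _guard = span.enter();
    // ... convert `transaction` into a RocksDB WriteBatch and write it here ...
}

fn main() {
    let tx = DBTransaction {
        ops: vec![DBOp::Set { key: b"k".to_vec(), value: b"v".to_vec() }],
    };
    commit(&tx);
}
```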


EDIT: Here's a particularly curious one: 0 transactions, 755ms to execute?! It looks like a compaction was running at the time this operation was triggered.

Another instance seems to be closely correlated with a WAL flush. I'm waiting for more data so that conclusions can be drawn from a larger sample.

Some good reading on Write Stalls, but we don't seem to be stalling at all:

```
** DB Stats **
Uptime(secs): 1203.4 total, 600.1 interval
Cumulative writes: 58K writes, 11M keys, 58K commit groups, 1.0 writes per commit group, ingest: 7.84 GB, 6.67 MB/s
Cumulative WAL: 58K writes, 0 syncs, 58783.00 writes per sync, written: 7.84 GB, 6.67 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 28K writes, 6739K keys, 28K commit groups, 1.0 writes per commit group, ingest: 3654.91 MB, 6.09 MB/s
Interval WAL: 28K writes, 0 syncs, 28621.00 writes per sync, written: 3.57 GB, 6.09 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
Write Stall (count): write-buffer-manager-limit-stops: 0,
```

> While 1s is very rare, I think it routinely goes above 50ms. Maybe that is not trivial given the amount of data we write, but I thought RocksDB does the actual writes in the background.

FWIW, from what I can tell, my node doesn't actually do this at the moment for some reason. There are more of these writes around the 30ms mark, but nothing between 50ms and 500ms over the past half hour since I started a fresh node.

nagisa commented 5 months ago

One of the reasons why my and Bowen's views on the severity of this issue may differ is the inconsistent I/O performance between our GCP regions. Exactly the same instance configuration results in wildly different I/O numbers between our machines (note the X-axis!)

Histogram for eu-west showing about 250 observations around 2.5ms duration with a tail extending to 10ms

Histogram for us-west showing about 200 observations around 5ms duration with a tail extending to 17ms

So a not-terrible way forward here would be to draft some neard tool that would allow an operator to quickly estimate whether they got a VM in a "good" neighbourhood.
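Such a tool could be as simple as a loop of small synced writes with a latency report at the end. A rough sketch of that kind of probe (not an existing neard subcommand; the file name and sample count are arbitrary):

```rust
// Crude I/O-neighbourhood probe: time a series of small fsync'd writes and
// report the spread, so an operator can compare against a known-good baseline.
use std::fs::OpenOptions;
use std::io::Write;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let mut file = OpenOptions::new()
        .create(true)
        .write(true)
        .open("io-probe.tmp")?;
    let payload = vec![0u8; 4096];
    let mut samples_us = Vec::with_capacity(256);

    for _ in 0..256 {
        let start = Instant::now();
        file.write_all(&payload)?;
        file.sync_data()?; // force the write to hit the device, like a WAL sync
        samples_us.push(start.elapsed().as_micros());
    }

    samples_us.sort_unstable();
    println!("p50: {}us", samples_us[samples_us.len() / 2]);
    println!("p99: {}us", samples_us[samples_us.len() * 99 / 100]);
    Ok(())
}
```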

nagisa commented 5 months ago

It turns out the graphs above had at some point been visualizing the wrong function. The observation still stands, but the graphs look different.

> One of the reasons why my and Bowen's views on the severity of this issue may differ is the inconsistent I/O performance between our GCP regions. Exactly the same instance configuration results in pretty obviously different I/O numbers between our machines

(two histograms: the corrected I/O duration distributions for eu-west and us-west)

> So a not-terrible way forward here would be to draft some neard tool that would allow an operator to quickly estimate whether they got a VM in a "good" neighbourhood.

That said, we still have some very long spans even on eu-west. Looking at some of the >50ms StoreUpdate::commits that occurred overnight, it's again been a mishmash of all sorts of things, but one thing remains true: there are occasionally transactions that contain a single operation, e.g. setting BlockMisc:LATEST_KNOWN (col:key) to a 16-byte value, that take over 50ms, or in a fringe case setting BlockMisc:LARGEST_TARGET_HEIGHT to an 8-byte value, which took a whole 700ms to process (grafana)

Unfortunately RocksDB stopped writing its log file, so I don't have the logs to check what exactly was going on at the time... :cry:

nagisa commented 5 months ago

The additional instrumentation around RocksDB is quite revealing, actually, and so are the logs.

So, I found an instance where there is a RocksDB::write call that starts at 13:33:58.437 and takes 1.74 seconds to process. However, this span is not associated with any parent span, and the only knowledge of its origin we have is that it's processed by a tokio-runtime-worker thread (so it could basically be anywhere...). The only thing I know is that this is an absolutely humongous write/transaction, as it takes a full 800ms just to produce the rocksdb::WriteBatch structure.

Then right after that the suspect 1-transaction RocksDB::write that's part of our long-running StoreUpdate::commit takes place at 13:33:59.256 and runs for 872ms.

The RocksDB log file has the following line at roughly this time:

```
2024/04/10-13:34:00.128428 136801 [db/db_impl/db_impl_write.cc:1638] Flushing all column families with data in WAL number 56376440. Total log size is 1546364923 while max_total_wal_size is 1073741824
```

If we trust this log, that 1.7-second transaction contains roughly 500M worth of data to process (the WAL overshot its 1073741824-byte max_total_wal_size by about 470MB, and an earlier write would otherwise have triggered the flush). I don't know what kind of operations are in this huge transaction, but the fact that it is going into the WAL suggests that they are writes of some kind.

The behaviour looks remarkably consistent. First there is this humongous write from who knows where. Then there's a small write from somewhere in handle or check_triggers that tries to follow and is "stalled" (vocab warning: this is not a proper RocksDB "stall"...).

My intuition, given how the timing aligns between the ends of these spans, is that RocksDB does the initial processing of an injected WriteBatch in a single-threaded/blocking fashion. So it gets on the order of 500M to process, and while it is busy copying that out to its internal machinery, it can't get to the next WriteBatch, which is our long but light one.

Whatever the operation that causes these huge writes is, it is not frequent and especially not obviously periodic…

TraceQL: `{ name = "RocksDB::write" && duration > 250ms }`


This investigation already suggests a fix -- write the data in smaller transactions. However, this also poses a problem -- whatever the task causing these extremely long writes is, if it needs to write all that data in a transactional manner, then we'd have to find other ways to make it transactional (snapshots? what do we do about all the writes that might have happened in between?). A rough sketch of the chunked-write idea follows.
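The sketch below shows what chunking could look like, under the big assumption that the producer of the huge write does not actually need cross-chunk atomicity; the types here are simplified stand-ins rather than nearcore's real DBTransaction.

```rust
// Sketch of the "smaller transactions" idea; illustrative types only.
use rocksdb::{WriteBatch, DB};

const MAX_OPS_PER_BATCH: usize = 1_000;

fn write_in_chunks(db: &DB, ops: Vec<(Vec<u8>, Vec<u8>)>) -> Result<(), rocksdb::Error> {
    for chunk in ops.chunks(MAX_OPS_PER_BATCH) {
        let mut batch = WriteBatch::default();
        for (key, value) in chunk {
            batch.put(key, value);
        }
        // Each smaller batch spends less time in the caller thread, so a light
        // follow-up commit is no longer stuck behind one giant WriteBatch.
        // The price: a crash between chunks leaves a partially applied update.
        db.write(batch)?;
    }
    Ok(())
}
```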

nagisa commented 5 months ago

https://github.com/near/nearcore/pull/11018 has improved on the largest outliers tremendously.

There are still StoreUpdate::commits that take around 50ms or a little more, at most 75ms. Those are all part of the postprocess_block function, and they all handle >20k DBTransactions at once. Unfortunately I don't see how we could possibly reduce this number or split the transaction up, as these are all (as far as I can tell) part of the store operations that contracts have applied in a block.

One avenue of improvement is that each StoreUpdate::commit spends roughly half of its total time converting from DBTransaction to a RocksDB WriteBatch and the other half calling rocksdb_write. I don't see any plausible way for us to improve that last step, and we will eventually pay for constructing the WriteBatch no matter how we look at it.

However, that raises the question of how much time exactly we spend constructing those DBTransactions in the first place.

Well, most of the time outside of the RocksDB work actually goes into ShardTrie operations. And the reason these take roughly as much time as the conversion to WriteBatch is that the two operations do roughly the same sort of thing: they both allocate a copy of the value data (remember, there can be as many as 20k such pieces of data), and these allocations seem to be unavoidable in either case due to how the lifetimes work out. While I haven't measured it, I also believe that most of the time in these allocations actually goes into memcpy of the data buffers rather than allocator overhead, so I wouldn't expect much improvement here either.

By subtraction, then, construction of the DBTransactions themselves should be relatively cheap, which is a surprising conclusion given that DBTransactions also hold Vecs of data. We could explore some tricks to construct WriteBatches directly here (sketched below), but that's pretty involved and the gains are debatable.
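For what it's worth, the "construct WriteBatches directly" trick would look roughly like this (illustrative types, not nearcore's code); it skips the intermediate DBTransaction Vecs but still pays for the one copy into the batch's internal buffer.

```rust
// Illustrative only: append trie updates straight into a rocksdb::WriteBatch
// instead of materialising an intermediate DBTransaction and converting later.
use rocksdb::WriteBatch;
use std::sync::Arc;

fn build_batch_directly(updates: &[(Vec<u8>, Arc<[u8]>)]) -> WriteBatch {
    let mut batch = WriteBatch::default();
    for (key, value) in updates {
        // `put` copies the bytes into the batch's internal buffer, which is the
        // one allocation/memcpy we cannot avoid before handing off to RocksDB.
        batch.put(key, value.as_ref());
    }
    batch
}
```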

All this is to say: I don't think we will be able to extract any appreciable wins here, unfortunately.

nagisa commented 5 months ago

One minor avenue of improvement:

ShardTrie stores value data as Arc<[u8]>s, but when building up a DBTransaction we allocate a fresh Vec<u8>, including for the reference-count changes. Since we have the Arcs for the data anyway, we could perhaps have DBTransaction store basically just a reference, and for reference counts we could encode those into an on-stack buffer, since they are always the same 8 bytes...

EDIT: never mind: this is complicated by the fact that the reference counts are attached to the value...
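To make that caveat concrete: assuming a refcounted-column layout where the reference-count delta is appended to the value bytes (an assumption made here purely for illustration), holding an Arc to the bare value doesn't actually save the copy.

```rust
// Sketch of why the Arc<[u8]> idea runs into trouble under the assumed layout
// (value bytes followed by an 8-byte refcount delta).
use std::sync::Arc;

fn encode_refcounted(value: &Arc<[u8]>, rc_delta: i64) -> Vec<u8> {
    // Because the delta lives *inside* the stored bytes, we cannot just hand
    // RocksDB the existing Arc; we have to build a fresh buffer that contains
    // both the value and the delta, i.e. the very copy we were trying to avoid.
    let mut out = Vec::with_capacity(value.len() + 8);
    out.extend_from_slice(value);
    out.extend_from_slice(&rc_delta.to_le_bytes());
    out
}
```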

nagisa commented 5 months ago

An idea we had on a call is that we don't actually need the one big block-finalize transaction to block the processing of the other tasks for that block, and perhaps not even the processing of parts of the next block. We could write it out in a thread. It remains to be seen how troublesome that will be, but I didn't immediately think of any major blockers... A rough sketch of what that could look like is below.
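A very rough sketch of that idea (all names hypothetical): hand finished batches to a dedicated writer thread over a channel, and surface errors when joining instead of blocking block processing. The caveat from earlier in the thread still applies: nothing downstream may rely on the data being readable before the background write lands.

```rust
// Hypothetical background-writer sketch; not nearcore's actual design.
use rocksdb::{WriteBatch, DB};
use std::sync::{mpsc, Arc};
use std::thread;

type Ops = Vec<(Vec<u8>, Vec<u8>)>;

fn spawn_writer(
    db: Arc<DB>,
) -> (mpsc::Sender<Ops>, thread::JoinHandle<Result<(), rocksdb::Error>>) {
    let (tx, rx) = mpsc::channel::<Ops>();
    let handle = thread::spawn(move || {
        // Batches are applied in the order they were sent; the first error
        // stops the writer and is reported when the handle is joined.
        for ops in rx {
            let mut batch = WriteBatch::default();
            for (key, value) in ops {
                batch.put(key, value);
            }
            db.write(batch)?;
        }
        Ok(())
    });
    (tx, handle)
}
```

The sender can be cloned and handed to whatever finalizes a block; dropping all senders ends the writer thread, and joining the handle surfaces any write error.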

nagisa commented 4 months ago

I'm not currently working on this anymore, so I'm unassigning myself. I did make a few attempts to extract the Arc<[u8]> improvement, but I did not get enough uninterrupted time to follow through with it. It is a pretty involved refactor, I think, and the wins are probably not very significant either, making this a fairly low-priority thing to work on. I did not attempt to parallelize commits; that is also a pretty involved change.

Otherwise the conclusions in the comments above stand true.