Investigate potential lock contention in DBImpl::WriteImpl when writing to the PartitionStore

restatedev / restate

Restate is the platform for building resilient applications that tolerate all infrastructure faults w/o the need for a PhD.

https://docs.restate.dev


Investigate potential lock contention in DBImpl::WriteImpl when writing to the PartitionStore #1891

Open tillrohrmann opened 2 weeks ago

tillrohrmann commented 2 weeks ago

While benchmarking Restate, I noticed that we spend a lot of time in rocksdb::DBImpl::WriteImpl when committing the PartitionStoreTransaction from the different partition processors. I suspect that this might be caused by lock contention. Unfortunately, the flamegraphs on macOS don't give more insight.

The results of the throughput/parallel benchmark on main (361e6a8055965ed94b4cd8810642d846aa25f7df) were:

throughput/parallel     time:   [397.84 ms 412.47 ms 426.13 ms]
                        thrpt:  [9.3868 Kelem/s 9.6976 Kelem/s 10.054 Kelem/s]

flamegraph

tillrohrmann commented 2 weeks ago

I've tried a simple experiment where every PartitionStore gets its own RocksDB instance to avoid contention completely. The results of the throughput/parallel benchmark are:

throughput/parallel     time:   [354.54 ms 359.25 ms 364.08 ms]
                        thrpt:  [10.986 Kelem/s 11.134 Kelem/s 11.282 Kelem/s]

and the flamegraph no longer shows time spent waiting for the lock when writing to the PartitionStore (DBImpl::WriteImpl):

flamegraph
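
To make the experiment above concrete, here is a minimal Rust sketch of the two setups: all partitions writing through one shared store versus each partition owning its own store. It is only an illustration under stated assumptions. `Store`, `run_writers`, `shared_store`, and `per_partition_stores` are hypothetical names, and a `Mutex<Vec<u64>>` stands in for a write path that serializes writers. It does not use RocksDB or any Restate code, and it does not model RocksDB's actual write path.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a store. The mutex models a write path that
/// serializes writers; it is NOT how RocksDB is implemented internally.
type Store = Arc<Mutex<Vec<u64>>>;

/// Spawns one writer thread per entry in `stores`. Each thread appends
/// `writes` values to its store. Returns the wall-clock time until all
/// threads have finished.
fn run_writers(stores: Vec<Store>, writes: u64) -> Duration {
    let start = Instant::now();
    let handles: Vec<_> = stores
        .into_iter()
        .map(|store| {
            thread::spawn(move || {
                for i in 0..writes {
                    // Every write takes the store's lock, like a commit would.
                    store.lock().unwrap().push(i);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    start.elapsed()
}

/// All partitions share a single store, so all writers compete for one lock.
fn shared_store(partitions: usize) -> Vec<Store> {
    let store: Store = Arc::new(Mutex::new(Vec::new()));
    (0..partitions).map(|_| Arc::clone(&store)).collect()
}

/// Each partition gets its own store, so no lock is shared between writers.
fn per_partition_stores(partitions: usize) -> Vec<Store> {
    (0..partitions)
        .map(|_| Arc::new(Mutex::new(Vec::new())))
        .collect()
}

fn main() {
    let (partitions, writes) = (8, 100_000);
    let shared = run_writers(shared_store(partitions), writes);
    let separate = run_writers(per_partition_stores(partitions), writes);
    println!("shared store:         {shared:?}");
    println!("per-partition stores: {separate:?}");
}
```

The printed timings depend on the machine and say nothing about Restate itself. The sketch only shows the shape of the comparison: the same writers and the same number of writes, with the single shared lock as the only variable.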