paritytech / parity-db

Experimental blockchain database

Performance degrading a lot with high number of keys #212

Open crystalin opened 1 year ago

crystalin commented 1 year ago

Running the Storage Benchmark on 3 different networks with significant state size/content produces inconsistent results. We have been using Moonbeam v0.32.1, which is based on substrate 0.9.40. The Alphanet and Moonriver networks have similar state/usage overall, but Moonbeam had a project that generated a huge amount of storage entries (all of the same size, 42 bytes IIRC).

As you can see, the Moonbeam read and write weights using paritydb are way off from the expected results that we see on Alphanet and Moonriver.

The disk configuration is AWS gp3 | 1000 GiB | 3000 IOPS, and each network/db has its own disk (6 disks in total). Blocks and state are pruned to avoid using a huge amount of disk space.

Running the storage benchmark (on c6i.4xlarge AWS):

/home/ubuntu/projects/moonbeam/target/release/moonbeam
   benchmark
   storage
   --db=${DB}
   --state-version=0
   --mul=1.1
   --weight-path  /home/ubuntu/projects/moonbeam/weights-${DB}-${NETWORK}.rs
   --chain ${NETWORK}
   --base-path /var/lib/${DB}-${NETWORK}-data

for each chain

Alphanet (~20M keys):

pub const RocksDbWeight: RuntimeDbWeight = RuntimeDbWeight {
  read: 65_167 * constants::WEIGHT_REF_TIME_PER_NANOS,
  write: 114_721 * constants::WEIGHT_REF_TIME_PER_NANOS,
};
pub const ParityDbWeight: RuntimeDbWeight = RuntimeDbWeight {
  read: 16_290 * constants::WEIGHT_REF_TIME_PER_NANOS,
  write: 65_374 * constants::WEIGHT_REF_TIME_PER_NANOS,
};

Moonriver (~30M keys in state):

pub const RocksDbWeight: RuntimeDbWeight = RuntimeDbWeight {
  read: 66_865 * constants::WEIGHT_REF_TIME_PER_NANOS,
  write: 114_947 * constants::WEIGHT_REF_TIME_PER_NANOS,
};
pub const ParityDbWeight: RuntimeDbWeight = RuntimeDbWeight {
  read: 14_483 * constants::WEIGHT_REF_TIME_PER_NANOS,
  write: 64_545 * constants::WEIGHT_REF_TIME_PER_NANOS,
};

Moonbeam (~110M keys in state):

pub const RocksDbWeight: RuntimeDbWeight = RuntimeDbWeight {
  read: 33_439 * constants::WEIGHT_REF_TIME_PER_NANOS,
  write: 86_828 * constants::WEIGHT_REF_TIME_PER_NANOS,
};
pub const ParityDbWeight: RuntimeDbWeight = RuntimeDbWeight {
  read: 177_320 * constants::WEIGHT_REF_TIME_PER_NANOS,
  write: 69_450 * constants::WEIGHT_REF_TIME_PER_NANOS,
};

In addition to the paritydb numbers, we can also see that the RocksDB average read on Moonbeam (110M keys) is about 50% of the one on Moonriver (30M keys), which might be related to the data on Moonbeam being on average smaller than on Moonriver.

Details about the benchmark output can be found here: https://gist.github.com/crystalin/8e790a554b246e077c83ad04c04f330c

crystalin commented 1 year ago

Additionally, it took something like 40h to generate the Moonbeam storage benchmark.

ggwpez commented 1 year ago

@cheme do you have an idea why the ParityDB time for read on the 110M keys DB is so much slower than Rocks when it is normally faster?

crystalin commented 1 year ago

You can download a recent Moonbeam state: https://s3.console.aws.amazon.com/s3/object/alan-stuff?region=us-east-1&prefix=moonbeam-state-3631095.json.lz4 (10GB) if you want to check it

cheme commented 1 year ago

That is definitely not expected. I could imagine worse access times on a big mmap'd memory, but not in these proportions. I can also think of the data not being correctly built (there is reindexing running in the background every N values, but this is flushed on exit/start).

" based on substrate 0.9.40." : is it substrate version (looks old)? Would be interesting to have the parity-db version listed in the Cargo.lock (a version from a few month ago did have an issue that could explan some bad behavior cc\ @arkpar ).

Edit: according to https://github.com/PureStake/moonbeam/blob/6ed87ceeb65db27a9b2ce7ff32b90d062540bd67/Cargo.lock#L8942 the parity-db version is 0.4.6, which does not include https://github.com/paritytech/parity-db/pull/206, but I don't expect that to be related.

crystalin commented 1 year ago

I'm happy to cherry-pick some changes on top of it if you want to test a few things. You can also probably reproduce by using the snapshot I provided.

cheme commented 1 year ago

I'm happy to cherry-pick some changes on top of it if you want to test a few things. You can also probably reproduce by using the snapshot I provided.

I would use the latest version of parity-db (cargo update -p parity-db), but then it would only really make sense when syncing the snapshot from scratch.

Something I am wondering right now: did the memory consumption stay reasonable during the process (looking at the bench code, I suspect it could put many items in memory)?

Edit: I just realized the snapshot is in json format, so there is no need to resync.

cheme commented 1 year ago

Actually, it would be better to use a patched parity-db master that includes https://github.com/paritytech/parity-db/pull/211.

crystalin commented 1 year ago

Ok, I'll try that if I find time (also be aware that the benchmark took 40 hours, so I won't get results quickly).

ggwpez commented 1 year ago

I can't even import the snapshot on a 64GB server… do you use 128GB?

arkpar commented 1 year ago

I've tried using warp sync on moonbeam. The sync went fine, although peak RAM usage was over 130GB. However, the parachain is not finalizing blocks; the finalized block is still at zero. Is this a known issue? Unfinalized blocks are stored differently in the DB and this may affect performance.

arkpar commented 1 year ago

As for possible performance issues, it could be affected by how the benchmark is implemented. RocksDB uses its own caching, while ParityDb relies on the OS cache. IIRC the benchmark warmup touches a few of the keys, and for RocksDB this causes a lot more data to be pre-cached.

crystalin commented 1 year ago

@arkpar warp sync is not fully supported yet; we are still working on it. I also suspect the benchmark implementation is the reason for those unexpected values, but it is hard/slow to verify.

crystalin commented 1 year ago

@arkpar were you able to reproduce? Let me know if I can help otherwise

arkpar commented 12 months ago

I could not access the snapshot linked above. It requires AWS registration and asks for my credit card number. I've started regular sync instead and it looks like it will take 3-4 days.

arkpar commented 11 months ago

@crystalin Could you give it a test with parity-db 0.4.10? cargo update -p parity-db should do it

crystalin commented 11 months ago

I'm running it now. This time I looked at the CPU load and IO load, and during the benchmark:

ggwpez commented 11 months ago

If the DB benchmark time is a major problem then we could add a flag to only read 10% or 1% of the total keys (randomly selected). That way you would have some preliminary results for faster iteration. Do you think that would help?
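
For illustration, the sampling step itself could be quite small. A rough sketch only (the sample_fraction parameter and the surrounding plumbing are hypothetical, not an existing benchmark flag):

use rand::seq::SliceRandom;

// Keep only a random fraction (e.g. 0.01 for 1%) of the state keys to benchmark.
fn sample_keys(all_keys: &[Vec<u8>], sample_fraction: f64) -> Vec<Vec<u8>> {
  let n = ((all_keys.len() as f64) * sample_fraction).ceil() as usize;
  let mut rng = rand::thread_rng();
  // Pick `n` distinct keys uniformly at random so the subset stays representative.
  all_keys.choose_multiple(&mut rng, n).cloned().collect()
}

The actual read/write measurement would then run over the sampled keys instead of the full key set.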

crystalin commented 11 months ago

That could make sense, yes, a percentage flag.

crystalin commented 11 months ago

The warmup round just finished; I might get results this weekend. (Also, memory jumped to 95%.)

crystalin commented 11 months ago

I was able to run it (with substrate 0.9.43 and paritydb 0.4.10). It took 3 days to finish:

pub const ParityDbWeight: RuntimeDbWeight = RuntimeDbWeight {
  read: 182_722 * constants::WEIGHT_REF_TIME_PER_NANOS,
  write: 60_176 * constants::WEIGHT_REF_TIME_PER_NANOS,
};

(No improvement at all)

cheme commented 11 months ago

I did look a bit more into switching the chainspec loading to something that does not load the whole state in memory, but it is more work than I expected (it breaks the genesis build api quite a bit, since we need to do multiple commits while using a streaming json parser), so I am postponing doing this myself for now. Still, I got a better understanding of the benchmarking process: it just uses the standard chainspec loading, which means the full state is sent to parity-db and the bench then runs on a db that just had a lot of keys injected. So the db may still be doing one or two levels of table reindexing while the benchmark runs, which would explain the performance issue.

This can be checked by doing "ls" on the db directory and looking at the files for the state column: if it is still reindexing the state, there will be multiple files named paritydb/full/index_01_xx, with xx being the index size.

If this is the case, I do not have a simple way of ensuring the reindexing finishes (changing the default index size in paritydb could be a hacky solution).
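
For reference, the same check can be done programmatically; a minimal sketch using only std::fs, assuming the file layout described above (state column index files named index_01_xx under paritydb/full):

use std::path::Path;

// More than one index_01_* file in the db directory means the state column is
// still being reindexed into a larger index.
fn state_column_still_reindexing(db_dir: &Path) -> std::io::Result<bool> {
  let mut index_files = 0;
  for entry in std::fs::read_dir(db_dir)? {
    let name = entry?.file_name();
    if name.to_string_lossy().starts_with("index_01_") {
      index_files += 1;
    }
  }
  Ok(index_files > 1)
}

Here db_dir would point at the paritydb/full directory under the node's base path.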

The following change in substrate would allow flushing the logs but would not force all reindexing to finish.

--- a/bin/node/cli/src/command.rs
+++ b/bin/node/cli/src/command.rs
@@ -127,6 +127,8 @@ pub fn run() -> Result<()> {
                                        ),
                                        #[cfg(feature = "runtime-benchmarks")]
                                        BenchmarkCmd::Storage(cmd) => {
+                                               // load once first to ensure db is flushed.
+                                               new_partial(&config)?;
                                                // ensure that we keep the task manager alive
                                                let partial = new_partial(&config)?;
                                                let db = partial.backend.expose_db();

but it would also need to keep the db open for a while until everything is reindexed.

Maybe simply do the bench in two steps (a first run that loads the state and lets the db finish reindexing, then a second run for the actual measurement).

Or implement a primitive that ensures all reindexing is finished in paritydb and use it before calling new_partial a second time (but it would not be very elegant, as the code at this level does not assume a specific db).
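
In the absence of a dedicated parity-db API for this, one way to approximate such a primitive from the node side would be to keep the db open and poll the index files until reindexing looks done, reusing the state_column_still_reindexing check sketched above (again only a sketch based on the observed file layout, not on any parity-db API):

use std::{path::Path, thread, time::Duration};

// Block until background reindexing of the state column appears finished.
// The db handle must stay open elsewhere so the background threads keep running.
fn wait_for_state_reindexing(db_dir: &Path) -> std::io::Result<()> {
  while state_column_still_reindexing(db_dir)? {
    thread::sleep(Duration::from_secs(10));
  }
  Ok(())
}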

crystalin commented 10 months ago

Thank you,

I think we did run the node with no connection (we often do for other profiling parts) before running the benchmark, but I can try again to see if that helps.

I think having substrate support the storage benchmark on a subset of the state would probably be more effective in that case.

ggwpez commented 10 months ago

Yes, I hope to get https://github.com/paritytech/polkadot-sdk/issues/146 picked up by some newcomer to solve. I have forwarded it to a PBA student now.

crystalin commented 9 months ago

Outside of the storage benchmark, the performance of paritydb is also generally worse than rocksdb when the state is large (100M+ keys) and the node is running as an archive (I don't know how to measure the total number of keys in the db itself):

ParityDb:

2023-09-12T15:34:57.659Z utils:storage-query Queried 55384 keys @ 2769 keys/sec, 34 MB heap used
2023-09-12T15:35:02.659Z utils:storage-query Queried 82671 keys @ 3307 keys/sec, 46 MB heap used
2023-09-12T15:35:07.659Z utils:storage-query Queried 103743 keys @ 3458 keys/sec, 27 MB heap used
2023-09-12T15:35:12.659Z utils:storage-query Queried 130776 keys @ 3736 keys/sec, 21 MB heap used
2023-09-12T15:35:17.659Z utils:storage-query Queried 159459 keys @ 3986 keys/sec, 33 MB heap used
2023-09-12T15:35:22.659Z utils:storage-query Queried 184760 keys @ 4106 keys/sec, 18 MB heap used

RocksDb:

2023-09-12T15:36:44.978Z utils:storage-query Queried 520850 keys @ 17358 keys/sec, 30 MB heap used
2023-09-12T15:36:49.978Z utils:storage-query Queried 638850 keys @ 18249 keys/sec, 17 MB heap used
2023-09-12T15:36:54.979Z utils:storage-query Queried 784850 keys @ 19618 keys/sec, 15 MB heap used
2023-09-12T15:36:59.979Z utils:storage-query Queried 894850 keys @ 19882 keys/sec, 20 MB heap used
2023-09-12T15:37:04.981Z utils:storage-query Queried 975850 keys @ 19514 keys/sec, 24 MB heap used