paritytech / substrate

Substrate: The platform for blockchain innovators
Apache License 2.0

DB weights used in Substrate/Polkadot seem to be invalid #7212

Open maxsam4 opened 4 years ago

maxsam4 commented 4 years ago

The DB benchmarks ::trie::read and ::trie::write in node-bench do not actually read from disk. The results they produce are almost identical irrespective of which disk is used, i.e. the same results on an NVMe drive and on a spinning HDD. I then looked at the disk I/O stats: both the read and write benchmarks write a lot of data to the disk as they clone the temporary DB, but they read almost no data from it (checked with tools like iotop on Linux).

Also, the bench uses 100% of a CPU core, and RAM usage increases with the size of the DB being tested. I think the DB gets cached in RAM and the results are CPU-bound, not disk-bound, which defeats the purpose of read/write benchmarks. The caching is done by the system when the benchmark clones the temporary DB. I used https://github.com/Feh/nocache to disable system caching and it increased the numbers by an order of magnitude. This suggests that the numbers used on Polkadot/Kusama and other chains are much lower than they should be, which opens up an attack vector: the DB in the real world will not be fully cached in memory, so the worst-case and average-case read costs will be higher than the current numbers suggest.

Below are the results with and without caching on my NVMe disk:

```
./target/release/node-bench ::trie::read::medium
2020-09-25 12:38:17.438 main INFO node_bench  Trie read benchmark(medium database (100,000 keys), db_type: RocksDb): avg 23.0 µs, w_avg 22.4 µs

nocache ./target/release/node-bench ::trie::read::medium
2020-09-25 12:39:37.122 main INFO node_bench  Trie read benchmark(medium database (100,000 keys), db_type: RocksDb): avg 0.2734 ms, w_avg 0.2661 ms
```
gui1117 commented 4 years ago

cc @NikVolf @shawntabrizi

NikVolf commented 4 years ago

As already discussed, any measurements "without cache" are pointless, since real systems use cache. We benchmark with empty initial cache (which approximates "cold-cache attack"), not without cache.

As for the issue of the benchmarks being the same on any kind of drive, I will investigate.

maxsam4 commented 4 years ago

> As already discussed, any measurements "without cache" are pointless, since real systems use cache. We benchmark with empty initial cache (which approximates "cold-cache attack"), not without cache.

My point is that the current benchmarks are not using empty/cold cache.

When the benchmark generates the temporary DB, the system caches it in memory, and when the benchmark reads that data, the system returns it from memory rather than reading it from the disk. Even on the first read, you never hit the disk. Also, nocache does not disable the RocksDB cache, so any further reads are still cached by RocksDB.

In a "cold-cache attack", the first read will hit the disk since it won't be cached by the system. Later reads will be served from cache, but that happens when using nocache as well. Therefore, this benchmark is not a good representation of cold-cache attacks.

To reiterate: system-level caches are different from the RocksDB cache. In a "cold-cache attack", both the system and RocksDB caches miss. In the current benchmarks, however, only the RocksDB cache misses; the system-level cache still hits.

nocache disables system-level caching but not RocksDB caching, so it more closely resembles a "cold-cache" situation.

NikVolf commented 3 years ago

@maxsam4 Still, a lot of internal machinery in different systems can rely on a working system cache, so disabling it is a no-go.

Can you test https://github.com/paritytech/substrate/pull/7242 on your spinning disk if it is not much trouble?

maxsam4 commented 3 years ago

> @maxsam4 Still, a lot of internal machinery in different systems can rely on a working system cache, so disabling it is a no-go.

> Can you test #7242 on your spinning disk if it is not much trouble?

I tried something similar, didn't work: https://github.com/PolymathNetwork/Polymesh/commit/cdbe62a3185a7c1913865238dd607b485b89686d

Apparently, that only drops the read cache. AFAIK, there's no way to drop the write cache without hacking the system APIs (which is what nocache does).

FWIW https://github.com/PolymathNetwork/Polymesh/commit/38f30b7dc0989cb205dd15142038f55afdaef854 makes it a lot easier to do these tests on different disks.

NikVolf commented 3 years ago

> I tried something similar, didn't work: PolymathNetwork/Polymesh@cdbe62a

Are you sure that Command::new actually runs anything?

> FWIW PolymathNetwork/Polymesh@38f30b7 makes it a lot easier to do these tests on different disks.

I think it's worth mentioning that the benchmark had better be run with TMPDIR=<path somewhere on the disk to benchmark>, aye

NikVolf commented 3 years ago

> Apparently, that only drops the read cache. AFAIK, there's no way to drop the write cache without hacking the system APIs (which is what nocache does).

I've updated the PR to try to address the write buffer (not sure it is always or often used as a cache, though, but anyway)

maxsam4 commented 3 years ago
```
TMPDIR=/media/max/Data/bench ./target/release/node-bench ::trie::read::medium
Oct 01 18:19:09.946  INFO node_bench: Starting Trie read benchmark(medium database (100,000 keys), db_type: RocksDb)
Oct 01 18:21:25.933  INFO node_bench: Trie read benchmark(medium database (100,000 keys), db_type: RocksDb): avg 79.9 µs, w_avg 88.3 µs
```

Still looks too fast for a spinning disk. In your latest commit, you are rotating 64 * 32 = 2048 MB of RAM, but I have 16 GB on my system, so not all of the cache is being cleared.

One way to ensure cache cleanup would be to create all the temp DBs in one go, ask the user to restart the system (that should clear all caches), and then read values from the DBs in the next run.

maxsam4 commented 3 years ago

BTW, I am not sure the std::process::Command invocations are being executed at all.

NikVolf commented 3 years ago

You need to call output() or status() for it to execute

maxsam4 commented 3 years ago

> You need to call output() or status() for it to execute

Even with output(), I'm not sure it's doing anything.

The following prints nothing.

```rust
println!("{:?}", std::process::Command::new("thisshouldfail").output().unwrap());
println!("{:?}", std::process::Command::new("echo").args(&["yolo"]).output().unwrap());
```

I don't see any benchmark-tmp created by your command either. I increased its size to more than the free space on my disk and also tried removing the rm command, but still nada.

bkchr commented 3 years ago

@NikVolf was this fixed with your last PR?