Feature: optimize p2p block parsing

romanz commented 3 years ago

Is your feature request related to a problem? Please describe. https://github.com/romanz/electrs/pull/546#issuecomment-940740615

Additional context We currently use https://docs.rs/bitcoin/0.27.1/bitcoin/network/stream_reader/struct.StreamReader.html which can hopefully be optimized for our use-case.

Kixunil commented 3 years ago

I'm thinking of the best way to do this and here are my thoughts:

If we receive a block we should put its data into single Vec<u8> and use references to those data.
We need to iterate through whole block and hash each transaction and each output when filtering. So let's just cache those values once we have them ready.
Above requires (immutable) self-referential data structure, we could use bytes crate but I think there's a sane way to avoid doing atomic operation for each transaction and output. We could build a safe abstraction for a cache that can store such things. I think my idea should not be too hard to review regarding safety.

Kixunil commented 3 years ago

Unrelated but I'm playing with perf tool (apt install perf-tools-unstable) and got this during sync:

perf top -p <electrs_pid>

  28.90%  electrs              [.] bitcoin_hashes::sha256::HashEngine::process_block                                                                                                                               
  27.16%  electrs              [.] rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::RecomputeSpliceLevels                                                                                      
   6.93%  libc-2.28.so         [.] __memcmp_sse4_1                                                                                                                                                                 
   5.76%  electrs              [.] rocksdb::MemTable::KeyComparator::operator()

Will try bitcoin eater address once it syncs.

romanz commented 3 years ago

Would it be possible to share a flamegraph please?

Kixunil commented 3 years ago

Would prefer to not run bespoke unverified commands on my mainnet instance. Looks like it could be solved with this: https://github.com/flamegraph-rs/flamegraph/issues/52

TheBlueMatt commented 3 years ago

Would be interesting to see where the hashing is - are you verifying the merkle tree or just getting transaction txids to store into a map?

Kixunil commented 3 years ago

I believe it's getting txids.

Kixunil commented 3 years ago

Interestingly hashing takes a lot of time in case of index queries too:

  14.72%  electrs              [.] _$LT$$u5b$u8$u3b$$u20$_$u5d$$u20$as$u20$bitcoin_hashes..hex..FromHex$GT$::from_byte_iter::h8eac85419fb461d3                                                                     
   9.64%  electrs              [.] <bitcoin_hashes::hex::HexIterator as core::iter::traits::double_ended::DoubleEndedIterator>::next_back                                                                          
   4.92%  electrs              [.] <std::collections::hash::map::DefaultHasher as core::hash::Hasher>::write                                                                                                       
   3.34%  electrs              [.] <serde_json::read::StrRead as serde_json::read::Read>::parse_str                                                                                                                
   2.89%  electrs              [.] <serde_json::read::SliceRead as serde_json::read::Read>::ignore_str                                                                                                             
   2.67%  electrs              [.] core::fmt::num::<impl core::fmt::LowerHex for i8>::fmt                                                                                                                          
   2.53%  electrs              [.] hashbrown::map::HashMap<K,V,S>::contains_key                                                                                                                                    
   2.40%  electrs              [.] electrs::mempool::Mempool::sync

So optimizing it sounds like double-win. Also maybe the idea of putting an offset into index wasn't bad.

Kixunil commented 3 years ago

Another measurement (after electrs restart):

  47.43%  electrs              [.] bitcoin_hashes::sha256::HashEngine::process_block                                                                                                                               
   8.27%  electrs              [.] ZSTD_decompressBlock_internal.part.16                                                                                                                                           
   4.89%  libc-2.28.so         [.] __memcpy_ssse3                                                                                                                                                                  
   4.06%  electrs              [.] HUF_decompress4X2_usingDTable_internal                                                                                                                                          
   3.35%  libc-2.28.so         [.] _int_malloc                                                                                                                                                                     
   2.89%  electrs              [.] HUF_decompress4X4_usingDTable_internal                                                                                                                                          
   2.19%  libc-2.28.so         [.] malloc_consolidate                                                                                                                                                              
   1.58%  [kernel]             [k] copy_user_enhanced_fast_string                                                                                                                                                  
   1.48%  electrs              [.] rocksdb::crc32c::crc32c_3way                                                                                                                                                    
   1.45%  libc-2.28.so         [.] _int_free                                                                                                                                                                       
   1.20%  electrs              [.] <std::io::Take<T> as std::io::Read

Kixunil commented 3 years ago

Idea: exploit the common case that some outputs of txses are changes which presumably get queried in the same request.

Example: A user has an address A and a change address B. He received some sats to A and later made a payment with change going to B. Presumably his wallet will request address histories of A and B in a single request. Requesting information from bitcoind multiple times should be avoidable.

How to use this information: When a batched request comes, we first get all heights we want to scan and de-duplicate them. Then we fetch all those blocks from Core. Maybe it makes sense to pre-process earlier-received blocks while waiting for others. Then we filter the transactions and look if an input of any of them spends an output of some from the same batch. If we found it we will use this information and don't try to do any other such lookups. Then we fetch the remaining information.

This could ironically cause larger gap limits to perform better than smaller. (Currently too large gap limits can cause electrs to not work at all.)

TheBlueMatt commented 3 years ago

Interestingly hashing takes a lot of time in case of index queries too:

I don't see any hashing here? Only from-hex conversion.

Kixunil commented 3 years ago

I copied it too late. The second measurement is better.

dpc commented 3 years ago

When a batched request comes, we first get all heights we want to scan and de-duplicate them. Then we fetch all those blocks from Core. Maybe it makes sense to pre-process earlier-received blocks while waiting for others.

Wouldn't a simple LRU cache get this and potentially other scenarios taken care of with a minimum memory investment? Additionally it's simple to fine tune LRU size to trade memory for performance.

Kixunil commented 3 years ago

Hmm probably yes if it's shared among all threads that are doing requests.

dpc commented 3 years ago

Hmm probably yes if it's shared among all threads that are doing requests.

Having it all go through one instance of Arc<dyn BlockDataProvider + Send + Sync> (from my comment on caching) one could make it do stuff like:

keep track of requests in flight and thus scheduling and merging requests
optional caching,
stats & metrics
per-IP block-fetch based rate limiting and fairness

by composing different parts of it as nested instances implementing this trait (service interface). Eventually of course.

Kixunil commented 3 years ago

Yeah, something like that should work. Presumably everything should return data wrapped in Arc to avoid copying blocks/transactions but that shouldn't be an issue.

Kixunil commented 3 years ago

Based on the discussion in https://github.com/rust-bitcoin/rust-bitcoin/pull/680 I'd like to try this approach:

For each transaction store the transaction index in the database as u16
When requesting a block, deserialize it without allocation - just seek to given transaction index. If the index is u16::MAX and there's more transactions in a block than u16::MAX try hashing transactions in the tail like we do today.
Hash the transaction (output) to check if it's the transaction we are interested in.
Extract important information from the transaction to satisfy the query (e.g. ignore signatures)
Make sure we're requesting non-witness blocks - no need to process witnesses.

romanz commented 3 years ago

Sounds great :)

romanz commented 3 years ago

Maybe we can compute the "global offset" of a transaction using the following trick (IIRC, it is used by ElectrumX):

assume the transaction txid is confirmed at block height k at byte offset f (within the block).
denote the block size @ height h as block_size[h].
global_offset[txid] = sum(block_size[h] for h in range(k)) + f

We can store sum(block_size[h] for h in range(k)) for each confirmed block (at height k), so we can compute k and f quickly given global_offset[txid].

Currently we use 4 bytes to represent the confirmation height for each index row.

5 bytes (40 bits) should be enough to represent ~1.1TB of blocks.
6 bytes (48 bits) should be enough to represent ~281TB of blocks.

Since the global_offset is the last value within RocksDB key, we can just drop the most-significant zeroes during serialization - so it automatically use the smallest number of bytes needed to represent the offset correctly. For example, 0x0000001122334455 can be serialized as b"\x55\x44\x33\x22\x11" (using only 5 bytes, dropping the 3 most-significant zero bytes).

Also, note that mainnet index has ~4B rows today - so using a larger offset is not expected to increase the DB size by too much.

It actually reminds me of the initial approach in #308, but using global_offset, instead of blk*.dat (file index, file offset) pair :)

romanz commented 3 years ago

* Make sure we're requesting non-witness blocks - no need to process witnesses.

Not sure it's possible to read non-witness blocks directly from disk :( https://github.com/bitcoin/bitcoin/blob/224e90d9fdf895d3ee212edcf7dec3eb4d94ce91/src/net_processing.cpp#L1810-L1819

It would be cool though to allow asking a part of a block via the p2p protocol :)

Kixunil commented 3 years ago

I'm currently in mental state incapable of understanding what your proposal with global_offset achieves. But regarding asking non-witness blocks, I meant acting like an old, non-segwit, node to which bitcoind doesn't send witnesses (because it can't). Didn't check the code deeply to say whether it's already the case but I suspect it isn't.

dpc commented 3 years ago

I'm currently in mental state incapable of understanding what your proposal with global_offset achieves.

I had to re-read it again, and my understanding is that it allows skipping decoding the whole block, and instead one can decode single tx at a given offset. The cost would be rather small increase in index size, and binary search in the `height to offset" index (I guess it can be read in-memory, etc. to make it much faster).

Kixunil commented 3 years ago

I feel better now and my understanding is that for each transaction its offset within whole chain is stored and then offset withing block computed. What I fail to understand is why storing global offset is helpful. Why not just store the offset within block directly? 3B can address > 16M blocks and is just one more byte compared to my suggestion.

romanz commented 3 years ago

The main issue is the additional storage requirements... Note that since RocksDB rows are sorted by key, prefix compression is very effective for the first bytes (which are similar between adjacent rows), but doesn't work well the the last bytes. So adding 3 bytes * 4B rows, will probably increase the DB by 12GB :(

Kixunil commented 3 years ago

My understanding is global offset would increase it by 4bytes * 4B rows + global offset. Am I missing something?

romanz commented 3 years ago

I thought replacing the (block_height, tx_offset) with a single global_offset :) However, I am not sure how the design above will handle reorgs correctly... We can definitely start with (block_height, tx_offset) for simplicity - to see how the query performance behaves with the new approach.

romanz commented 3 years ago

We can later find a more efficient way of encoding (block_height, tx_offset), e.g. by using 3 bytes for block_height, and 3 bytes for tx_offset - so the DB size will increase by 2 bytes * 4B rows = 8GB.

Kixunil commented 3 years ago

I see now, that's a nice trick! Yes, I think having an experimental branch with simpler approach will be helpful in measuring the improvement.

dpc commented 3 years ago

However, I am not sure how the design above will handle reorgs correctly...

I'm not sure what exactly the problem is, but one way or another since reorgs affect only couple of last blocks and are generally rare, whatever code can fall back to look for stuff linearly block by block from highest block.

romanz commented 3 years ago

However, I am not sure how the design above will handle reorgs correctly...

For example, suppose that we store (block_height, tx_offset) for a given scripthash - and then the block at block_height is reorged, and replaced by a totally different block. Given a query for scripthash, we will fetch the new block, and (most probably) fail to desererialize a valid transaction (since we use the old tx_offset).

whatever code can fall back to look for stuff linearly block by block from highest block.

This would be probably the simplest solution :)

Kixunil commented 3 years ago

Alternative proposal: post on r/rust that electrs is slower than ElectrumX which is in Python and let others do the work. I'm highly surprised @dpc didn't suggest it here.

romanz commented 3 years ago

An idea - for better reorg handling, we can save for each block a list of its transactions' sizes in bytes.

For example, 000000000003ba27aa200b1cecaad478d2b00432346c3f1f3986da1afd33e506 has 4 transactions, so we'll store their sizes in a list: [135, 259, 257, 225].

This way, each index entry will point to a (block_height, transaction_index) so the exact block offset of any transaction can be derived from the list above, by summing the transaction byte sizes up to transaction_index. For example, if we need to retrieve the funding transactions of 1H8ANdafjpqYntniT3Ddxh4xPBMCSz33pj, the funding index will return (100000, 2), so we will fetch block 100000 via p2p as a binary blob. The relevant transaction can be parsed from block[80+135+259:80+135+259+257] without parsing the rest of the block.

Currently, there are ~700M Bitcoin transactions, so I don't expect the additional data to take significantly more storage (compared to the existing DB size of ~32GB).

In case of a reorg, the parsing will probably fail (and we'll issue a warning about that), but it should be OK to skip those transactions.

romanz commented 3 years ago

Since we save a transaction_index (instead of a byte offset) for each index entry, we can probably use only 2 bytes (instead of 3), so the additional storage will be smaller - compared to https://github.com/romanz/electrs/issues/547#issuecomment-950269545.

romanz commented 3 years ago

OTOH, maybe it's even simpler to just store a "giant map" (using a new RocksDB column family), mapping:

transaction_global_index -> transaction_byte_offset_within_its_block

transaction_global_index is computed by enumerating all Bitcoin transactions using the blockchain order. It can be serialized as a VarInt (to save storage space).
transaction_byte_offset_within_its_block is computed as described above: https://github.com/romanz/electrs/issues/547#issuecomment-976213878

In addition, we also store the number of transactions in each block to each row in the headers column family. This way, in case of an reorg, we will just "overwrite" the stale index entries :)

romanz commented 3 years ago

The additional storage requirements are not expected to be large, since the "map" is sorted by its key (so it should compress well). The value (transaction_byte_offset_within_its_block) can be represented using 3 bytes - and possibly less if serialized as a VarInt. The cool part is that we now replace the block_height (4 bytes, stored in each index entry) with transaction_global_index, which can also represented using 4 bytes - supporting up to 4,294,967,296 transactions (before we need to reindex).

dpc commented 3 years ago

I did not fully understand last reorg handling comments, but just wanted to note that reorgs are so extremely rare, that any tradeoff should be at reorg handling expense, and benefiting the happy path.

romanz / electrs

Feature: optimize p2p block parsing #547