romanz opened this issue 3 years ago
I'm thinking of the best way to do this and here are my thoughts:

Store the block data as `Vec<u8>` and use references to those data. One could use the `bytes` crate, but I think there's a sane way to avoid doing an atomic operation for each transaction and output. We could build a safe abstraction for a cache that can store such things. I think my idea should not be too hard to review regarding safety.

Unrelated, but I'm playing with the `perf` tool (`apt install perf-tools-unstable`) and got this during sync (`perf top -p <electrs_pid>`):
```
 28.90%  electrs       [.] bitcoin_hashes::sha256::HashEngine::process_block
 27.16%  electrs       [.] rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::RecomputeSpliceLevels
  6.93%  libc-2.28.so  [.] __memcmp_sse4_1
  5.76%  electrs       [.] rocksdb::MemTable::KeyComparator::operator()
```
Will try bitcoin eater address once it syncs.
Would it be possible to share a flamegraph please?
I'd prefer not to run bespoke unverified commands on my mainnet instance. Looks like it could be solved with this: https://github.com/flamegraph-rs/flamegraph/issues/52
Would be interesting to see where the hashing is - are you verifying the merkle tree or just getting transaction txids to store into a map?
I believe it's getting txids.
Interestingly, hashing takes a lot of time in the case of index queries too:
```
 14.72%  electrs  [.] _$LT$$u5b$u8$u3b$$u20$_$u5d$$u20$as$u20$bitcoin_hashes..hex..FromHex$GT$::from_byte_iter::h8eac85419fb461d3
  9.64%  electrs  [.] <bitcoin_hashes::hex::HexIterator as core::iter::traits::double_ended::DoubleEndedIterator>::next_back
  4.92%  electrs  [.] <std::collections::hash::map::DefaultHasher as core::hash::Hasher>::write
  3.34%  electrs  [.] <serde_json::read::StrRead as serde_json::read::Read>::parse_str
  2.89%  electrs  [.] <serde_json::read::SliceRead as serde_json::read::Read>::ignore_str
  2.67%  electrs  [.] core::fmt::num::<impl core::fmt::LowerHex for i8>::fmt
  2.53%  electrs  [.] hashbrown::map::HashMap<K,V,S>::contains_key
  2.40%  electrs  [.] electrs::mempool::Mempool::sync
```
So optimizing it sounds like a double win. Also, maybe the idea of putting an offset into the index wasn't bad.
Another measurement (after electrs restart):
```
 47.43%  electrs       [.] bitcoin_hashes::sha256::HashEngine::process_block
  8.27%  electrs       [.] ZSTD_decompressBlock_internal.part.16
  4.89%  libc-2.28.so  [.] __memcpy_ssse3
  4.06%  electrs       [.] HUF_decompress4X2_usingDTable_internal
  3.35%  libc-2.28.so  [.] _int_malloc
  2.89%  electrs       [.] HUF_decompress4X4_usingDTable_internal
  2.19%  libc-2.28.so  [.] malloc_consolidate
  1.58%  [kernel]      [k] copy_user_enhanced_fast_string
  1.48%  electrs       [.] rocksdb::crc32c::crc32c_3way
  1.45%  libc-2.28.so  [.] _int_free
  1.20%  electrs       [.] <std::io::Take<T> as std::io::Read
```
Idea: exploit the common case that some outputs of transactions are change outputs which presumably get queried in the same request.
Example: a user has an address `A` and a change address `B`. He received some sats to `A` and later made a payment with change going to `B`. Presumably his wallet will request the address histories of `A` and `B` in a single request. Requesting information from bitcoind multiple times should be avoidable.
How to use this information: when a batched request comes, we first get all heights we want to scan and de-duplicate them. Then we fetch all those blocks from Core. Maybe it makes sense to pre-process earlier-received blocks while waiting for the others. Then we filter the transactions and check whether an input of any of them spends an output of another transaction from the same batch. If we find one, we use this information and don't try to do any other such lookups. Then we fetch the remaining information (a rough sketch follows below).
This could ironically cause larger gap limits to perform better than smaller. (Currently too large gap limits can cause electrs to not work at all.)
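A rough sketch of the de-duplication and intra-batch spend resolution described above, using hypothetical stand-in types (`Txid`, `OutPoint`, `Tx`) rather than the real electrs/rust-bitcoin ones:

```rust
use std::collections::{BTreeSet, HashMap};

// Hypothetical stand-ins for the real rust-bitcoin types.
type Txid = [u8; 32];
type OutPoint = (Txid, u32); // (funding txid, output index)

struct Tx {
    txid: Txid,
    spends: Vec<OutPoint>, // outpoints consumed by this transaction's inputs
}

// De-duplicate the block heights requested by a batched query.
fn dedup_heights(requested: &[u32]) -> BTreeSet<u32> {
    requested.iter().copied().collect()
}

// For inputs that spend outputs of transactions from the same batch,
// record the spender so no extra bitcoind lookup is needed for them.
fn resolve_intra_batch(batch_txs: &[Tx]) -> HashMap<OutPoint, Txid> {
    let in_batch: BTreeSet<Txid> = batch_txs.iter().map(|tx| tx.txid).collect();
    let mut resolved = HashMap::new();
    for tx in batch_txs {
        for outpoint in &tx.spends {
            if in_batch.contains(&outpoint.0) {
                resolved.insert(*outpoint, tx.txid);
            }
        }
    }
    resolved
}
```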
> Interestingly, hashing takes a lot of time in the case of index queries too:
I don't see any hashing here? Only from-hex conversion.
I copied it too late. The second measurement is better.
> When a batched request comes, we first get all heights we want to scan and de-duplicate them. Then we fetch all those blocks from Core. Maybe it makes sense to pre-process earlier-received blocks while waiting for the others.
Wouldn't a simple LRU cache handle this and potentially other scenarios with a minimal memory investment? Additionally, it's simple to fine-tune the LRU size to trade memory for performance.
Hmm probably yes if it's shared among all threads that are doing requests.
> Hmm probably yes if it's shared among all threads that are doing requests.
Having it all go through one instance of `Arc<dyn BlockDataProvider + Send + Sync>` (from my comment on caching), one could make it do stuff like that by composing different parts of it as nested instances implementing this trait (service interface). Eventually, of course.
Yeah, something like that should work. Presumably everything should return data wrapped in `Arc` to avoid copying blocks/transactions, but that shouldn't be an issue.
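A minimal sketch of what such a composable provider could look like - the trait name comes from the comment above; everything else (the plain `HashMap` standing in for a real bounded LRU, the stand-in `Block`/`BlockHash` types) is only illustrative:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Stand-ins for the real rust-bitcoin types; error handling is omitted.
type BlockHash = [u8; 32];
type Block = Vec<u8>; // raw block bytes

// Hypothetical service interface, as discussed above.
trait BlockDataProvider {
    fn get_block(&self, hash: &BlockHash) -> Arc<Block>;
}

// A caching layer composed around any other provider (e.g. one fetching from bitcoind).
struct CachingProvider<P: BlockDataProvider> {
    inner: P,
    // A real implementation would want a size-bounded LRU; a HashMap keeps the sketch short.
    cache: Mutex<HashMap<BlockHash, Arc<Block>>>,
}

impl<P: BlockDataProvider> BlockDataProvider for CachingProvider<P> {
    fn get_block(&self, hash: &BlockHash) -> Arc<Block> {
        if let Some(block) = self.cache.lock().unwrap().get(hash) {
            return Arc::clone(block); // only the Arc is cloned, not the block bytes
        }
        let block = self.inner.get_block(hash);
        self.cache.lock().unwrap().insert(*hash, Arc::clone(&block));
        block
    }
}
```

The composed instance could then be shared among the request-handling threads as the `Arc<dyn BlockDataProvider + Send + Sync>` mentioned above.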
Based on the discussion in https://github.com/rust-bitcoin/rust-bitcoin/pull/680 I'd like to try this approach: store each transaction's position within its block as a `u16`. If the stored value is `u16::MAX` and there are more transactions in the block than `u16::MAX`, try hashing the transactions in the tail like we do today.

Sounds great :)
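If I read the proposal right, the encoding could look roughly like this (illustrative only, not actual electrs code):

```rust
// Encode a transaction's position within its block as a u16, reserving
// u16::MAX as a sentinel for positions that don't fit; entries carrying the
// sentinel would fall back to hashing the tail transactions, as today.
fn encode_position(pos: usize) -> u16 {
    if pos >= u16::MAX as usize {
        u16::MAX // not representable: use the fallback path
    } else {
        pos as u16
    }
}
```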
Maybe we can compute the "global offset" of a transaction using the following trick (IIRC, it is used by ElectrumX): suppose `txid` is confirmed at block height `k`, at byte offset `f` (within the block), and denote the size of the block at height `h` as `block_size[h]`. Then:

```
global_offset[txid] = sum(block_size[h] for h in range(k)) + f
```

We can store `sum(block_size[h] for h in range(k))` for each confirmed block (at height `k`), so we can compute `k` and `f` quickly given `global_offset[txid]`.
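A small sketch of that arithmetic, assuming the cumulative block sizes are kept per height (`partition_point` is just one way to do the reverse lookup):

```rust
/// cumulative[k] = sum(block_size[h] for h in range(k)), i.e. the global
/// offset of the first byte of block k; cumulative[0] == 0.
fn global_offset(cumulative: &[u64], height: usize, offset_in_block: u64) -> u64 {
    cumulative[height] + offset_in_block
}

/// Recover (height k, in-block offset f) from a global offset via binary search.
fn locate(cumulative: &[u64], global: u64) -> (usize, u64) {
    // Index of the first block whose start offset exceeds `global`,
    // minus one: the block that contains this global offset.
    let k = cumulative.partition_point(|&start| start <= global) - 1;
    (k, global - cumulative[k])
}

fn main() {
    let block_sizes = [215u64, 490, 300]; // made-up example sizes
    let mut cumulative = vec![0u64];
    for size in block_sizes {
        cumulative.push(cumulative.last().unwrap() + size);
    }
    let g = global_offset(&cumulative, 1, 42); // tx at height 1, byte offset 42
    assert_eq!(locate(&cumulative, g), (1, 42));
}
```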
Currently we use 4 bytes to represent the confirmation height for each index row.
Since the `global_offset` is the last value within the RocksDB key, we can just drop the most-significant zeroes during serialization - so it automatically uses the smallest number of bytes needed to represent the offset correctly. For example, `0x0000001122334455` can be serialized as `b"\x55\x44\x33\x22\x11"` (using only 5 bytes, dropping the 3 most-significant zero bytes).
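For illustration, one way that variable-length serialization could look (not necessarily how electrs implements it):

```rust
// Encode the offset little-endian and drop the most-significant zero bytes
// (the trailing bytes in little-endian order); this is only safe because the
// offset is the last component of the RocksDB key.
fn serialize_offset(offset: u64) -> Vec<u8> {
    let bytes = offset.to_le_bytes();
    let used = 8 - offset.leading_zeros() as usize / 8;
    bytes[..used.max(1)].to_vec() // keep at least one byte for offset == 0
}

fn main() {
    assert_eq!(serialize_offset(0x0000001122334455), b"\x55\x44\x33\x22\x11".to_vec());
}
```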
Also, note that the mainnet index has ~4B rows today - so using a larger offset is not expected to increase the DB size by too much.
It actually reminds me of the initial approach in #308, but using `global_offset` instead of a `blk*.dat` (file index, file offset) pair :)
> * Make sure we're requesting non-witness blocks - no need to process witnesses.
Not sure it's possible to read non-witness blocks directly from disk :( https://github.com/bitcoin/bitcoin/blob/224e90d9fdf895d3ee212edcf7dec3eb4d94ce91/src/net_processing.cpp#L1810-L1819
It would be cool though to allow asking for a part of a block via the p2p protocol :)
I'm currently in a mental state incapable of understanding what your proposal with `global_offset` achieves. But regarding asking for non-witness blocks, I meant acting like an old, non-segwit node to which bitcoind doesn't send witnesses (because it can't). I didn't check the code deeply enough to say whether it's already the case, but I suspect it isn't.
> I'm currently in a mental state incapable of understanding what your proposal with `global_offset` achieves.
I had to re-read it again, and my understanding is that it allows skipping decoding the whole block; instead one can decode a single tx at a given offset. The cost would be a rather small increase in index size, and a binary search in the "height to offset" index (I guess it can be kept in memory, etc. to make it much faster).
I feel better now, and my understanding is that for each transaction its offset within the whole chain is stored, and then its offset within the block is computed. What I fail to understand is why storing the global offset is helpful. Why not just store the offset within the block directly? 3 bytes can address blocks larger than 16 MB, and it's just one more byte compared to my suggestion.
The main issue is the additional storage requirements... Note that since RocksDB rows are sorted by key, prefix compression is very effective for the first bytes (which are similar between adjacent rows), but doesn't work well for the last bytes. So adding 3 bytes * 4B rows will probably increase the DB by 12GB :(
My understanding is that the global offset would increase it by 4 bytes * 4B rows, plus the stored per-block global offsets. Am I missing something?
I thought of replacing the `(block_height, tx_offset)` with a single `global_offset` :) However, I am not sure how the design above will handle reorgs correctly... We can definitely start with `(block_height, tx_offset)` for simplicity - to see how the query performance behaves with the new approach.

We can later find a more efficient way of encoding `(block_height, tx_offset)`, e.g. by using 3 bytes for `block_height` and 3 bytes for `tx_offset` - so the DB size will increase by 2 bytes * 4B rows = 8GB.
I see now, that's a nice trick! Yes, I think having an experimental branch with the simpler approach will be helpful in measuring the improvement.
> However, I am not sure how the design above will handle reorgs correctly...
I'm not sure what exactly the problem is, but one way or another, since reorgs affect only a couple of the last blocks and are generally rare, whatever code there is can fall back to looking for stuff linearly, block by block, from the highest block.
> However, I am not sure how the design above will handle reorgs correctly...
For example, suppose that we store `(block_height, tx_offset)` for a given `scripthash` - and then the block at `block_height` is reorged and replaced by a totally different block. Given a query for that `scripthash`, we will fetch the new block and (most probably) fail to deserialize a valid transaction (since we use the old `tx_offset`).
> whatever code there is can fall back to looking for stuff linearly, block by block, from the highest block.
This would probably be the simplest solution :)
Alternative proposal: post on r/rust that `electrs` is slower than ElectrumX, which is in Python, and let others do the work. I'm highly surprised @dpc didn't suggest it here.
An idea - for better reorg handling, we can save for each block a list of its transactions' sizes in bytes.
For example, 000000000003ba27aa200b1cecaad478d2b00432346c3f1f3986da1afd33e506 has 4 transactions, so we'll store their sizes in a list: `[135, 259, 257, 225]`.
This way, each index entry will point to a `(block_height, transaction_index)`, so the exact block offset of any transaction can be derived from the list above by summing the transaction byte sizes up to `transaction_index`.
For example, if we need to retrieve the funding transactions of 1H8ANdafjpqYntniT3Ddxh4xPBMCSz33pj, the funding index will return `(100000, 2)`, so we will fetch block 100000 via p2p as a binary blob. The relevant transaction can be parsed from `block[80+135+259 : 80+135+259+257]` without parsing the rest of the block.
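A tiny sketch of that offset computation; the 80-byte header constant matches the example above (a fully accurate version would also account for the transaction-count varint that follows the header in a serialized block):

```rust
// Sizes (in bytes) of the block's transactions, in block order,
// e.g. [135, 259, 257, 225] for block 100000 as above.
fn tx_byte_range(tx_sizes: &[u64], tx_index: usize) -> std::ops::Range<u64> {
    const HEADER_SIZE: u64 = 80; // the block header precedes the transactions
    let start = HEADER_SIZE + tx_sizes[..tx_index].iter().sum::<u64>();
    start..start + tx_sizes[tx_index]
}

fn main() {
    let sizes = [135, 259, 257, 225];
    // Index entry (100000, 2): the third transaction of block 100000.
    assert_eq!(tx_byte_range(&sizes, 2), 474..731); // 80+135+259 .. +257
}
```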
Currently, there are ~700M Bitcoin transactions, so I don't expect the additional data to take significantly more storage (compared to the existing DB size of ~32GB).
In case of a reorg, the parsing will probably fail (and we'll issue a warning about that), but it should be OK to skip those transactions.
Since we save a `transaction_index` (instead of a byte offset) for each index entry, we can probably use only 2 bytes (instead of 3), so the additional storage will be smaller - compared to https://github.com/romanz/electrs/issues/547#issuecomment-950269545.
OTOH, maybe it's even simpler to just store a "giant map" (using a new RocksDB column family), mapping:

```
transaction_global_index -> transaction_byte_offset_within_its_block
```

* `transaction_global_index` is computed by enumerating all Bitcoin transactions using the blockchain order. It can be serialized as a `VarInt` (to save storage space).
* `transaction_byte_offset_within_its_block` is computed as described above: https://github.com/romanz/electrs/issues/547#issuecomment-976213878

In addition, we also store the number of transactions in each block in each row of the `headers` column family.
This way, in case of a reorg, we will just "overwrite" the stale index entries :)

The additional storage requirements are not expected to be large, since the "map" is sorted by its key (so it should compress well). The value (`transaction_byte_offset_within_its_block`) can be represented using 3 bytes - and possibly less if serialized as a `VarInt`.
The cool part is that we now replace the `block_height` (4 bytes, stored in each index entry) with `transaction_global_index`, which can also be represented using 4 bytes - supporting up to 4,294,967,296 transactions (before we need to reindex).
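A rough sketch of how a lookup could work under this scheme, modeling the new column family with a `BTreeMap` and assuming cumulative per-block transaction counts are available (the names are illustrative, not actual electrs code):

```rust
use std::collections::BTreeMap;

// cumulative_tx_counts[h] = total number of transactions in blocks 0..h,
// i.e. the transaction_global_index of the first transaction of block h
// (derivable from the per-block transaction counts stored with the headers).
// offset_map models the proposed column family:
// transaction_global_index -> byte offset of the transaction within its block.
fn locate_tx(
    cumulative_tx_counts: &[u64],
    offset_map: &BTreeMap<u64, u32>,
    global_index: u64,
) -> Option<(usize, u32)> {
    let height = cumulative_tx_counts
        .partition_point(|&first| first <= global_index)
        .checked_sub(1)?;
    let offset = *offset_map.get(&global_index)?;
    Some((height, offset))
}

fn main() {
    // Two blocks: block 0 has 1 transaction, block 1 has 3.
    let cumulative = [0u64, 1, 4];
    let mut offsets = BTreeMap::new();
    offsets.insert(2u64, 345u32); // the second tx of block 1 starts at byte 345
    assert_eq!(locate_tx(&cumulative, &offsets, 2), Some((1, 345)));
}
```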
I did not fully understand the last reorg-handling comments, but I just wanted to note that reorgs are so extremely rare that any tradeoff should be made at reorg handling's expense, benefiting the happy path.
**Is your feature request related to a problem? Please describe.**
https://github.com/romanz/electrs/pull/546#issuecomment-940740615

**Additional context**
We currently use https://docs.rs/bitcoin/0.27.1/bitcoin/network/stream_reader/struct.StreamReader.html which can hopefully be optimized for our use-case.