stacks-network / stacks-core

The Stacks blockchain implementation
https://docs.stacks.co
GNU General Public License v3.0

[Optimization] Replace `BTreeMap`s in `clarity` with `hashbrown::HashMap` #4464

Open cylewitruk opened 8 months ago

cylewitruk commented 8 months ago

Creating a ticket for further discussion of the following from @ASuciuX on Slack:

Hello, I had a sync with Jeff. Basically, from the profiling data, TupleData::from_data() and the BTreeMap ops inside it take over 5% of the total time on the run with blocks around 100k block height. I'm thinking there may be places where we can replace BTreeMap with HashMap or IndexMap to improve on that. Are there specific places where we would want BTreeMap? It would be great if you could share them, and why it is relevant to have the keys sorted in those places.

Tagging @jcnelson and @kantai who also chimed in on the above.

A number of types in the clarity crate make use of BTreeMap.

The usage of BTreeMap is primarily to preserve ordering of keys. However, ordering of keys is only really required when serializing consensus-critical types to their SIP-5 consensus binary format, which in turn is hashed and stored as a part of the chainstate.

Given that, it should be possible to replace most, if not all, usages of BTreeMap with HashMaps, and perform a sort-on-consensus-serialization.
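The sort-on-consensus-serialization idea could look roughly like this. This is a minimal sketch with toy types and a made-up byte layout, not clarity's actual SIP-005 encoding: the map stays unordered for fast inserts and lookups, and keys are sorted only at the point where deterministic consensus bytes are produced.

```rust
use std::collections::HashMap;

// Hypothetical sketch: serialize an unordered map deterministically by
// sorting keys only at serialization time (the byte layout here is
// illustrative, not the real SIP-005 format).
fn consensus_serialize(fields: &HashMap<String, u64>) -> Vec<u8> {
    // Collect and sort keys so the output is independent of the map's
    // internal iteration order.
    let mut keys: Vec<&String> = fields.keys().collect();
    keys.sort();
    let mut out = Vec::new();
    for k in keys {
        out.extend_from_slice(k.as_bytes());
        out.extend_from_slice(&fields[k].to_be_bytes());
    }
    out
}

fn main() {
    let mut a = HashMap::new();
    a.insert("b".to_string(), 2u64);
    a.insert("a".to_string(), 1u64);

    let mut b = HashMap::new();
    b.insert("a".to_string(), 1u64);
    b.insert("b".to_string(), 2u64);

    // Different insertion orders, identical consensus bytes.
    assert_eq!(consensus_serialize(&a), consensus_serialize(&b));
}
```

hashbrown::HashMap is API-compatible with std's HashMap for everything used here, so the same pattern applies to either implementation.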

I did some benchmarking of different map implementations, and here's the result:

|  | btree | hashmap (std) | hashmap (hashbrown) | indexmap |
|---|---|---|---|---|
| insert | 653.46 us (✅ 1.00x) | 611.74 us (✅ 1.07x faster) | 597.16 us (✅ 1.09x faster) | 581.33 us (✅ 1.12x faster) |
| to sorted vec | 43.13 us (✅ 1.00x) | 78.96 us (❌ 1.83x slower) | 71.72 us (❌ 1.66x slower) | 79.61 us (❌ 1.85x slower) |
| random lookups | 81.21 us (✅ 1.00x) | 29.87 us (🚀 2.72x faster) | 16.08 us (🚀 5.05x faster) | 30.70 us (🚀 2.65x faster) |

While BTreeMap excels at retrieving ordered entries, the other implementations offer a marginal improvement for inserts and marked improvements for random reads. Given the assumption that inserts and random reads largely outweigh sorted enumeration, choosing another implementation may offer measurable timing improvements.
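For concreteness, the "to sorted vec" row corresponds to the difference between the two toy operations below (using std's maps; hashbrown::HashMap behaves the same as std's HashMap here): a BTreeMap yields entries in key order for free, while an unordered map must collect into a Vec and sort explicitly.

```rust
use std::collections::{BTreeMap, HashMap};

// Ordered iteration comes for free with a BTreeMap.
fn sorted_via_btree(pairs: &[(&'static str, i32)]) -> Vec<(&'static str, i32)> {
    let btree: BTreeMap<_, _> = pairs.iter().cloned().collect();
    btree.into_iter().collect()
}

// An unordered map pays an explicit collect-and-sort, which is the cost
// the "to sorted vec" benchmark row measures.
fn sorted_via_hash(pairs: &[(&'static str, i32)]) -> Vec<(&'static str, i32)> {
    let hash: HashMap<_, _> = pairs.iter().cloned().collect();
    let mut v: Vec<_> = hash.into_iter().collect();
    v.sort();
    v
}

fn main() {
    let pairs = [("c", 3), ("a", 1), ("b", 2)];
    assert_eq!(sorted_via_btree(&pairs), vec![("a", 1), ("b", 2), ("c", 3)]);
    assert_eq!(sorted_via_btree(&pairs), sorted_via_hash(&pairs));
}
```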

I put together a little playground showing some potential alternatives, in addition to the benchmarks.

kantai commented 8 months ago

The current hotpath for the BTreeMaps in block execution seems to be TupleData::from_data(). This performs a lot of random reads, inserts, and reallocs.

I think before considering replacing the BTreeMap structs (which are not pure wins: serialization would be slower!), we should be benchmarking improvements to that hotpath, e.g.:

    pub fn from_data(data: Vec<(ClarityName, Value)>) -> Result<TupleData> {
        let type_map = data.iter().map(|(name, value)| {
            Ok((name.clone(), TypeSignature::type_of(value)?))
        }).collect::<Result<BTreeMap<_, _>>>()?;
        if type_map.len() != data.len() {
            // some name was already used but we don't know which... can iterate over data to find it if necessary.
            return Err(CheckErrors::NameAlreadyUsed("Unclear!".into()).into());
        }
        let data_map = data.into_iter().collect();
        Self::new(TupleTypeSignature::try_from(type_map)?, data_map)
    }
cylewitruk commented 8 months ago

> I think before considering replacing the BTreeMap structs (which are not pure wins: serialization would be slower!), we should be benchmarking improvements to that hotpath

Good input with the example! :+1: I personally feel that both of these are small enough to pursue in parallel and bench -- both shouldn't take more than a couple hours tbh, and I'd love to see the results of both.

Then we can re-awaken the discussion regarding risk of potential hash collisions here as well before anything gets merged :pray:

jbencin commented 8 months ago

@kantai I have an open PR on that, #4458

Also, I've tried using the collect::<Result<_<_,_>>>()? pattern a few times. It doesn't optimize well, and is noticeably slower than a for loop. Rust iterators don't always optimize as well as you'd expect; flat_map() is another example. Maybe because those can't take advantage of size_hint()?
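A hedged illustration of the two styles being compared (toy types and a made-up validation step, not the actual TupleData code). One plausible reason the loop can win: it can pre-allocate with Vec::with_capacity up front, whereas std's internal adapter for collecting through a Result reports a size_hint lower bound of zero (it may stop early on an Err), so the collector cannot promise a full pre-allocation.

```rust
// Fallible transform written iterator-style, collecting through Result.
fn via_collect(data: &[(String, u32)]) -> Result<Vec<(String, u32)>, String> {
    data.iter()
        .map(|(name, v)| {
            if *v == 0 {
                Err(format!("bad value for {name}"))
            } else {
                Ok((name.clone(), *v * 2))
            }
        })
        .collect()
}

// The same transform as an explicit for loop with early return; here the
// full capacity can be reserved up front.
fn via_loop(data: &[(String, u32)]) -> Result<Vec<(String, u32)>, String> {
    let mut out = Vec::with_capacity(data.len());
    for (name, v) in data {
        if *v == 0 {
            return Err(format!("bad value for {name}"));
        }
        out.push((name.clone(), *v * 2));
    }
    Ok(out)
}

fn main() {
    let data = vec![("a".to_string(), 1u32), ("b".to_string(), 2)];
    // Both styles agree on success and on failure.
    assert_eq!(via_collect(&data), via_loop(&data));
    assert!(via_collect(&[("z".to_string(), 0)]).is_err());
}
```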

kantai commented 8 months ago

> @kantai I have an open PR on that, https://github.com/stacks-network/stacks-core/pull/4458
>
> Also, I've tried using the collect::<Result<_<_,_>>>()? pattern a few times. It doesn't optimize well, and is noticeably slower than a for loop. Rust iterators don't always optimize as well as you'd expect; flat_map() is another example. Maybe because those can't take advantage of size_hint()?

Did you try collect() here? It seems like BTreeMap FromIterator should be much more efficient than repeated inserts: it presorts the insertion and does a bulk insert, whereas even .entry() -> .insert() actually incurs the costs of self-balancing each entry.
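The two construction paths being compared can be sanity-checked with a toy example. This only verifies that they produce equal maps; the performance claim above (FromIterator presorting and bulk-building vs. per-entry rebalancing) concerns std's internal implementation and is what would need benchmarking.

```rust
use std::collections::BTreeMap;

// Bulk construction via FromIterator/collect.
fn bulk(pairs: &[(&'static str, i32)]) -> BTreeMap<&'static str, i32> {
    pairs.iter().cloned().collect()
}

// Incremental construction: each insert may pay a rebalancing cost.
fn incremental(pairs: &[(&'static str, i32)]) -> BTreeMap<&'static str, i32> {
    let mut map = BTreeMap::new();
    for (k, v) in pairs {
        map.insert(*k, *v);
    }
    map
}

fn main() {
    let pairs = [("b", 2), ("a", 1), ("c", 3)];
    // Identical contents either way; only the construction cost differs.
    assert_eq!(bulk(&pairs), incremental(&pairs));
}
```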

jbencin commented 8 months ago

> Did you try collect() here?

I'm not sure if I did, but I can check again, and I can check the source of BTreeMap's FromIterator impl to see what it's actually doing.

jbencin commented 8 months ago

@kantai it's slower using .collect():

❯ hyperfine -w 3 -r 20 "target/release/stacks-inspect.opt-tuple-data replay-block /home/jbencin/data/next/ range 99990 100000" "target/release/stacks-inspect.opt-tuple-data-collect replay-block /home/jbencin/data/next/ range 99990 100000"
Benchmark 1: target/release/stacks-inspect.opt-tuple-data replay-block /home/jbencin/data/next/ range 99990 100000
  Time (mean ± σ):      8.874 s ±  0.020 s    [User: 8.372 s, System: 0.457 s]
  Range (min … max):    8.836 s …  8.912 s    20 runs

Benchmark 2: target/release/stacks-inspect.opt-tuple-data-collect replay-block /home/jbencin/data/next/ range 99990 100000
  Time (mean ± σ):      9.160 s ±  0.026 s    [User: 8.663 s, System: 0.454 s]
  Range (min … max):    9.104 s …  9.210 s    20 runs

Summary
  target/release/stacks-inspect.opt-tuple-data replay-block /home/jbencin/data/next/ range 99990 100000 ran
    1.03 ± 0.00 times faster than target/release/stacks-inspect.opt-tuple-data-collect replay-block /home/jbencin/data/next/ range 99990 100000

I don't know what's going on here; maybe the data is already sorted, and the extra Vec allocation and sort in .collect() just wastes time.

It's been pretty disappointing benchmarking iterator performance, as I much prefer coding in this style over for loops and if statements.