Attention: Patch coverage is 92.59259% with 2 lines in your changes missing coverage. Please review.

Project coverage is 71.62%. Comparing base (566e926) to head (baa9a11). Report is 9 commits behind head on master.
| Files | Patch % | Lines |
|---|---|---|
| core/store/src/trie/mem/flexible_data/children.rs | 95.00% | 1 Missing :warning: |
| core/store/src/trie/mem/node/encoding.rs | 83.33% | 1 Missing :warning: |
Interesting results!
This PR applies a set of small changes which all ultimately add up to some appreciable wins in the performance of the `memtrie_lookup` function.
Although the two profiles I'm about to reference come from distinct workloads (each is a 60-second recording of a node tracking mainnet, taken 2 days apart on different machines), the contextual information suggests that the two profiles are still roughly comparable (in particular, the duration of `node_hash` and `decode` is pretty similar between the two). Here's the old run, where you can see that `view_impl` is taking a relatively obscene amount of time in relation to what it is doing. In particular, samples were showing that reading the one header byte accounted for ~70% of the total runtime! And here's the profile after these changes. You can see `view_impl` (now `view_kind`) still taking a substantial amount of time, but the overall time for this function and for `memtrie_lookup` itself is appreciably lower.

I have root-caused the majority of the performance problem to the double dereference needed to execute `get_kind` and other memtrie data accesses: it was causing the CPU to stall, as the loaded data was immediately used to execute an unpredictable multi-way branch (the `match`). That meant the CPU paid a penalty worth a full memory-access latency for each execution of this `view_impl` function.
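To make the pattern concrete, here is a minimal sketch of the kind of access described above. The names and layout are hypothetical (nearcore's actual memtrie types differ); it only illustrates the two dependent loads feeding an unpredictable branch:

```rust
// Hypothetical sketch, not nearcore's real code.
#[derive(Clone, Copy)]
struct NodePtr(*const u8);

#[derive(Clone, Copy, Debug)]
enum NodeKind { Leaf, Extension, Branch, BranchWithValue }

struct NodeHandle<'a> {
    // First load: read the pointer out of the handle...
    ptr: &'a NodePtr,
}

impl NodeHandle<'_> {
    fn get_kind(&self) -> NodeKind {
        // ...second, dependent load: read the one header byte it points to.
        let tag = unsafe { *self.ptr.0 };
        // The consuming `match` is a multi-way branch that cannot be
        // resolved until both loads have completed.
        match tag & 0b11 {
            0 => NodeKind::Leaf,
            1 => NodeKind::Extension,
            2 => NodeKind::Branch,
            _ => NodeKind::BranchWithValue,
        }
    }
}
```

The second load cannot begin until the first completes, and the branch cannot be predicted reliably, so a cache miss on either load stalls the pipeline on every call.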
I have explored using memory prefetching instructions (`prefetcht*` on x86_64) but found them difficult to apply in a way that gives the desired results. The alternative approach of splitting the loop in `to_children` was much more successful. Loading all of the (up to) 16 kinds through the double dereference into an array (fully pipelined, with no data dependencies) in the first loop gives the CPU sufficiently early notice of where the data will be coming from. The code then moves on to process the nodes in sequence as before, but by the time it gets there (or, in the worst case, by the time it gets around to processing the 2nd element), the data for all the children will already have been loaded into the caches.
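A sketch of the split-loop shape, reusing the hypothetical `NodeHandle`/`NodeKind` types from the sketch above (again illustrative, not the actual nearcore code):

```rust
// Hypothetical sketch of the two-pass loop split described above.
fn process_children(children: &[NodeHandle<'_>]) {
    debug_assert!(children.len() <= 16);
    let mut kinds = [NodeKind::Leaf; 16];
    // Pass 1: issue all the double-dereference loads up front. No iteration
    // depends on the result of a previous one, so the loads pipeline fully.
    for (i, child) in children.iter().enumerate() {
        kinds[i] = child.get_kind();
    }
    // Pass 2: the branchy per-child processing, as before; by the time this
    // runs, the children's data is already warm in the caches.
    for &kind in &kinds[..children.len()] {
        match kind {
            NodeKind::Leaf => { /* process a leaf */ }
            NodeKind::Extension => { /* process an extension */ }
            NodeKind::Branch | NodeKind::BranchWithValue => { /* process a branch */ }
        }
    }
}
```

The design point is that pass 1 contains no branches that depend on the loaded data, so the out-of-order machinery can keep all 16 loads in flight at once instead of serializing a load-then-branch pair per child.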
One somewhat less obvious change is the one I made to the `target_feature` list. I was seeing that LLVM was not using `popcnt` for `count_ones`, and then realized that since we require `sha` anyway, I might as well explicitly specify all the other features that are ~guaranteed to exist in implementations where SHA-NI is already available. I have avoided enabling the big ones such as `avx`, but the others, to the best of my knowledge, are fine wine.
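For illustration, this is roughly what opting a function into additional target features looks like in Rust. The mechanism nearcore actually uses (an attribute like this versus build-time `-Ctarget-feature` flags) and the exact feature list here are assumptions, not taken from this PR:

```rust
// Hypothetical illustration: enabling baseline features that any
// SHA-NI-capable CPU also has, so that LLVM lowers `count_ones` to a single
// `popcnt` instruction instead of the generic bit-twiddling fallback.
// The feature list is illustrative, not the exact set from this PR.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sha,popcnt,sse4.1,sse4.2")]
unsafe fn count_children(bitmap: u16) -> u32 {
    // Compiles to one `popcnt` instruction with the feature enabled.
    bitmap.count_ones()
}
```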