Attention: Patch coverage is 92.59259% with 2 lines in your changes missing coverage. Please review.

Project coverage is 71.62%. Comparing base (566e926) to head (baa9a11). Report is 9 commits behind head on master.
| Files | Patch % | Lines |
|---|---|---|
| core/store/src/trie/mem/flexible_data/children.rs | 95.00% | 1 Missing :warning: |
| core/store/src/trie/mem/node/encoding.rs | 83.33% | 1 Missing :warning: |
Interesting results!
This PR applies a set of small changes which all ultimately add up to some appreciable wins in the performance of the `memtrie_lookup` function.
Although the two profiles I'm about to reference come from distinct workloads (each is a 60-second recording of a node tracking mainnet, taken 2 days apart on different machines), the contextual information suggests that the two profiles are still roughly comparable (in particular, the duration of `node_hash` and `decode` is pretty similar between the two). Here's the old run, where you can see that `view_impl` is taking a relatively obscene amount of time in relation to what it is doing. In particular, samples were showing that reading the one header byte accounted for ~70% of the total runtime! And here's the profile after these changes. You can see `view_impl` (now `view_kind`) still taking a substantial amount of time, but the overall time for this function and for `memtrie_lookup` itself is appreciably lower.

I have root-caused the majority of the performance problem to the double dereference needed to execute `get_kind` and other memtrie data accesses: it was causing the CPU to stall, as the loaded data was immediately used to execute an unpredictable multi-way branch (the `match`). That meant the CPU paid a penalty worth a full memory-access latency for each execution of this `view_impl` function.
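To make the pattern concrete, here is a minimal sketch of the kind of access described above. The names and layout are hypothetical (nearcore's actual memtrie types differ); it only illustrates the two dependent loads feeding an unpredictable branch:

```rust
// Hypothetical sketch, not nearcore's real code.
#[derive(Clone, Copy)]
struct NodePtr(*const u8);

#[derive(Clone, Copy, Debug)]
enum NodeKind { Leaf, Extension, Branch, BranchWithValue }

struct NodeHandle<'a> {
    // First load: read the pointer out of the handle...
    ptr: &'a NodePtr,
}

impl NodeHandle<'_> {
    fn get_kind(&self) -> NodeKind {
        // ...second, dependent load: read the one header byte it points to.
        let tag = unsafe { *self.ptr.0 };
        // The consuming `match` is a multi-way branch that cannot be
        // resolved until both loads have completed.
        match tag & 0b11 {
            0 => NodeKind::Leaf,
            1 => NodeKind::Extension,
            2 => NodeKind::Branch,
            _ => NodeKind::BranchWithValue,
        }
    }
}
```

The second load cannot begin until the first completes, and the branch cannot be predicted reliably, so a cache miss on either load stalls the pipeline on every call.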
I have explored using memory prefetching instructions (`prefetcht*` on x86_64) but found them difficult to apply in a way that gives the desired results. The alternative approach of splitting the loop in `to_children` was much more successful. Loading all of the (up to) 16 kinds through the double dereference into an array (fully pipelined, with no data dependencies) in the first loop gives the CPU sufficiently early notice of where the data will be coming from. The code then moves on to process the nodes in sequence as before, but by the time it gets there (or, in the worst case, by the time it gets around to processing the 2nd element), the data for all the children will already have been loaded into the caches.
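A sketch of the split-loop shape, reusing the hypothetical `NodeHandle`/`NodeKind` types from the sketch above (again illustrative, not the actual nearcore code):

```rust
// Hypothetical sketch of the two-pass loop split described above.
fn process_children(children: &[NodeHandle<'_>]) {
    debug_assert!(children.len() <= 16);
    let mut kinds = [NodeKind::Leaf; 16];
    // Pass 1: issue all the double-dereference loads up front. No iteration
    // depends on the result of a previous one, so the loads pipeline fully.
    for (i, child) in children.iter().enumerate() {
        kinds[i] = child.get_kind();
    }
    // Pass 2: the branchy per-child processing, as before; by the time this
    // runs, the children's data is already warm in the caches.
    for &kind in &kinds[..children.len()] {
        match kind {
            NodeKind::Leaf => { /* process a leaf */ }
            NodeKind::Extension => { /* process an extension */ }
            NodeKind::Branch | NodeKind::BranchWithValue => { /* process a branch */ }
        }
    }
}
```

The design point is that pass 1 contains no branches that depend on the loaded data, so the out-of-order machinery can keep all 16 loads in flight at once instead of serializing a load-then-branch pair per child.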
One somewhat less obvious change is the one I made to the `target_feature` list. I was seeing that LLVM was not using `popcnt` for `count_ones`, and then realized that since we require `sha` anyway, I might as well explicitly specify all the other features that are ~guaranteed to exist in implementations where SHA-NI is already available. I have avoided enabling the big ones such as `avx`, but the others, to the best of my knowledge, are fine wine.
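For illustration, this is roughly what opting a function into additional target features looks like in Rust. The mechanism nearcore actually uses (an attribute like this versus build-time `-Ctarget-feature` flags) and the exact feature list here are assumptions, not taken from this PR:

```rust
// Hypothetical illustration: enabling baseline features that any
// SHA-NI-capable CPU also has, so that LLVM lowers `count_ones` to a single
// `popcnt` instruction instead of the generic bit-twiddling fallback.
// The feature list is illustrative, not the exact set from this PR.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sha,popcnt,sse4.1,sse4.2")]
unsafe fn count_children(bitmap: u16) -> u32 {
    // Compiles to one `popcnt` instruction with the feature enabled.
    bitmap.count_ones()
}
```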