valkey-io / valkey

A new project to resume development on the formerly open-source Redis project. We're calling it Valkey, since it's a twist on the key-value datastore.

Investigate performance improvements related to slot ownership in getNodeByQuery #632

Open madolson opened 3 weeks ago

madolson commented 3 weeks ago

While playing around with profiling various workloads, I noticed that getNodeByQuery was quite slow, taking about 4% of the CPU of a given command. After a deep dive, I realized that we are basically checking three 128 KiB (16384 * 8 bytes) blocks of memory to determine who owns the slot being accessed, whether that slot is migrating, and whether that slot is being imported. The amount of data far exceeds L1 cache sizes and will also exceed most L2 caches, so we are likely experiencing a large number of cache misses because of these lookups. If you remove all three of those bottlenecks, you get about a 5% overall performance improvement.
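For reference, the pattern in question looks roughly like the sketch below (simplified types; the real clusterState has many more fields, and the exact field names may differ slightly from the source):

```c
#define CLUSTER_SLOTS 16384

typedef struct clusterNode clusterNode; /* opaque for this sketch */

/* Three parallel 16384-entry pointer arrays, 128 KiB each. */
typedef struct clusterState {
    clusterNode *slots[CLUSTER_SLOTS];                /* slot -> owning node */
    clusterNode *migrating_slots_to[CLUSTER_SLOTS];   /* slot -> migration target */
    clusterNode *importing_slots_from[CLUSTER_SLOTS]; /* slot -> import source */
} clusterState;

/* Per command, getNodeByQuery ends up reading all three arrays for the
 * accessed slot, which typically means three cold cache lines. */
```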

My thought is that we can use the slot bits on the clusterNode object to determine if we own the slot. We assume that most of the time the client is going to send requests to the right node. We can also keep a smaller structure of all the nodes we are importing from and migrating to, so we don't have to check the map on every lookup.
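Roughly what I have in mind, as a minimal sketch (simplified types; nodeOwnsSlot and canServeLocally are hypothetical names, and the real clusterNode carries far more state):

```c
#define CLUSTER_SLOTS 16384

/* Simplified stand-in for clusterNode: the node already keeps a per-slot
 * ownership bitmap of 16384 bits = 2 KiB, which stays hot in cache. */
typedef struct clusterNode {
    unsigned char slots[CLUSTER_SLOTS / 8];
} clusterNode;

static int nodeOwnsSlot(const clusterNode *n, int slot) {
    return (n->slots[slot / 8] >> (slot & 7)) & 1;
}

/* Fast path: when this node owns the slot and nothing is migrating or
 * importing, route locally without ever touching the three 128 KiB
 * per-slot arrays. */
static int canServeLocally(const clusterNode *myself, int slot,
                           int transitions_in_flight) {
    return !transitions_in_flight && nodeOwnsSlot(myself, slot);
}
```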

Potential follow-up to https://github.com/valkey-io/valkey/pull/631.

adetunjii commented 3 weeks ago

@madolson do you mind if we work on this together? I'm not exactly sure how some of these things work but we can definitely work through it together.

PingXie commented 3 weeks ago

I realized that we are basically checking three 128 KiB (16384 * 8 bytes) blocks of memory to determine who owns the slot being accessed

Aren't these point look-ups? The total size shouldn't matter, but I see your point that there will be cache misses given the way these 3 arrays are arranged. I wonder if we should group the 3 states (ownership/migrating/importing) by the slot, i.e., replacing them with a 2-d array like slot_states[16384][3].
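Something like the following, as a rough sketch (slotState is a made-up name; today's code keeps three parallel arrays instead):

```c
#define CLUSTER_SLOTS 16384

typedef struct clusterNode clusterNode; /* opaque for this sketch */

/* All three per-slot states live side by side, so resolving one slot
 * touches a single cache line instead of three. */
typedef struct slotState {
    clusterNode *owner;     /* node that owns the slot */
    clusterNode *migrating; /* node the slot is migrating to, or NULL */
    clusterNode *importing; /* node the slot is importing from, or NULL */
} slotState;

/* Same total data as today (16384 * 24 bytes = 384 KiB), just regrouped. */
static slotState slot_states[CLUSTER_SLOTS];
```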

Also while checking out getNodeByQuery, I noticed that keys can be hashed more than once on the data serving path with per-slot dict.

madolson commented 2 weeks ago

Aren't these point look-ups? The total size shouldn't matter, but I see your point that there will be cache misses given the way these 3 arrays are arranged.

They are! The problem is that basically every lookup into these arrays is an L2 cache miss (and often an L3 + TLB miss), which is very expensive (even compared to hashing).

I wonder if we should group the 3 states (ownership/migrating/importing) by the slot, i.e., replacing them with a 2-d array like slot_states[16384][3].

I'm going to assume that slot_states still points to the dictionary, so they are still 8-byte values. This is one option that I briefly prototyped, and it definitely helps: we are basically doing only one main memory lookup to get the slot_states entry, which pulls in the cache line that has all three values on it. Ideally we would find a way not to spend ~384 KiB on slot mappings, but if there isn't a good solution, I think this is a good fallback.

Also while checking out getNodeByQuery, I noticed that keys can be hashed more than once on the data serving path with per-slot dict.

This shockingly isn't that expensive (or wasn't obviously a bottleneck while profiling), since it's all in L1 cache at this point; the actual hashing is fast for most normal workloads where keys are very small.

madolson commented 2 weeks ago

@madolson do you mind if we work on this together? I'm not exactly sure how some of these things work but we can definitely work through it together.

I'm happy to help bounce ideas around. I don't exactly have a solution in mind, but it would be nice to have some folks prototype some of the ideas.

adetunjii commented 2 weeks ago

Okay. You can assign it to me then.

PingXie commented 2 weeks ago

The problem is that basically every lookup into these arrays is an L2 cache miss (and often an L3 + TLB miss), which is very expensive (even compared to hashing).

Makes sense. Good point on TLB misses too, which I think could very likely happen.

Ideally we would find a way not to spend ~384 KiB on slot mappings, but if there isn't a good solution, I think this is a good fallback.

I think these can be parallel efforts: cache line optimization and size optimization. Even if we find a way to shrink the slot_states by 64x (switching to a bitmap, for instance), being able to group the 3 states by slots is still valuable.
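Just to make the 64x concrete (illustrative declarations only, not the actual layout in the code):

```c
#define CLUSTER_SLOTS 16384

/* Pointer per slot, as today: 16384 * 8 bytes = 128 KiB per table. */
void *owner_by_slot[CLUSTER_SLOTS];

/* Bit per slot: 16384 / 8 = 2 KiB, a 64x reduction that fits
 * comfortably in L1/L2. */
unsigned char owner_bitmap[CLUSTER_SLOTS / 8];
```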

This shockingly isn't that expensive (or wasn't obviously a bottleneck while profiling), since it's all in L1 cache at this point

Makes sense.

madolson commented 2 weeks ago

I think these can be parallel efforts: cache line optimization and size optimization. Even if we find a way to shrink the slot_states by 64x (switching to a bitmap, for instance), being able to group the 3 states by slots is still valuable.

I am advocating for deleting the importing_from and migrating_to tables, since they are very large and generally empty :), and coming up with a more space-optimized version, which is why I saw them as mutually exclusive. If there is no good way to do that though, we could optimize the cache line then.
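One possible shape for that (names and capacity are purely illustrative): keep only the slots that are actually in transition in a small list, empty in steady state, instead of two 128 KiB per-slot tables.

```c
#include <stddef.h>

typedef struct clusterNode clusterNode; /* opaque for this sketch */

/* One entry per slot currently being migrated away or imported. */
typedef struct slotTransition {
    int slot;
    clusterNode *peer; /* destination (migrating) or source (importing) */
    int importing;     /* 0 = migrating to peer, 1 = importing from peer */
} slotTransition;

/* Empty almost always, so the common case is a single length check; a
 * linear scan over a handful of cache-resident entries during resharding
 * is still cheap. Fixed capacity just keeps the sketch short. */
static slotTransition transitions[64];
static size_t num_transitions = 0;

static slotTransition *findTransition(int slot) {
    for (size_t i = 0; i < num_transitions; i++)
        if (transitions[i].slot == slot) return &transitions[i];
    return NULL;
}
```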

PingXie commented 2 weeks ago

they are very large and generally empty

That is true. Though with our current thinking on atomic slot migration, this might get less empty in the future. O(1) lookup is a really nice property :-).

If there is no good way to do that though, we could optimize the cache line then.

Agreed.