valkey-io / valkey

A flexible distributed key-value datastore that supports both caching and beyond caching workloads.
https://valkey.io
Other
14.88k stars 530 forks source link

Introduce new rax function `raxAllocSize` to return rax tree allocation size in constant time #677

Open kyle-yh-kim opened 1 month ago

kyle-yh-kim commented 1 month ago

The problem/use-case that the feature addresses

Originally, the issue was opened in the Redis project to provide more visibility in slotToKey overhead [issue link].

A customer reached out to ask why their memory usage is very high. The customer had 40 bytes values and roughly 900 million keys. They compute the memory usage by doing key size * keyspace size. They observed a near 200% memory usage overhead on their production Redis server. A triage done by our team showed that over half of the overhead was from the slotToKey rax tree.

As a small experiment. I took the rax.c and rax.h files and compiled them into a small program. Inserting 484 Million keys into the rax tree yielded 48GB of memory usage. This is about 99 bytes per key.

The proposed solution would display the slotToKey rax tree as an overhead in the MEMORY STATS command output. This would allow Redis users to inspect and understand their memory usage in more detail.

However, since version 7.4, slotToKey rax tree was replaced with kvstore->dicts[slot]. While the original concern is no longer relevant, there still exist multiple motivations for introducing raxAllocSize, namely;

  1. Limiting stream growth

    • This was based on @madolson's comment. There has been multiple asks to support limiting a stream based of its size, for which raxAllocSize will enable.
  2. Per-slot memory metric

    • Building on top of the accurate stream size calculation, this will enable Valkey cluster to accurately track its per-slot memory usage.
  3. Detailed MEMORY STATS

    • Introduce a new field called overhead.rax, for tracking internal rax tree used to manage Valkey server. Examples include; acl, aof, active defrag, config and networking.

Description of the feature

Alternatives you've considered

N/A.

Additional information

This is a revision of an already opened issue that never closed from the Redis project.

zuiderkwast commented 1 month ago

Sure, go for it!

Just a thought: Is rax good for streams or should we consider replacing it with something else?

We've replaced rax in various places and now it's almost only used in streams. But maybe it's actually really good for timestamps? (Common prefixes)

knggk commented 1 month ago

Hi Viktor, can you give examples of where rax was replaced? Being sorted and providing range operations, I am guessing it's been replaced by a structure with those abilities as well (skiplist?).

What would be a good test to consider replacing rax on timestamps for streams? Or asked differently, what motivated the switch away from rax in the other places?

zuiderkwast commented 1 month ago

@knggk It was used for slot-to-key mapping. That was replaced by storing the keys in one dict per slot. Same for shard pubsub channels.

Some other internal usages where replaced by dict i think, but i don't remember exactly. A sorted structure was not needed in those cases.

knggk commented 1 month ago

@zuiderkwast Understood. I am not deeply familiar with streams, but it does require sorted elements to handle range operations (for example on timestamps in xrange, iterated on here https://github.com/valkey-io/valkey/blob/32ca6e5b38bfc4a15878c2b6ff3d2c71da1026e3/src/t_stream.c#L1802-L1804). So dict would not be an option. Another option though could be skiplist. (Not sure how Valkey's skiplist implementation is usable outside of sorted sets.)

zuiderkwast commented 1 month ago

@knggk Sure, dict would not be an option for streams. Purely hypothetically, imagine btree or some other exotic structure. Radix trees are good for storing keys with common prefixes. Timestamps do have a long common prefix (when they're stored in big endian, which we do) so I believe it's a pretty good choice for streams. But after that, Salvatore started using the radix trees for various other things, where they may not have been the best data structure to use.

knggk commented 1 month ago

@zuiderkwast Ok that makes sense. Thank you for providing that context and asking whether rax still makes sense.

I've cut an initial attempt at adding the functionality (salvaged a previous attempt). Happy to take any feedback before adding tests.

madolson commented 1 month ago

@knggk As part of your PR, or many someone else, can you consider porting https://github.com/antirez/rax/blob/master/rax-test.c over into our test base so we can write proper unit tests around the allocation tracking?

knggk commented 1 month ago

@madolson Done updating the PR with rax-test.c. Doing make SERVER_TEST=yes and then src/valkey-server test rax ends with OK! \o/.

Sounds like we're missing: 1) Integrating rax-test into the build system/github workflow 2) Plugging raxAllocSize into mem_usage so it shows for streams under show memory 3) Tests that demonstrate improved memory tracking for streams?

We can split the work with @kyle-yh-kim when we figure out how two people can work on a PR.

On 1) and 3) I don't have precise ideas so far, input very welcome.

zuiderkwast commented 1 month ago

@knggk Where's the PR?

Did you use the new unit test framework under src/unit/? If yes, then it's already included in the CI.

knggk commented 1 month ago

@zuiderkwast It's here https://github.com/valkey-io/valkey/pull/688. I tried to mention this issue (677) in that pull request thinking that's how you link an issue and a PR. Maybe I am confused.

Re src/unit, thanks! that's the piece of insight I was looking for. I will move rax-test.c under src/unit/test_rax.c and do the required adaptations.

Another issue I've faced: On a Linux dev machine, make valkey-unit-tests throws errors at link time:

    ARCHIVE libvalkey.a                                               
ar: threads_mngr.o: plugin needed to handle lto object
ar: adlist.o: plugin needed to handle lto object 
ar: quicklist.o: plugin needed to handle lto object
ar: ae.o: plugin needed to handle lto object 
...
    LINK valkey-unit-tests                                                                                                                                                                                                                                                               
/tmp/ccv5bGmV.ltrans0.ltrans.o: In function `freeTestCallback':                                                                                                                                                                                                                          
/workplace/valkey/src/unit/test_kvstore.c:10: undefined reference to `zfree'                                                                 
/tmp/ccv5bGmV.ltrans0.ltrans.o: In function `test_reclaimFilePageCache':
...

Note these failures are only on building the unit tests, not on building eg valkey-server.

I don't have issues linking unit tests on local Mac Sonoma (I am continuing there for now).

zuiderkwast commented 1 month ago

@zuiderkwast It's here #688. I tried to mention this issue (677) in that pull request thinking that's how you link an issue and a PR. Maybe I am confused.

Ah, I see now "knggk mentioned this issue 2 days ago". Great. Even better: If you write "Fixes #677" or "Closes #677" in the PR description, the PR with be linked to the issue and appear on the sidebar as a PR that will automatically close the issue when the PR is merged.

I'll look at the PR when I have some time. Thanks!

knggk commented 4 weeks ago

Didn't know about Fixes or Closes. Updated the text there. I can see it appearing in the side panel now as you were saying. We can continue the discussion in the PR. Thanks Viktor!