An API to allow iterating over keys in a KVS without reading values

gsramdasi commented 2 years ago

Currently an application can use the cursor API to read through keys in a range. But the cursor also reads values. If the application only wants to read keys it would be useful to have a way to iterate through just the keys which would be more performant and may also avoid polluting the page cache.

tristan957 commented 2 years ago

This is currently planned for 2.1.0 I believe. We will be adding a cursor create flag for it.

gsramdasi commented 2 years ago

This could also be a cursor read flag so we skip values only for specific reads

gsramdasi commented 2 years ago

Also, this RFC should explore another option with cursors - A way for the application to pass in a buffer for keys and/or values. i.e. the application owns the buffers.

tristan957 commented 2 years ago

Wait, I didn't even realize this was Gaurav :).

gsramdasi commented 2 years ago

There was a user request for a way to read just the keys and not values using cursors (only-keys cursor reads). A related feature we’ve considered adding is to allow the caller to pass in buffers for keys and/or values (user-buffer cursor reads). The user-buffer cursor reads should also have a way to read just keys and not values.

So we will have 2 variants:

Default reads: We use our own buffers as we currently do. Also, we do not want to change this API’s signature.
Copyout reads: The key and/or value are copied out into the user’s buffer. This would need to be a new API since there are additional arguments to the function involved.

Similarity with point gets: The copyout variant is very similar to a hse_kvs_get() call where the caller passes in a buffer for the value. In the get call if the value buffer is NULL, the get turns into a probe operation. i.e. it doesn’t return the value, but whether the value was found or not and the value length. We could do something similar in the copyout cursor read when we just need to read the keys i.e. set the value buffer to NULL. This does seem more natural than using a flag like in Option 1. However, we cannot do this for the default read variant and it would need to use a flag. Now the two cursor read variants are inconsistent. Probably just something worth considering.

Since this is an API change, I wanted to get everyone’s opinion on how the APIs should be updated. I talked to @alexttx and here are two options we thought would make sense. Please let me know what you think and feel free to suggest a third option.

Option 1:

A new cursor read API is added that accepts the caller’s buffers. For both APIs the caller can pass in a flag so we read just the key.

#define HSE_CURSOR_READ_NOVALUE 1 // Flag passed to both variants

// Unchanged
hse_err_t
hse_kvs_cursor_read(
    struct hse_kvs_cursor *cursor,
    unsigned int           flags,
    const void **          key,
    size_t *               key_len,
    const void **          val,
    size_t *               val_len,
    bool *                 eof);

// The copyout variant
hse_err_t
hse_kvs_cursor_read_copy(
    struct hse_kvs_cursor *cursor,
    unsigned int           flags,
    const void *           key,
    size_t *               key_len,
    size_t                 key_buf_size,
    const void *           val,
    size_t *               val_len,
    size_t                 val_buf_size,
    bool *                 eof);

Option 2:

A separate entry point for reading just the keys for both variants instead of using a flag.

// Unchanged
hse_err_t
hse_kvs_cursor_read(
    struct hse_kvs_cursor *cursor,
    unsigned int           flags,
    const void **          key,
    size_t *               key_len,
    const void **          val,
    size_t *               val_len,
    bool *                 eof);

// Read only keys. HSE owns the buffers.
hse_err_t
hse_kvs_cursor_readkey(
    struct hse_kvs_cursor *cursor,
    unsigned int           flags,
    const void **          key,
    size_t *               key_len,
    bool *                 eof);

// Copyout variant. Read key and value.
hse_err_t
hse_kvs_cursor_read_copy(
    struct hse_kvs_cursor *cursor,
    unsigned int           flags,
    const void *           key,
    size_t *               key_len,
    size_t                 key_buf_size,
    const void *           val,
    size_t *               val_len,
    size_t                 val_buf_size,
    bool *                 eof);

// Copyout variant. Read just the key.
hse_err_t
hse_kvs_cursor_readkey_copy(
    struct hse_kvs_cursor *cursor,
    unsigned int           flags,
    const void *           key,
    size_t *               key_len,
    size_t                 key_buf_size,
    bool *                 eof);

tristan957 commented 2 years ago

Personally, I like option one much better. I think I like HSE_CURSOR_READ_KEY_ONLY instead of HSE_CURSOR_READ_NOVALUE.

Just spitballing an idea. Is there a way to make the read_copy() API the goto API for cursor reads where we could mark the original read() API as deprecated?

Potentially a better name than read_copy()? Maybe people know from other APIs they have used? Actual API content looks fine to me, although I might put key_buf_sz before key_len just to tie it to the buffer better. Same for val_buf_sz.

tristan957 commented 2 years ago

Not sure how this would look at hse-python level. I have an idea, but not important for now.

beckerg commented 2 years ago

Is there any reason the caller can't just pass in NULL for val in the existing API to effect a key read?

gsramdasi commented 2 years ago

The val in the existing API is a double pointer and it's set by hse to point to the region of memory that holds the value - i.e. it's an output. If we interpret val=NULL as read-only-key, the caller will have to explicitly set it to some dummy non-NULL value when it does want to read the value.

beckerg commented 2 years ago

I don't understand.. In the latter case it's never NULL and must be the address of a pointer in which hse will stash the ptr to the internal buffer. What am I missing? I'm not saying key on *val being NULL, I'm saying pass in NULL for val. In case I wasn't very clear...

gsramdasi commented 2 years ago

Yes, of course. That makes sense. So for either API, if val is set to NULL, we'd return just the key. No need for the flag.

// unchanged
hse_err_t
hse_kvs_cursor_read(
    struct hse_kvs_cursor *cursor,
    unsigned int           flags,
    const void **          key,
    size_t *               key_len,
    const void **          val,
    size_t *               val_len,
    bool *                 eof);

// The copyout variant
hse_err_t
hse_kvs_cursor_read_copy(
    struct hse_kvs_cursor *cursor,
    unsigned int           flags,
    const void *           key,
    size_t *               key_len,
    size_t                 key_buf_size,
    const void *           val,
    size_t *               val_len,
    size_t                 val_buf_size,
    bool *                 eof);

alexttx commented 2 years ago

I think the read and read_copy without flags makes the most sense.

beckerg commented 2 years ago

I don't care for the "read_copy" function. I'm thinking something along the lines of hse_kvs_cursor_readv() but give me some time to flesch it out.. Something like:

hse_err_t hse_kvs_cursoror_readv( struct hse_kvs_cursor cursor, uint flags, struct iovec iov, int iovcnt, bool *eof);

where iovec[0] describes a buffer into which to read the key, and iovec[1] describe a buffer into which to read the value. If iovcnt is 1 then it only reads the key. Either way, both only the length specified by the iovec is filled by the key and/or value.

tristan957 commented 2 years ago

What is struct iovec? I see. I like the idea Greg, but it seems weird to have a Linux-specific public API. Is there an equivalent for other systems?

alexttx commented 2 years ago

Do callers really need iovec support? Let's not use iovecs just because you don't like the name read_copy. None of our APIs use iovecs, so this would stand out like a sore thumb.

IMO KISS favors read and read_copy as Gaurav spelled out.

beckerg commented 2 years ago

Can we get rid of key_buf_size and val_buf_size and make key_len and val_len value-result parameters? Also, I think we need to lose the const from the read-copy() calls.. I really hate read-copy... readx(), read2(), readbuf() ??? The "read-copy()" form should also support valbuf=NULL, that way I can easily read all the keys directly out of hse and into my own buffer, thereby avoiding a buffer copy.

tristan957 commented 2 years ago

Is there any inspiration from RocksDB which the API could pull from?

alexttx commented 2 years ago

"_read_copy" make sense to me since it does an actual copy vs the "_read" version which does not.

gsramdasi commented 2 years ago

Thanks everyone. I've posted a PR for this change. See https://github.com/hse-project/hse/pull/105

smoyergh commented 2 years ago

@gsramdasi, do you plan to turn this into an RFC? That's probably overkill, and you've already merged the PR.

hse-project / rfcs

An API to allow iterating over keys in a KVS without reading values #6

Option 1:

Option 2: