Add `cursorId` to `ScanOptions`

spring-projects / spring-data-redis

Provides support to increase developer productivity in Java when using Redis, a key-value store. Uses familiar Spring concepts such as a template classes for core API usage and lightweight repository style data access.

https://spring.io/projects/spring-data-redis/

Apache License 2.0

1.74k stars 1.15k forks source link

Add `cursorId` to `ScanOptions` #2584

Open wendao8469759 opened 1 year ago

wendao8469759 commented 1 year ago

I have searched all the information and found that scan cannot set cursor and setting count is invalid. What is the correct way to write it here?

wendao8469759 commented 1 year ago

private Set<String> scanTargetKeys(String keyWords, Integer pageNum,Integer pageSize) {
        if (StringUtil.isEmpty(keyWords)) {
            keyWords = "*";
        }
        String finalKeyWords = keyWords;
        return redisTemplate.execute((RedisCallback<Set<String>>) connection -> {
            Set<String> keysTmp = new HashSet<>();
            ScanOptions.ScanOptionsBuilder scanOptionsBuilder = ScanOptions.scanOptions();
            scanOptionsBuilder.match(finalKeyWords);
            scanOptionsBuilder.count(pageSize);

            Cursor<byte[]> cursor = connection.scan(scanOptionsBuilder.build());
            int i = 0;
            while (cursor.hasNext()) {
                String key = new String(cursor.next());
                if (i >= (pageNum - 1) * pageSize && i < pageNum * pageSize) {
                    keysTmp.add(key);
                }
                if (i >= pageNum * pageSize) {
                    break;
                }
                i++;
            }

            return keysTmp;
        });
    }

jxblum commented 1 year ago

You might want to post this "question" on StackOverflow (here) to the [Spring Data] Redis community at large.

GitHub Issues for Spring Data Redis are for bugs, improvements and new features in SD Redis itself, or changes to the documentation, whether in the reference doc or Javadoc. This is not a QA or support forum.

Having said this, I am curious why you chose to perform a SCAN inside a RedisCallback using the RedisTemplate rather than using the RedisTemplate.scan(:ScanOptions) method (Javadoc), specifically?

The primary difference in approach here is that with RedisTemplate.scan(:ScanOptions) method (unlike your (manually executed SCAN above) will configure and use a "sticky" connection, which is likely necessary for the application to remember the prior Cursor location on subsequent CURSOR command invocations.

If you simply invoke RedisTemplate.execute(:RedisCallback<?>), then it is possible that a new connection will be acquired. Even if it is an existing (likely "pooled") connection, it is not necessarily the same connection used to perform the initial SCAN and therefore no prior state of the Cursors location would have been retained on the subsequent call.

I would simply change your code to do the following:

private Set<String> scanTargetKeys(String keyWords, Integer pageNum,Integer pageSize) {

    // ...

    ScanOptions scanOptions = ScanOptions.scanOptions()
        .match(finalKeyWords)
        .count(pageSize)
        .build()

    Cursor<byte[]> cursor = redisTemplate.scan(scanOptions);

    // iterator over the Cursor as before

}

Then when your scanTargetKeys(..) method is invoked again, this should give you the result your looking for???

If not, please provide a reproducer of the issue you are experiencing (preferably) as minimal test case or using a simple example.

wendao8469759 commented 1 year ago

There may be some issues with my description, but I actually want to implement the Redis command: scan cursor [MATCH pattern] [COUNT count] use cursor. next() is called, it starts iterating from the position of the previous cursor, rather than starting from 0. Currently, I use

Cursor<String> cursor = redisTemplate.scan(scanOptions);
cursor.next();

I found that every iteration starts from 0. If I have many redis keys, does it mean that I need to iterate from scratch to obtain the subsequent keys?

wendao8469759 commented 1 year ago

    private Set<String> scanTargetKeys2(String keyWords, Long pageNum, Long pageSize) {
        Set<String> keysTmp = new HashSet<>();
        if (StringUtil.isEmpty(keyWords)) {
            keyWords = "*";
        }
        String finalKeyWords = keyWords;

        ScanOptions scanOptions = ScanOptions.scanOptions()
                .match(finalKeyWords)
                .count(pageSize)
                .build();
        Cursor<String> cursor = redisTemplate.scan(scanOptions);
        while (cursor.hasNext()) {
            long cursorId = cursor.getCursorId();
            long position = cursor.getPosition();
            String key = cursor.next();
            System.out.println("cursorId = " + cursorId + " position = " + position);
            if (position >= (pageNum - 1) * pageSize && position < pageNum * pageSize) {
                keysTmp.add(key);
                if(keysTmp.size() == pageSize){
                    break;
                }
            }
            if (position >= pageNum * pageSize) {
                break;
            }
        }
        return keysTmp;
    }

When my pageSize=1 and pageNum=10; My loop has been executed 10 times. When my pageSize=2 and pageNum=10; My loop has been executed 20 times. By analogy, when my pageSize=100, my loop executes pageSize * pageNum times. Is this correct?

jxblum commented 1 year ago

I will need to think on this more...

I was re-reading about the Redis SCAN command (doc) and came across a few more interesting tidbits. Search for:

"Number of elements returned at every SCAN call"
"The COUNT option" (logically following the first read, and it seems COUNT may only be a "hint")
"Why SCAN may return all the items of an aggregate data type in a single call?"
...

Although, it seems like SCAN itself should do the right thing (because SCAN always uses a key space with hash tables, unlike SSCAN, HSCAN and ZSCAN)??

I need to write a test case to verify some things.

wendao8469759 commented 1 year ago

Yes, according to the explanation in the official website doc。 Time complexity: O(1) for every call. O(N) for a complete iteration, including enough command calls for the cursor to return back to 0. N is the number of elements inside the collection.

jxblum commented 1 year ago

PART 1 (of 2):

After a more careful and thorough analysis, I have learned more about Redis's behavior as well as limitations (questionably, and arguably, a bug) in Spring Data Redis.

NOTE: This will need the extended Spring Data team's attention and review before we move forward on this ticket.

The limitation (bug?) in Spring Data Redis (SD Redis) is that it does not take the current "cursor" position into account when subsequently running and repeating the Redis SCAN command.

If we trace the SCAN down from the RedisTemplate, starting here, first you see that it uses a Redis connection object to run the Redis SCAN command (here). Ultimately, SD Redis is going to use a KeyCommands implementation (either from Jedis or Lettuce, thus the "driver") to execute the Redis SCAN command using the Jedis or Lettuce connection, as seen here.

JEDIS

If we trace from Jedis, we see this block of code runs the SCAN using Jedis (Javadoc), for instance, here (typeless call).

We notice that the "cursorId" is the "cursor" argument to the Jedis.scan(cursor:byte[], :ScanOptions) call in this case. It is this "cursor" that will set the beginning for the next, or subsequent scan.

However, nothing calls this SD Redis JedisKeyCommands.scan(..) method other than the overloaded method above, which simply passes 0 for the current "cursor" position, thereby always causing the scan to start at the beginning.

LETTUCE

A similar outcome is revealed when tracing from Lettuce. Again, we see this block of code.

Ultimately, the limitation is induced by the SD Redis ScanOptions type (Javadoc), which does not give you an "option" for setting the current "cursor" position (a marker from a previous scan). This is important since the SD Redis ScanOptions gets "converted" to Lettuce ScanArgs (Javadoc) before executing the SCAN, invoked from here, and as seen here.

As you can see in the conversion, only the "pattern", "count" (LIMIT) and (optionally) type are taken into consideration, which is arguably a limitation derived from Lettuce's KeyScanArgs type (Javadoc) itself, which has no option to set the current "cursor" position.

We actually see the same limitation with Jedis when converting the SD Redis ScanOptions to Jedis ScanParams (Javadoc).

Although, like Jedis, which allows the current "cursor" position to be passed in the call to Jedis.scan(..), again as seen here, Lettuce provides the means to pass the current "cursor" position into the Lettuce driver when the SCAN is invoked, by "customizing" (or reconstructing) the already constructed Lettuce ScanCursor object (Javadoc) using the SD Redis ScanOptions object, from here to here.

SUMMARY

So, in summary, it is arguable and possible that Spring Data Redis could set the current position of the "cursor" for subsequent SCANs.

However, the question is, does it matter?

jxblum commented 1 year ago

PART 2:

Following on from PART 1 above, the question becomes, is setting the starting position of the cursor useful?

In your use case, specifically for "paging" the results, I'd argue that using the Redis SCAN command is not useful!

First of all, there is no guarantee that Redis will return the same order of the scanned results on subsequent scans.

Consider for example that you have a 6 element list of keys [ A, B, C, D, E, F ], with a page size of 2, then the cursor used to iterate the results of the SCAN is NOT guaranteed to return the same order of elements in subsequent scans (for the same pattern), even when given the current cursor position from a previous scan.

For instance, let's say the full (first) SCAN yields [ B, C, A, D, F, E ], then on the first SCAN (page size / LIMIT of 2) it would return [B, C] for the first page, [A, D] for the second page, and [F, E] for the third and final page.

However, if you run only a SCAN to get the first page [B, C], then rerun the SCAN to get the second page with the previous returned cursor position and count (LIMIT) of 2 to get the second page, it might return [B, A] where B already appeared on the first page from the initial scan.

Additionally, as I previously mentioned, the count (LIMIT) of SCAN is just a "suggestion", and actually might return 1 element, or perhaps 3 or more elements, even when the page size (count/LIMIT) was set to 2.

Again, this was called out in the Redis SCAN command docs, specifically in the "Scan guarantees", "Number of elements returned at every SCAN call" and "The COUNT option" sections.

EXAMPLE

I spent the better part of yesterday writing tests in a new GitHub Repository (redis-experiments) I setup to experiment with, learn and test the behaviors of Jedis, Lettuce and Spring Data Redis, independently as well as together (for example: Spring Data Redis with either Jedis or Lettuce).

In the Jedis-based RedisScanIntegrationTests (source) I have 2 test case methods that test the 2 problems I cited above: a test case for order and a test case for limit.

Of course, the test case to test order (scanIsConsistent) also tests limit.

Both tests can pass, however, most of the time, they fail.

While it is mostly easy to restrain the count/LIMIT, particularly if the results exceed your specified count/LIMIT (not so much when it is less) it is not possible to control ORDER.

Nothing in the docs for the Redis SCAN command reveal anything about the order in which the keys from the SCAN will be returned. Simply search for "order".

NOTE: FTR, I did try to write an equivalent test class using Spring Data Redis with both Jedis and Lettuce (default), to experiment with the Redis SCAN command, but after encountering the "limitation", it was, IMO, a pointless effort. Ultimately, It would not have mattered anyway given the behavior of the Redis SCAN command as I have described in PART 2.

CONCLUSION

So, in conclusion, I think your use of Redis's SCAN command for paging is misplaced. Redis is simply not that smart.

jxblum commented 1 year ago

@mp911de - Can you please provide additional insights and thoughts on this matter?

Specifically, what are your thoughts on support the current cursor position?

It is possible, but will only yield a small benefit.

I am reasonably confident my analysis is correct, but I recognize that my knowledge of Redis is not complete and may have missed something important.

Thank you!

mp911de commented 1 year ago

When my pageSize=1 and pageNum=10; My loop has been executed 10 times. When my pageSize=2 and pageNum=10; My loop has been executed 20 times. By analogy, when my pageSize=100, my loop executes pageSize * pageNum times. Is this correct?

Yes, it is.

Spring Data Redis is built for providing a more convenient abstraction to cater common use cases. You're looking for another optimization where you can retain the cursor and reuse it the next time you want to resume scanning.

There are two problems with that:

The result size isn't uniform and Redis can return less or more data on each iteration
Therefore, you cannot easily take a portion of data and assume you hand in the cursor and resume from that exact position.

As John mentioned, we already expose scan methods accepting a cursor at JedisKeyCommands. Therefore, we can easily expose these to RedisKeyCommands as well enabling your use-case through the connection API.

wendao8469759 commented 1 year ago

@jxblum @mp911de Thank you very much.

What you're saying is very correct. Redis scan itself has certain drawbacks, such as the possibility of duplicate keys in the returned results from two calls to scan.

So I understand why spring data redis is doing this.

mp911de commented 1 year ago

I updated the ticket to reflect its purpose.

jxblum commented 1 year ago

While it is true the JedisKeyCommands implementation (of the RedisKeyCommands interface) exposes an overloaded scan(..) method to pass in the "cursorId" as seen from source, which is then passed to the Jedis.scan(..) method (for instance, as seen here in the typeless case), there are 2 problems with this approach.

First, the JedisKeyCommmands implementation is not public (source).

Second, the equivalent LettuceKeyCommands implementation is inconsistent in that it does not expose an overloaded scan(..) method accepting the "cursorId" (source). Additionally, the LettuceKeyCommands implementation is also not public (source).

I would be reluctant to make either class public just to offer a new overloaded method (in Lettuce's case) where you could pass the "cursorId", and then do something of the effect:


    Cursor<byte[]> cursor = ((JedisKeyCommands) redisConnection.keyCommands())
        .scan(cursorId, myScanOptions);

First, this approach ties you to the Redis driver.

Second, while this allows you more control and low-level access, essentially moving you closer to the Redis driver API, you might as well just use the Redis driver's API at this point, then. What is the point of using an abstraction like Spring Data Redis (?), even if we are wrapping the Lettuce/Jedis driver API calls and providing additional facilities (like Conversion), still some of the benefits are lost, such as easily interchanging between Jedis and Lettuce.

The primary means of accessing data in any data store supported by Spring Data, and for Spring Data Redis that is Redis, of course, it either via a Template (such as the RedisTemplate) or the Spring Data [Redis] Repository abstraction.

Rather, I propose a slightly different approach...

jxblum commented 1 year ago

I propose we change the SD Redis ScanOptions type (Javadoc) to expose an API to set the "cursor", defaulting to 0 when not explicitly specified by the caller.

In this way, we do not need to expose more methods on the RedisKeyCommands interface (or higher up, in RedisTemplate / RedisOperations) that introduce yet another method with an overloaded signature, rather keeping to our abstractions.

Additionally, the scan(..), whether using Jedis or Lettuce, actually wrap the ScanOptions argument to scan(..) in the returned Cursor (Javadoc), which already has a getCursorId() method ([Javadoc](https://docs.spring.io/spring-data/redis/docs/current/api/org/springframework/data/redis/core/Cursor.html#getCursorId())).

In Spring Data Redis, the Cursor is the abstraction that represents the redis.clients.jedis.resps.ScanResult object returned by the Jedis scan(..) method and the io.lettuce.core.KeyScanCursor returned by the Lettuce scan.

Alternatively, we could, for completeness/convenience, offer a scan(cursorId:String, :ScanOptions) overloaded method on RedisKeyCommands interface, but I feel like this is unnecessary.

In my approach, you would simply:


ScanOptions scanOptions = ScanOptions.scanOptions()
    .match("..")
    .count(10)
    .build();

Cursor<?> cursor = redisTemplate.scan(scanOptions);

// process Cursor as necessary

ScanOptions nextScanOptions = ScanOptions.scanOptions()
    .from(scanOptions)
    .withCursorId(cursor.getCursorId())
    .build();

Cursor<?> nextCursor = redisTemplate.scan(nextScanOptions);

// process Cursor as necessary

In this way the APIs of the pertinent data access methods (such as the RedisTemplate) do not need to change.

Using the RedisKeyCommands approach, you would:

ScanOptions scanOptions = ScanOptions.scanOptions()
    .match("...")
    .count(10)
    .build();

Cursor<?> cursor = redisConnection.keyCommands().scan(scanOptions);

// process Cursor as necessary

Cursor<?> nextCursor = redisConnection.keyCommands().scan(cursor.getCursorId(), scanOptions);

// process Cursor as necessary

I suppose either way is not bad, but... ???

I'd much rather see us sticking to the pattern of using Sprig Data's primary data access facilities, a Template or a Spring Data Repository.

In Spring Boot, even if you are not using Spring Data Repositories (or maybe you even explicitly disabled Repository access; for instance) Spring Boot will nearly always auto-configure a RedisTemplate and StringRedisTemplate for you (source) in most cases.

jxblum commented 1 year ago

Following up to my earlier posts above, I just wanted to state that the changes to support cursor-based positioning on a SCAN is still a WIP.

I started by implementing the overloaded RedisKeyCommands interface, scan(..) method accepting a cursor (ID) as described in the title of this issue despite my argument that changing the ScanOptions would be a better approach.

It turns out, at least in the context of SD Redis, that the Cursor abstraction does not exactly do as I would expect, and certainly does not honor the behavior of the scan result (cursor) returned by the Redis SCAN command as executed directly with either the Lettuce or Jedis, Redis client driver API, such as my example demonstrated, where it will return partial results given a cursor (ID) from a prior SCAN. With SD Redis, it is all or nothing despite passing the cursor's current position.

It's possible I did something wrong, but it's equally likely that the Cursor abstraction is not correct.

For instance, scan(..) should NOT be called inside the ScanCursor's hasNext() method. This can lead to a StackOverflow if the state of the Cursor is out-of-sync.

The Cursor, much like a standard Java Iterator, should not advance the position of the cursor inside hasNext(); it should simply query whether results remain based on the current position of the iteration in the collection/stream of elements being iterated. Only next() is technically responsible for fetching the next single or batch/set of intermediate result(s). This is precisely backward in the SD Redis ScanCursor base class.

I need to re-look at this more closely when I return to work late next week. For the time being, this will need to wait, unless @mp911de wants to have a look in the meantime.

I put the changes on the topic/issue branch issue/2584 for the curious onlooker.

mp911de commented 1 year ago

The proposal to add the cursorId to ScanOptions makes a lot more sense. After all, ScanOptions denote the entry point to scanning. We need to keep in mind, that scanning in cluster mode will likely not work when providing a cursorId as cluster-scanning uses the node as part of its state.