[BUG] Failing assert when using AsyncRedisCluster with Role::SLAVE

prm-james-hill commented 1 year ago

Describe the bug When constructing AsyncRedisCluster without specifying the role, there's no issue. When specifying the role, sending a command via the client sometimes fails the following assert:

Assertion `_state == State::ENABLE_READONLY' failed.

For what it's worth, the state value causing the assert to fail is "BROKEN".

To Reproduce

sw::redis::ConnectionOptions connection_opts;
connection_opts.host = ...;
connection_opts.port = ...;
connection_opts.connect_timeout = std::chrono::milliseconds(...);
connection_opts.socket_timeout = std::chrono::milliseconds(...);

auto client = std::make_unique<sw::redis::AsyncRedisCluster>(
    connection_opts, sw::redis::ConnectionPoolOptions(), sw::redis::Role::SLAVE);
client->command<std::optional<std::string>>(
    "get",
    "placeholder",
    [](sw::redis::Future<std::optional<std::string>> &&fut) {
      try {
        fut.get();
      } catch (const sw::redis::Error &err) {
        abort();
      }
    });

Expected behavior I expected the behavior to be the same as when the role was not specified.

Environment:

OS: Ubuntu 22.04.2 LTS
Compiler: clang 15
hiredis version: v1.1.0 (c14775b)
redis-plus-plus version: 78f4e04

Additional context I'm a newbie to Redis so if this is actually expected behavior, my bad.

sewenew commented 1 year ago

Sorry, but I cannot reproduce your problem. Can you give the steps to reproduce the problem with the code you given? From the information you given, it seems that the connection is broken. Not sure if your slave nodes are down when the assertion hit.

I'm a newbie to Redis so if this is actually expected behavior, my bad.

NO, it's not an expected behavior. If I can reproduce the problem, I'll fix it ASAP.

Regards

prm-james-hill commented 1 year ago

Some details I can add:

I'm making many (i.e. hardware_concurrency) of these clients at once. I can even repro it from creating and commanding them sequentially but it's much, much rarer. I'm guessing the concurrency just makes it more likely for the connection to be broken.
I can't repro it with a local cluster (presumably because the connection never breaks). FYI that my cluster is in AWS ElastiCache, though I assume this doesn't introduce any unique circumstances on the client side. I can't see any evidence of the slave nodes being down.

If that still doesn't help, how hard do you estimate it'd be for me to debug this myself as a newbie to this repo too?

sewenew commented 1 year ago

Thanks for the info! I'll try to reproduce it.

how hard do you estimate it'd be for me to debug this myself as a newbie to this repo too

You can use GDB to debug it. Compile redis-plus-plus in Debug mode, i.e. run cmake with -DCMAKE_BUILD_TYPE=Debug, and run the code in GDB.

Please ensure that you installed one and only one version of hiredis. If your system installs multiple hiredis, e.g. one is installed with make, and the other is installed with apt-get, you might get some wired problem.

Also you can try to uninstall redis-plus-plus and hiredis, and do a reinstall before debugging the code.

Regards

prm-james-hill commented 1 year ago

@sewenew I spent some time looking into this today. The issue seems to be that, for a particular connection, the following series of events is possible:

State is set to CONNECTING
The READONLY command is sent
State is set to ENABLE_READONLY
State is set to BROKEN
The response for the READONLY command arrives, and the state is still expected to be ENABLE_READONLY

It’s not clear to me how the ordering of the last two events is possible (e.g. if it’s network reordering, or if the BROKEN logic is triggered in redis-plus-plus before hiredis can use the callback for the READONLY command). Even if you can’t repro the issue, maybe it’s obvious to you how this is possible.

As far as I can tell, the broken connection never has any pending events, so a potential workaround is just having the first case in connect_callback()’s switch statement be:

        case State::BROKEN:
            break;

But there’s a lot I still don’t understand about the codebase so maybe this is bad (e.g. will these broken connections just stay alive and accumulate?)

sewenew commented 1 year ago

Thanks for your debugging info!

If step 4 happens, i.e. the connection is broken, it should not call readonly command’s callback, instead, it should call an error callback. I’m not sure it’s a hiredis bug, or I misused hiredis’ api.

Sorry, but I’m on vacation and cannot debug it. I’ll look into it when I go back to work.

Regards

发件人: James @.> 发送时间: Sunday, April 30, 2023 10:15:13 AM 收件人: sewenew/redis-plus-plus @.> 抄送: sewenew @.>; Mention @.> 主题: Re: [sewenew/redis-plus-plus] [BUG] Failing assert when using AsyncRedisCluster with Role::SLAVE (Issue #472)

@sewenewhttps://github.com/sewenew I spent some time looking into this today. The issue seems to be that, for a particular connection, the following series of events is possible:

State is set to CONNECTING
The READONLY command is sent
State is set to ENABLE_READONLY
State is set to BROKEN
The response for the READONLY command arrives, and the state is still expected to be ENABLE_READONLY

It’s not clear to me how the ordering of the last two events is possible (e.g. if it’s network reordering, or if the BROKEN logic is triggered in redis-plus-plus before hiredis can use the callback for the READONLY command). Even if you can’t repro the issue, maybe it’s obvious to you how this is possible.

As far as I can tell, the broken connection never has any pending events, so a potential workaround is just having the first case in connect_callback()’s switch statement be:

    case State::BROKEN:
        break;

But there’s a lot I still don’t understand about the codebase so maybe this is bad (e.g. will these broken connections just stay alive and accumulate?)

― Reply to this email directly, view it on GitHubhttps://github.com/sewenew/redis-plus-plus/issues/472#issuecomment-1528921084, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACWWTAJDUU3HYGS7THVDP5TXDXDLDANCNFSM6AAAAAAW52P5EU. You are receiving this because you were mentioned.Message ID: @.***>

sewenew commented 1 year ago

@prm-james-hill I did some research, and found out that the connection might be closed, i.e. state is set to BROKEN, before the connect_callback is called. And your proposal should work, since in this case, all events of the closed connection have already been processed, and we can safely ignore this connect_callback.

I've already fix this problem, and you can test it with the latest code on master branch.

Thanks again for finding the bug! And sorry for the late reply...

Regards

prm-james-hill commented 1 year ago

Confirming that the latest commit resolves this issue. Thanks for getting this fixed!

sewenew / redis-plus-plus

[BUG] Failing assert when using AsyncRedisCluster with Role::SLAVE #472