sewenew / redis-plus-plus

Redis client written in C++
Apache License 2.0
1.6k stars 347 forks source link

[BUG] Failing assert when using AsyncRedisCluster with Role::SLAVE #472

Closed prm-james-hill closed 1 year ago

prm-james-hill commented 1 year ago

Describe the bug When constructing AsyncRedisCluster without specifying the role, there's no issue. When specifying the role, sending a command via the client sometimes fails the following assert:

Assertion `_state == State::ENABLE_READONLY' failed.

For what it's worth, the state value causing the assert to fail is "BROKEN".

To Reproduce

sw::redis::ConnectionOptions connection_opts;
connection_opts.host = ...;
connection_opts.port = ...;
connection_opts.connect_timeout = std::chrono::milliseconds(...);
connection_opts.socket_timeout = std::chrono::milliseconds(...);

auto client = std::make_unique<sw::redis::AsyncRedisCluster>(
    connection_opts, sw::redis::ConnectionPoolOptions(), sw::redis::Role::SLAVE);
client->command<std::optional<std::string>>(
    "get",
    "placeholder",
    [](sw::redis::Future<std::optional<std::string>> &&fut) {
      try {
        fut.get();
      } catch (const sw::redis::Error &err) {
        abort();
      }
    });

Expected behavior I expected the behavior to be the same as when the role was not specified.

Environment:

Additional context I'm a newbie to Redis so if this is actually expected behavior, my bad.

sewenew commented 1 year ago

Sorry, but I cannot reproduce your problem. Can you give the steps to reproduce the problem with the code you given? From the information you given, it seems that the connection is broken. Not sure if your slave nodes are down when the assertion hit.

I'm a newbie to Redis so if this is actually expected behavior, my bad.

NO, it's not an expected behavior. If I can reproduce the problem, I'll fix it ASAP.

Regards

prm-james-hill commented 1 year ago

Some details I can add:

If that still doesn't help, how hard do you estimate it'd be for me to debug this myself as a newbie to this repo too?

sewenew commented 1 year ago

Thanks for the info! I'll try to reproduce it.

how hard do you estimate it'd be for me to debug this myself as a newbie to this repo too

You can use GDB to debug it. Compile redis-plus-plus in Debug mode, i.e. run cmake with -DCMAKE_BUILD_TYPE=Debug, and run the code in GDB.

Please ensure that you installed one and only one version of hiredis. If your system installs multiple hiredis, e.g. one is installed with make, and the other is installed with apt-get, you might get some wired problem.

Also you can try to uninstall redis-plus-plus and hiredis, and do a reinstall before debugging the code.

Regards

prm-james-hill commented 1 year ago

@sewenew I spent some time looking into this today. The issue seems to be that, for a particular connection, the following series of events is possible:

  1. State is set to CONNECTING
  2. The READONLY command is sent
  3. State is set to ENABLE_READONLY
  4. State is set to BROKEN
  5. The response for the READONLY command arrives, and the state is still expected to be ENABLE_READONLY

It’s not clear to me how the ordering of the last two events is possible (e.g. if it’s network reordering, or if the BROKEN logic is triggered in redis-plus-plus before hiredis can use the callback for the READONLY command). Even if you can’t repro the issue, maybe it’s obvious to you how this is possible.

As far as I can tell, the broken connection never has any pending events, so a potential workaround is just having the first case in connect_callback()’s switch statement be:

        case State::BROKEN:
            break;

But there’s a lot I still don’t understand about the codebase so maybe this is bad (e.g. will these broken connections just stay alive and accumulate?)

sewenew commented 1 year ago

Thanks for your debugging info!

If step 4 happens, i.e. the connection is broken, it should not call readonly command’s callback, instead, it should call an error callback. I’m not sure it’s a hiredis bug, or I misused hiredis’ api.

Sorry, but I’m on vacation and cannot debug it. I’ll look into it when I go back to work.

Regards


发件人: James @.> 发送时间: Sunday, April 30, 2023 10:15:13 AM 收件人: sewenew/redis-plus-plus @.> 抄送: sewenew @.>; Mention @.> 主题: Re: [sewenew/redis-plus-plus] [BUG] Failing assert when using AsyncRedisCluster with Role::SLAVE (Issue #472)

@sewenewhttps://github.com/sewenew I spent some time looking into this today. The issue seems to be that, for a particular connection, the following series of events is possible:

  1. State is set to CONNECTING
  2. The READONLY command is sent
  3. State is set to ENABLE_READONLY
  4. State is set to BROKEN
  5. The response for the READONLY command arrives, and the state is still expected to be ENABLE_READONLY

It’s not clear to me how the ordering of the last two events is possible (e.g. if it’s network reordering, or if the BROKEN logic is triggered in redis-plus-plus before hiredis can use the callback for the READONLY command). Even if you can’t repro the issue, maybe it’s obvious to you how this is possible.

As far as I can tell, the broken connection never has any pending events, so a potential workaround is just having the first case in connect_callback()’s switch statement be:

    case State::BROKEN:
        break;

But there’s a lot I still don’t understand about the codebase so maybe this is bad (e.g. will these broken connections just stay alive and accumulate?)

― Reply to this email directly, view it on GitHubhttps://github.com/sewenew/redis-plus-plus/issues/472#issuecomment-1528921084, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACWWTAJDUU3HYGS7THVDP5TXDXDLDANCNFSM6AAAAAAW52P5EU. You are receiving this because you were mentioned.Message ID: @.***>

sewenew commented 1 year ago

@prm-james-hill I did some research, and found out that the connection might be closed, i.e. state is set to BROKEN, before the connect_callback is called. And your proposal should work, since in this case, all events of the closed connection have already been processed, and we can safely ignore this connect_callback.

I've already fix this problem, and you can test it with the latest code on master branch.

Thanks again for finding the bug! And sorry for the late reply...

Regards

prm-james-hill commented 1 year ago

Confirming that the latest commit resolves this issue. Thanks for getting this fixed!