sewenew / redis-plus-plus

Redis client written in C++
Apache License 2.0
1.64k stars 351 forks source link

[QUESTION] Correct way to handle TimeoutError due to EAGAIN from Redis::subscriber? #327

Closed steve-lorimer closed 2 years ago

steve-lorimer commented 2 years ago

Thanks for a great package!

Describe the problem

Attempting to obtain a Subscriber from a Redis connection fails with a TimeoutError exception being thrown, with the description being "Resource temporarily unavailable" (ie: errno EAGAIN)

terminate called after throwing an instance of 'sw::redis::TimeoutError'
  what():  Failed to get reply: Resource temporarily unavailable

Thread 2 "sg_redis" received signal SIGABRT, Aborted.

Stacktrace shows this is from Connection::recv as a result of Connection::_auth during call to Redis::subscriber

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7bda859 in __GI_abort () at abort.c:79
#2  0x0000000000425819 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x000000000066582a in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#4  0x0000000000665895 in std::terminate () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:58
#5  0x00000000006659e8 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x724160 <typeinfo for sw::redis::TimeoutError>, dest=0x633a20 <sw::redis::TimeoutError::~TimeoutError()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:95
#6  0x0000000000420d97 in sw::redis::throw_error(redisContext const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) [clone .cold] ()
#7  0x000000000062b3cc in sw::redis::Connection::recv(bool) () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:52
#8  0x000000000062b5de in sw::redis::Connection::_auth() () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:52
#9  0x000000000062b8d9 in sw::redis::Connection::_set_options() () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:52
#10 0x000000000062badd in sw::redis::Connection::Connection(sw::redis::ConnectionOptions const&) () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:52
#11 0x0000000000631034 in sw::redis::ConnectionPool::create() () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:52
#12 0x00000000006349f4 in sw::redis::Redis::subscriber() () at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:52

What is the correct way to handle this?

At the moment I'm just attempting to create a subscriber in a loop, and if a TimeoutError exception is thrown, I just try again, up to some arbitrarily picked max number of retries... it seems to work... but it also seems inelegant / a bit of a hack.

sw::redis::Subscriber subscriber(sw::redis::Redis& redis)
{
  for (int i = 0; i < kMaxSubscribeRetries; ++i)
  {
    try
    {
      sw::redis::Subscriber sub{redis.subscriber()};
      sub.subscribe("foo.bar");
      return sub;
    }
    catch (const sw::redis::TimeoutError&)
    {
      LOG_WARNING("timeout creating redis subscriber; sleeping for a bit and trying again");
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
  }
  throw std::runtime_error("failed to create redis subscriber");
}

What is the recommended way for obtaining a Subscriber please?

Environment:

steve-lorimer commented 2 years ago

Hmm, actually the retry-loop didn't help... it still can't connect even after trying multiple times in a loop.

What's really weird is that the binary, when running on a different server, is able to connect fine; and redis-cli is able to connect from the problematic server... something to do with the combination of the server and the binary using redis-plus-plus is causing issues.

wingunder commented 2 years ago

Hi @skebanga,

What is the recommended way for obtaining a Subscriber please?

Please see this post: https://github.com/sewenew/redis-plus-plus/issues/307#issuecomment-1002632214 Also read the chapter on Publish/Subscribe in the top level README.md and refer to the example in that chapter.

something to do with the combination of the server and the binary using redis-plus-plus is causing issues.

Try to see if the same libs are being used on all your servers. Run the following command on all your servers and compare the results carefully:

$ ldd your_binary

In order to be absolutely sure that your libs and binaries are all the same on all servers, you could compare their checksums:

$ ldd your_binary |awk -F'(' '{print $1}' |awk -F'=>' '{print $2}' |grep -v '^\s*$' |xargs sha256sum
$ sha256sum your_binary

You probably checked the connection (to make sure you don't have a network issue or a firewall problem) with redis-cli. If not, please try the following from all servers as well:

$ redis-cli -h <redis_server> -p <redis_port> ping

Regards

steve-lorimer commented 2 years ago

Thanks for the comment!

I am statically linking the redis libs, so that's not an issue

# ldd test_redis
        linux-vdso.so.1 (0x00007ffeb956a000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2e50cf5000)
        libgcc_s.so.1 => /usr/local/gcc-11.2.0/lib64/libgcc_s.so.1 (0x00007f2e50cda000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2e50cb7000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2e50ac5000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f2e50e4e000)

I have also managed to boil the issue down to the following, which shows that it is failing on issuing the AUTH command to redis:

#include <sw/redis++/redis++.h>
#include <iostream>
#include <chrono>

int main()
{
  sw::redis::ConnectionOptions opts("tcp://10.10.6.106:6379");

  opts.user = "default";
  opts.password = "XXXXXXXX";

  opts.socket_timeout = std::chrono::milliseconds(100);

  sw::redis::Redis r{opts}; 
  sw::redis::Subscriber sub{r.subscriber()};

  return 0;
}

It is working with redis_cli, but not with the above test app...

running both through strace I can see the following:

redis_cli:

socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
connect(3, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("10.10.6.106")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=3, events=POLLOUT}], 1, -1)   = 1 ([{fd=3, revents=POLLOUT}])
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
fcntl(3, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(3, F_SETFL, O_RDWR)               = 0
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(3, SOL_TCP, TCP_KEEPIDLE, [15], 4) = 0
setsockopt(3, SOL_TCP, TCP_KEEPINTVL, [5], 4) = 0
setsockopt(3, SOL_TCP, TCP_KEEPCNT, [3], 4) = 0
write(3, "*2\r\n$4\r\nAUTH\r\n$50\r\n4XXXXXXXX"..., 71) = 71
read(3, "+OK\r\n", 16384)               = 5

test_redis:

socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
connect(3, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("10.10.6.106")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=3, events=POLLOUT}], 1, -1)   = 1 ([{fd=3, revents=POLLOUT}])
connect(3, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("10.10.6.106")}, 16) = 0
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
fcntl(3, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(3, F_SETFL, O_RDWR)               = 0
setsockopt(3, SOL_SOCKET, SO_RCVTIMEO_OLD, "\0\0\0\0\0\0\0\0\240\206\1\0\0\0\0\0", 16) = 0
setsockopt(3, SOL_SOCKET, SO_SNDTIMEO_OLD, "\0\0\0\0\0\0\0\0\240\206\1\0\0\0\0\0", 16) = 0
sendto(3, "*2\r\n$4\r\nAUTH\r\n$50\r\n4XXXXXXXX"..., 71, 0, NULL, 0) = 71
recvfrom(3, 0x7ffe9ddbbbd0, 16384, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)

So major differences I can see are:

I'm a bit lost at this point tbh! :(

steve-lorimer commented 2 years ago

I ran my test program on a server which works, and got the following strace output:

socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
connect(3, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("10.10.6.106")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=3, events=POLLOUT}], 1, -1)   = 1 ([{fd=3, revents=POLLOUT}])
connect(3, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("10.10.6.106")}, 16) = 0
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
fcntl(3, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(3, F_SETFL, O_RDWR)               = 0
setsockopt(3, SOL_SOCKET, SO_RCVTIMEO_OLD, "\0\0\0\0\0\0\0\0\240\206\1\0\0\0\0\0", 16) = 0
setsockopt(3, SOL_SOCKET, SO_SNDTIMEO_OLD, "\0\0\0\0\0\0\0\0\240\206\1\0\0\0\0\0", 16) = 0
sendto(3, "*2\r\n$4\r\nAUTH\r\n$50\r\n4Vg2dTZVyuh2b"..., 71, 0, NULL, 0) = 71
recvfrom(3, "+OK\r\n", 16384, 0, NULL, NULL) = 5

So literally the only difference between the two is that the failing server's recvfrom is:

recvfrom(3, 0x7ffe9ca7aa40, 16384, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)

whereas the working server's recvfrom is:

recvfrom(3, "+OK\r\n", 16384, 0, NULL, NULL) = 5
steve-lorimer commented 2 years ago

Well - some progress it seems!

The culprit is opts.socket_timeout, and increasing it to a higher value means the recvfrom call succeeds.

I'll admit I'm no redis expert... but would it be fair to say it seems a bit weird that 100ms is too low?

sewenew commented 2 years ago

@skebanga Yes, I think the problem is that socket_timeout is too small for your redis server.

redis-cli by default might not set a timeout, so it does not have the problem. If you set socket_timeout to 0 when using redis-plus-plus, it should work fine too.

You’d better check if your redis server in a healthy status, or the network between your application and redis is not fast enough.

Regards

steve-lorimer commented 2 years ago

@sewenew thanks for the comment! The only issue with setting a socket_timeout to 0 is that when I enter the consume loop it will never return, and so then I can never check my std::stop_token and break the loop when I want the thread to stop.

  while (!token.stop_requested()) 
  {
    try
    {
      sub.consume(); // without socket_timeout this call never returns if there are no published messages
    }
    catch (const sw::redis::TimeoutError&)
    {
      // by using socket_timeout we can catch the TimeoutError and check the std::stop_token
    }
    catch (const sw::redis::Error& e)
    {
      LOG_ERROR("redis error: ", e.what());
      break;
    }
  }

Ideally there would be a thread-safe way to cause sub.consume() to return, thereby allowing the thread to exit

Mephalich commented 2 years ago

It looks like same issue I have posted several days ago here (https://github.com/sewenew/redis-plus-plus/issues/323). For now we have a workaround: just publish a fake message to one of the subscribed channels to rotate the loop.

@sewenew it appears some general solution to this will be highly appreciated. 👍

Best regards

steve-lorimer commented 2 years ago

just publish a fake message to one of the subscribed channels to rotate the loop.

:sob:

sewenew commented 2 years ago

@skebanga

The only issue with setting a socket_timeout to 0 is that when I enter the consume loop it will never return, and so then I can never check my std::stop_token and break the loop when I want the thread to stop.

You don't need to set it to 0. Your problem is that your network latency is too big, and if the socket_timeout is smaller than it, it cannot do authentication and subscription before consuming messages.

In order to make it work you need to set socket_timeout larger than the network latency.

Ideally there would be a thread-safe way to cause sub.consume() to return, thereby allowing the thread to exit

As @Mephalich mentioned you can send a termination message to the channel. Once consume is called, it blocks until there's a message or timeout reaches.

Regards

steve-lorimer commented 2 years ago

@sewenew thanks for the comments! Yes, understand the need to set timeout higher than network latency in this case...

However, I think in an ideal world it would be nice to be able to set zero timeout, as in I don't care how long the authentication and subscription takes, just so long as it completes... but also be able to effect a break out of the consume loop when I want to shut down.

Forcing the user to have to set a timeout just so they can exit the loop is awkward...

Anyway, I'll close the issue, because my question has been answered (even if I don't like the answer!) :)