redis / redis-rb

A Ruby client library for Redis
MIT License

ERR Protocol error: too big inline request (Redis::CommandError) #1185

Open Jakanapes opened 1 year ago

Jakanapes commented 1 year ago

Redis 5.3.3
Ruby 3.1.3
Rails 7.0.4.2
redis gem 5.0.6
resque gem 2.4.0

Just upgraded the redis/resque gems and their dependencies, and I'm getting this intermittent error: ERROR -- : ERR Protocol error: too big inline request (Redis::CommandError)

The backtrace is not super helpful:

E, [2023-03-14T14:41:13.543261 #44489] ERROR -- : ERR Protocol error: too big inline request (Redis::CommandError)
.....shared/bundle/ruby/3.1.0/gems/redis-5.0.6/lib/redis/subscribe.rb:52:in `subscription'
.....shared/bundle/ruby/3.1.0/gems/redis-5.0.6/lib/redis/subscribe.rb:17:in `subscribe'
.....shared/bundle/ruby/3.1.0/gems/redis-5.0.6/lib/redis.rb:188:in `_subscription'
.....shared/bundle/ruby/3.1.0/gems/redis-5.0.6/lib/redis/commands/pubsub.rb:17:in `subscribe'
.....shared/bundle/ruby/3.1.0/gems/actioncable-7.0.4.2/lib/action_cable/subscription_adapter/redis.rb:83:in `block in listen'
.....shared/bundle/ruby/3.1.0/gems/redis-client-0.12.1/lib/redis_client.rb:645:in `ensure_connected'
.....shared/bundle/ruby/3.1.0/gems/redis-5.0.6/lib/redis/client.rb:104:in `disable_reconnection'
.....shared/bundle/ruby/3.1.0/gems/redis-5.0.6/lib/redis.rb:79:in `without_reconnect'
.....shared/bundle/ruby/3.1.0/gems/actioncable-7.0.4.2/lib/action_cable/subscription_adapter/redis.rb:80:in `listen'
.....shared/bundle/ruby/3.1.0/gems/actioncable-7.0.4.2/lib/action_cable/subscription_adapter/redis.rb:154:in `block in ensure_listener_running'

It seems to indicate that ActionCable is involved, but we use it a lot and are having trouble tracking it down. I can't reproduce it, and it appears on calls that don't trigger any ActionCable actions.

The error is generated by the Redis server, which has not changed, but looking at the gem diff, the error being raised in subscribe.rb is new. Am I looking at something that was happening all along but was being swallowed by earlier versions of the redis gem?

byroot commented 1 year ago

So this error is thrown by the server here: https://github.com/redis/redis/blob/2dd5c3a180a9892c339b66f76cc203f8b0f6fe5a/src/networking.c#L2112-L2115

And that constant is defined as 64kB:

#define PROTO_INLINE_MAX_SIZE   (1024*64)

So this seems to suggest that somehow the client is sending more than 64kB of data without a single newline, which is very fishy.
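(As a rough illustration of what "inline" means here, not something from this thread: an inline request is the plain-text, newline-terminated form you could type straight into a socket, while redis-rb speaks RESP, where every argument is length-prefixed. The channel name below is just an example.)

# The same SUBSCRIBE command in both wire formats (illustrative only).
# RESP (what redis-rb sends): an array header plus length-prefixed arguments,
# so the server never has to scan for a newline.
resp   = "*2\r\n$9\r\nSUBSCRIBE\r\n$22\r\n_action_cable_internal\r\n"
# Inline: the server only treats input as inline when it does NOT start with
# '*', and it raises "too big inline request" if it reads 64kB of data
# without finding a newline.
inline = "SUBSCRIBE _action_cable_internal\r\n"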

But somehow that error rings a bell; it feels like I've researched this before.

Jakanapes commented 1 year ago

What would be sending inline, though? Looking at the ActionCable source, it's calling publish in Redis::Commands::Pubsub, I believe. We're trying to eliminate possible causes, so we're tracking any AC broadcast messages over 64kB, but they don't correspond to the Redis protocol errors because those bubble up in a separate thread from the AC listener. Neither the AC nor the Redis versions have changed; just the redis gem got bumped up from 4.1.0.

byroot commented 1 year ago

What would be sending inline, though?

Yeah, I don't know; it could be an overly long channel name or something like that. It will be hard to debug without a repro.

I'm not sure what your setup looks like, but if you have that flexibility, I'd try to add some logging there (https://github.com/redis/redis-rb/blob/e4f5299aa0109d1334564f20ae8522866ef554e9/lib/redis/subscribe.rb#L49) to see what kind of command is emitted and whether you can figure out a smaller repro?
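(A minimal sketch of that kind of logging, assuming the redis 5.0.6 / redis-client 0.12.x internals where the subscribe machinery issues its commands through RedisClient::PubSub#call_v; the module name and the use of Rails.logger are placeholders, and the exact class to patch should be checked against the installed gem source.)

# Hypothetical debugging patch, not part of redis-rb.
module PubSubCommandLogger
  def call_v(command)
    # Log the size and a truncated dump of every pubsub command sent.
    bytes = command.sum { |arg| arg.to_s.bytesize }
    Rails.logger.info("[redis-debug] pubsub command (#{bytes} bytes): #{command.inspect[0, 200]}")
    super
  end
end

RedisClient::PubSub.prepend(PubSubCommandLogger)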

Jakanapes commented 1 year ago

So looking at the backtrace, the line in ActionCable that ends up at call_v is here:

https://github.com/rails/rails/blob/7-0-stable/actioncable/lib/action_cable/subscription_adapter/redis.rb#L92

and it's explicitly sending _action_cable_internal, which is passed all the way through to call_v in the *channels param. Am I reading the code correctly there, that the events pulled from client.next_event are a direct result of the call_v command?

I can try to add some logging there, but if that's the case, I don't see how it could be anything other than the hardcoded channel, let alone a 64kB-long channel name.

byroot commented 1 year ago

Hum, you're right.

Jakanapes commented 1 year ago

actioncable sets the message callback of the Subscription class to

                on.message do |chan, message|
                  broadcast(chan, message)
                end

If I'm reading it right, this block under call_v

      while event = @client.next_event(timeout)
        if event.is_a?(::RedisClient::CommandError)
          raise Client::ERROR_MAPPING.fetch(event.class), event.message
        end

        type, *rest = event
        if callback = sub.callbacks[type]
          callback.call(*rest)
        end
        break if type == stop && rest.last == 0
      end

is a blocking while loop in its own thread? So it's likely that it subscribes to the channel, then waits for incoming messages. Eventually it gets a message and calls the callback, which attempts to broadcast it on the channel, and that is what triggers the "too big" error, which it picks up on the next loop?

If that's the case, though, we've attempted sending large messages through to trigger the exception, with no luck.

byroot commented 1 year ago

Note that it could be a total red herring. When googling for that error, you can see lots of these being resolved in various ways on the server side.

Given that you say the issue started when you upgraded, it makes sense to assume the client is doing something wrong, but that wrong thing may not be sending a packet that is too big.

just the redis gem getting bumped up from 4.1.0

Have you tried the latest 4.x? Mostly to try to pinpoint when the issue might have been introduced.

Another good thing to try would be to enable the hiredis bindings and see if that changes anything. That could help track down the source of the issue.
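(For illustration, the bisection could look something like the Gemfile lines below; the exact version constraints are hypothetical.)

# Hypothetical bisection steps, tried one at a time in the Gemfile:
# 1. Pin the last 4.x series to see whether the error disappears:
gem "redis", "~> 4.8"
# 2. Or stay on 5.x and add the hiredis bindings; redis-client picks the
#    hiredis driver up automatically when the gem is installed:
# gem "redis", "~> 5.0"
# gem "hiredis-client"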

Jakanapes commented 1 year ago

That is a concern: since the exception being raised is new in this version, it could be something that was happening all along and was either being swallowed or surfaced as another error that we haven't noticed is no longer happening.

[1] pry(main)> Redis::Connection.drivers
=> [RedisClient::HiredisConnection]

we're already using the hiredis-client gem.

Is there a way to see if the AC Redis instance is using inline? I thought RESP was what was always used?

byroot commented 1 year ago

we're already using the hiredis-client gem.

Then the opposite, try to disable it. It's really just a basic bisection to try to narrow down where the bug could be.
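(One possible way to do that bisection without removing the gem from the bundle, assuming the redis 5.x driver: option is forwarded to redis-client as it appears to be; worth verifying against the installed version.)

# Hypothetical: force the pure-Ruby protocol implementation even while
# hiredis-client is installed, to rule the hiredis bindings in or out.
redis = Redis.new(url: ENV["REDIS_URL"], driver: :ruby)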

Is there a way to see if the AC redis instance is using inline?

I'm not sure what you are referring to by inline.

byroot commented 1 year ago

Oh wait, hum, another thing: is there SSL in the mix? I can't remember nor find where I saw that "too big inline request" before, but I wonder if it wasn't a client trying to connect to an unsecured server using SSL, and the SSL handshake packets were tripping the server.

Just a wild guess though.

Jakanapes commented 1 year ago

It does look like the gem uses RESP2

      def config(**kwargs)
        super(protocol: 2, **kwargs)
      end

The error itself says "inline request" and the comments on the method in the redis source say "Like processMultibulkBuffer(), but for the inline protocol instead of RESP" https://github.com/redis/redis/blob/2dd5c3a180a9892c339b66f76cc203f8b0f6fe5a/src/networking.c#L2094-L2117

However, it seems like every client in existence uses RESP.

I'll have to check on the SSL issue, but if that was the case, I'd expect it to happen on every message, no?

byroot commented 1 year ago

Oh wait, this is an intermittent issue?

byroot commented 1 year ago

however, it seems like every client in existence uses resp.

Yeah, I have no idea what the "inline protocol" is supposed to be, never heard of it, and I can confirm that redis-rb uses RESP2.

Is there a proxy or something between the client and Redis? Or anything else that isn't just a raw standalone redis-server?

Jakanapes commented 1 year ago

Oh wait, this is an intermittent issue?

Oh yes, intermittent and extremely elusive. We've had zero luck recreating it on purpose. We see the error 6-8 times a day among thousands and thousands of requests. No other issues that we can tell on straight Redis calls for caching and Resque use; it's just something with ActionCable.

We use consul to manage the url, but ATM we only have a single redis server in production.

byroot commented 1 year ago

Hum, that is extremely intriguing. I'll try to find some time to review if it's possible that on 4.x that error was swallowed or retried. That could explain it, e.g. a pre-existing issue with your server, and now we're just reporting it.

Jakanapes commented 1 year ago

It looks like in the previous version the errors were being raised in call_loop in client.rb, but the ProtocolError child class was overriding initialize:

  # Raised by the connection when a protocol error occurs.
  class ProtocolError < BaseError
    def initialize(reply_type)
      super(<<-EOS.gsub(/(?:^|\n)\s*/, " "))
        Got '#{reply_type}' as initial reply byte.
        If you're in a forking environment, such as Unicorn, you need to
        connect to Redis after forking.
      EOS
    end
  end
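(As an aside, and purely to illustrate what the 4.x class quoted above does: with that override, the exception message only describes the unexpected reply byte and the forking caveat, so the server's own error text would not appear in it. Hypothetical example, assuming a redis 4.x install:)

Redis::ProtocolError.new("-").message
# => " Got '-' as initial reply byte. If you're in a forking environment, such as Unicorn, you need to connect to Redis after forking. "
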
Matt-Yorkley commented 1 year ago

I'm also seeing this intermittently after upgrading redis 4.8.1 to redis 5.0.6 and switching out hiredis 0.6.3 for hiredis-client 0.14.1. Same error and backtrace, and it's also rare and difficult to pin down...

Jakanapes commented 1 year ago

Spent the time after my last attempt slow-rolling the upgrade: the redis gem up to 4.8.1 and resque up to 2.5.0, with no issues.

We have since moved from our old redis servers to GCP memorystore, which hosts redis_version:6.2.11

In addition, there was a Rails upgrade that fixed an ActionCable Redis config issue with Sentinel: https://github.com/rails/rails/blob/v7.0.6/actioncable/CHANGELOG.md#rails-706-june-29-2023

The final upgrade was swapping out hiredis for hiredis-client and upgrading the redis gem to 5.0.6

Within an hour I had received the error from a very, very small action cable call

Jakanapes commented 1 year ago

Hmm, in fact, I can't find us using ActionCable at all in the call that Airbrake captures as the source. It could be some sneaky metaprogramming somewhere, but I'm not finding it.

Jakanapes commented 1 year ago

@Matt-Yorkley, did you ever find any cause? I haven't been able to try again in a while, but I'm working on some more logging in the action cable process to see if I can track it down.

mkllnk commented 9 months ago

No, Matt didn't solve the issue for us. I just ran CI again on our latest code and still saw this error popping up. Since Sam claimed that hiredis doesn't bring any performance improvement I decided to remove the gem from our code and just use the default redis gem. Let me know if you come to another conclusion.

Jakanapes commented 9 months ago

No, Matt didn't solve the issue for us. I just ran CI again on our latest code and still saw this error popping up. Since Sam claimed that hiredis doesn't bring any performance improvement I decided to remove the gem from our code and just use the default redis gem. Let me know if you come to another conclusion.

I added more logging around what gets passed to the redis gem from action cable, but no smoking guns. Still very intermittent and haven't been able to consistently reproduce it.

Are you saying that removing hiredis eliminated the issue?

mkllnk commented 9 months ago

Are you saying that removing hiredis eliminated the issue?

Yes, at least in our CI environment. We haven't tested production yet.