Open Jakanapes opened 1 year ago
So this error is thrown by the server here: https://github.com/redis/redis/blob/2dd5c3a180a9892c339b66f76cc203f8b0f6fe5a/src/networking.c#L2112-L2115
And that constant is defined as 64kB:
#define PROTO_INLINE_MAX_SIZE (1024*64)
So this seem to suggest that somehow the client sends more than 64kB of data without a single newline, that is very fishy.
But somehow that error rings a bell, it feels like I researched this before.
What would be sending inline, though? Looking at the ActionCable source, it's calling publish in Redis::Commands::Pubsub
, I believe. We're trying to eliminate possible causes so we're tracking any AC broadcast messages over 64k, but they don't correspond to the redis protocol errors because they bubble up in a separate thread from the AC listener. Neither AC nor Redis versions have changed, just the redis gem getting bumped up from 4.1.0
What would be sending inline, though?
Yeah, I don't know, it could be an overly long channel name on something like that. It will be hard to debug without a repro.
I'm not sure what you setup look like, but if you have that flexibility I'd try to add some logging in there https://github.com/redis/redis-rb/blob/e4f5299aa0109d1334564f20ae8522866ef554e9/lib/redis/subscribe.rb#L49 to see what can of command is emitted and whether you can figure out a smaller repro?
So looking at the backtrace, the line in actioncable that ends up at call_v
is here
and it's explicitly sending _action_cable_internal
, which is sent all the way through to call_v
in the *channels
params. Am I reading the code correctly there that the events pulled from client.next_event
are a direct result of the call_v
command?
I can try to add some logging there, but if that's the case, I don't see how it could be anything other than the hardcoded channel, let along a 64k long channel name
Hum, you're right.
actioncable sets the message callback of the Subscription
class to
on.message do |chan, message|
broadcast(chan, message)
end
If I'm reading it right, this block under call_v
while event = @client.next_event(timeout)
if event.is_a?(::RedisClient::CommandError)
raise Client::ERROR_MAPPING.fetch(event.class), event.message
end
type, *rest = event
if callback = sub.callbacks[type]
callback.call(*rest)
end
break if type == stop && rest.last == 0
end
is blocking while loop in its own thread? So it's likely that it subscribes to the channel, then waits for incoming messages. Eventually, it gets a message, calls the callback which attempts to broadcast it on the channel, and that is what triggers to the "too big" error, which it picks up on the next loop?
If that's the case, though, we've attempted sending large messages through to trigger the exception with no luck
Note that it could be a total red herring. When googling for that error, you can see lots of these being resolved in various ways on the server side.
Given that you say the issue started when you upgraded, it makes sense to assume the client doesn't something wrong, but that wrong thing may not be to send a packet too big.
just the redis gem getting bumped up from 4.1.0
Have you tried the latest 4.x? Mostly to try to pinpoint when the issue might have been introduced.
Another good thing to try would be to enable the hiredis
bindings, see if that changes anything. That could help track the source of the issue.
That is a concern, that since the exception being raised is new in this version, that it's something that could have been happening all along and was either being swallowed or surfaced as another error that we haven't noticed is no longer happening.
[1] pry(main)> Redis::Connection.drivers
=> [RedisClient::HiredisConnection]
we're already using the hiredis-client gem.
Is there a way to see if the AC redis instance is using inline? I thought resp was what was always used?
we're already using the hiredis-client gem.
Then the opposite, try to disable it. It's really just a basic bisection to try to narrow down where the bug could be.
Is there a way to see if the AC redis instance is using inline?
I'm not sure what you are referring to by inline
.
Oh wait, Hum, another thing, is there SSL into the mix? I cant remember nor find where I saw that too big inline
request before, but I wonder if it wasn't a client trying to connect to an unsecured server using SSL, and the SSL handshake paquets were tripping the server.
Just a wild guess though.
It does look like the gem uses RESP2
def config(**kwargs)
super(protocol: 2, **kwargs)
end
The error itself says "inline request" and the comments on the method in the redis source say "Like processMultibulkBuffer(), but for the inline protocol instead of RESP" https://github.com/redis/redis/blob/2dd5c3a180a9892c339b66f76cc203f8b0f6fe5a/src/networking.c#L2094-L2117
however, it seems like every client in existence uses resp.
I'll have to check on the SSL issue, but if that was the case, I'd expect it to happen on every message, no?
Oh wait, this is an intermittent issue?
however, it seems like every client in existence uses resp.
Yeah, I have no idea what "inline protocol" is supposed to be never heard of it, and I can confirm that redis-rb
uses RESP2.
Is there a proxy or something between the client and Redis? Or anything else that isn't just a raw standalone redis-server?
Oh wait, this is an intermittent issue?
Oh yes, intermittent and extremely elusive. We've had 0 luck in recreating it on purpose. We see the error 6-8 times a day among thousands and thousands of requests. No other issues we can tell on straight redis calls for caching and resque use, just something with action cable.
We use consul to manage the url, but ATM we only have a single redis server in production.
Hum, that is extremely intriguing. I'll try to find some time to review if it's possible that on 4.x that error was swallowed or retried. That could explain it, e.g. a pre-existing issue with your server, and now we're just reporting it.
It looks like in the previous version, it looks like the errors were being raised in call_loop
in client.rb, but the ProtocolError
child class was overwriting initialize
# Raised by the connection when a protocol error occurs.
class ProtocolError < BaseError
def initialize(reply_type)
super(<<-EOS.gsub(/(?:^|\n)\s*/, " "))
Got '#{reply_type}' as initial reply byte.
If you're in a forking environment, such as Unicorn, you need to
connect to Redis after forking.
EOS
end
end
I'm also seeing this intermittently after upgrading redis 4.8.1
to redis 5.0.6
and switching out hiredis 0.6.3
to hiredis-client 0.14.1
. Same error and backtrace, and it's also rare and difficult to pin down...
Spent the time after my last attempt slow rolling the upgrade. Redis gem up to 4.8.1 and resque up to 2.5.0 with no issues.
We have since moved from our old redis servers to GCP memorystore, which hosts redis_version:6.2.11
In addition, there was a rails upgrade that fixed an ActionCable redis config with sentinel https://github.com/rails/rails/blob/v7.0.6/actioncable/CHANGELOG.md#rails-706-june-29-2023
The final upgrade was swapping out hiredis for hiredis-client and upgrading the redis gem to 5.0.6
Within an hour I had received the error from a very, very small action cable call
hmm, in fact, I can't find us using ActionCable in the call that Airbrake captures as the source at all. Could be some sneaky metaprogramming somewhere, but I'm not finding it.
@Matt-Yorkley, did you ever find any cause? I haven't been able to try again in a while, but I'm working on some more logging in the action cable process to see if I can track it down.
No, Matt didn't solve the issue for us. I just ran CI again on our latest code and still saw this error popping up. Since Sam claimed that hiredis doesn't bring any performance improvement I decided to remove the gem from our code and just use the default redis gem. Let me know if you come to another conclusion.
No, Matt didn't solve the issue for us. I just ran CI again on our latest code and still saw this error popping up. Since Sam claimed that hiredis doesn't bring any performance improvement I decided to remove the gem from our code and just use the default redis gem. Let me know if you come to another conclusion.
I added more logging around what gets passed to the redis gem from action cable, but no smoking guns. Still very intermittent and haven't been able to consistently reproduce it.
Are you saying that removing hiredis eliminated the issue?
Are you saying that removing hiredis eliminated the issue?
Yes, at least in our CI environment. We haven't tested production yet.
Redis 5.3.3 Ruby 3.1.3 Rails 7.0.4.2 redis gem 5.0.6 resque gem 2.4.0
Just upgraded the redis/resque gems and their dependencies and am getting this intermittent error ERROR -- : ERR Protocol error: too big inline request (Redis::CommandError)
The backtrace is not super helpful
seems to indicate that action_cable is involved, but we use it a lot and having trouble tracking it down. I can't reproduce it and it appears on calls that don't trigger any actioncable actions.
The error is generated in redis, which has not changed, but looking at the gem difference, the error being raised in subscribe.rb is new. Am I looking at something that was happening all along, but was being swallowed by earlier versions of the redis gem?