resque / resque-scheduler

A light-weight job scheduling system built on top of Resque
MIT License
1.74k stars 481 forks source link

Redis 5 API for multi/pipelined commands changes transaction behavior in resque-scheduler 4.9+ #787

Open ajvondrak opened 8 months ago

ajvondrak commented 8 months ago

When upgrading to resque-scheduler 4.9.0 and above with redis-rb 4.x, we still get the deprecation notices related to https://github.com/resque/resque/issues/1794 and https://github.com/resque/resque-scheduler/issues/745.

Repro:

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "resque", "~> 2.6"
  gem "resque-scheduler", "~> 4.9"
  gem "redis", "~> 4.0"
end

Resque.redis = Redis.new # NOTE: need a fresh redis-server running locally

class Job
  @queue = "jobs"

  def self.perform
    puts "performed job"
  end
end

Resque.enqueue_in(1, Job)
sleep(1)
Resque::Scheduler.handle_delayed_items
puts "enqueued delayed items"
Resque::Worker.new("jobs").work(0)
$ ruby repro.rb
resque-scheduler: [INFO] 2023-12-27T11:11:35-08:00: Processing Delayed Items
Redis#srem will always return an Integer in Redis 5.0.0. Use Redis#srem? instead.(called from: /Users/alex.vondrak/.local/share/rtx/installs/ruby/3.2.2/lib/ruby/gems/3.2.0/gems/redis-namespace-1.11.0/lib/redis/namespace.rb:564:in `wrapped_send')
Pipelining commands on a Redis instance is deprecated and will be removed in Redis 5.0.0.

redis.multi do
  redis.get("key")
end

should be replaced by

redis.multi do |pipeline|
  pipeline.get("key")
end

(called from /Users/alex.vondrak/.local/share/rtx/installs/ruby/3.2.2/lib/ruby/gems/3.2.0/gems/redis-namespace-1.11.0/lib/redis/namespace.rb:564:in `wrapped_send'}
enqueued delayed items
performed job

After enough poking around through stack traces, I believe I've discovered the issue, and it's a subtle interaction between the resque & resque-scheduler gems:

  1. resque-scheduler calls Resque.redis.multi here: https://github.com/resque/resque-scheduler/blob/462e33e48a799f3bd89c305e1f6fe24996810bb5/lib/resque/scheduler.rb#L253-L259
  2. through a chain of calls, this hits Resque::Job.create: https://github.com/resque/resque-scheduler/blob/462e33e48a799f3bd89c305e1f6fe24996810bb5/lib/resque/scheduler.rb#L327
  3. resque defines Job.create in such a way that we effectively make a call to Resque.redis.pipelined: https://github.com/resque/resque/blob/2f9d080ce86eb2e3f1f3d47599a21c576124c6f3/lib/resque/data_store.rb#L105-L108

Copying the shape of the above into a direct repro, we're basically doing

# Causes deprecation warning
Resque.redis.multi do |m|
  Resque.redis.pipelined do |p|
    # ...
  end
end

instead of what the redis gem wants, which is

# Does not cause deprecation warning
Resque.redis.multi do |m|
  m.pipelined do |p|
    # ...
  end
end

It might be hard to actually achieve the latter, though, since the code is straddling two gems. 😕

We were not getting this warning prior to resque-scheduler 4.9 since the outer multi was added by https://github.com/resque/resque-scheduler/pull/767 (which also causes other nested transaction issues, like in https://github.com/resque/resque-scheduler/issues/773).

It seems like this still works circa redis 5.x. At least when switching the version in the above repro, I get:

$ ruby repro.rb
resque-scheduler: [INFO] 2023-12-27T11:14:06-08:00: Processing Delayed Items
enqueued delayed items
performed job

I think the deprecation may just kind of be firing in a wonky way due to how redis 4.x tries to juggle state.

Circa 4.x:

Circa 5.x:

~However, it's unclear to me what the interaction will be between two "top level" calls to these transaction commands. I.e., will the futures fire with the desired timing, or do they go out of sync in some way that may cause an issue?~ EDIT: per investigation below, it seems that the "nested" pipeline actually silently fires outside of the new MULTI/EXEC transaction around the batch; see https://github.com/resque/resque-scheduler/issues/787#issuecomment-1870717786 for details.

ajvondrak commented 8 months ago

However, it's unclear to me what the interaction will be between two "top level" calls to these transaction commands. I.e., will the futures fire with the desired timing, or do they go out of sync in some way that may cause an issue?

I think I figured this part out by adding some Redis logging to the repro script:

repro.rb ```ruby require "bundler/inline" gemfile do source "https://rubygems.org" gem "resque", "~> 2.6" gem "resque-scheduler", "~> 4.9" gem "redis", "~> #{ENV["REDIS_VERSION"]}.0" end case ENV["REDIS_VERSION"] when "4" $stdout.sync = true logger = Logger.new($stdout, level: Logger::DEBUG) Resque.redis = Redis.new(logger: logger) when "5" RedisClient.register(Module.new do def call(command, ...) puts "[call] #{command.flatten.join(" ")}" super end def call_pipelined(commands, ...) puts "[call_pipelined] #{commands.flatten.join(" ")}" super end end) Resque.redis = Redis.new end class Job @queue = "jobs" end Resque.redis.redis.flushdb # to ensure we start with a blank slate Resque.enqueue_in(1, Job) sleep(1) Resque::Scheduler.handle_delayed_items ```
Redis 4 ```console $ REDIS_VERSION=4 ruby repro.rb D, [2023-12-27T16:17:24.233121 #49036] DEBUG -- : [Redis] command=FLUSHDB args= D, [2023-12-27T16:17:24.234243 #49036] DEBUG -- : [Redis] call_time=0.80 ms D, [2023-12-27T16:17:24.234376 #49036] DEBUG -- : [Redis] command=RPUSH args="resque:delayed:1703722645" "{\"class\":\"Job\",\"args\":[],\"queue\":\"jobs\"}" D, [2023-12-27T16:17:24.234501 #49036] DEBUG -- : [Redis] call_time=0.11 ms D, [2023-12-27T16:17:24.234597 #49036] DEBUG -- : [Redis] command=SADD args="resque:timestamps:{\"class\":\"Job\",\"args\":[],\"queue\":\"jobs\"}" ["delayed:1703722645"] D, [2023-12-27T16:17:24.234710 #49036] DEBUG -- : [Redis] call_time=0.10 ms D, [2023-12-27T16:17:24.234730 #49036] DEBUG -- : [Redis] command=ZADD args="resque:delayed_queue_schedule" 1703722645 1703722645 D, [2023-12-27T16:17:24.234850 #49036] DEBUG -- : [Redis] call_time=0.11 ms D, [2023-12-27T16:17:25.238398 #49036] DEBUG -- : [Redis] command=ZRANGEBYSCORE args="resque:delayed_queue_schedule" "-inf" 1703722645 "LIMIT" 0 1 D, [2023-12-27T16:17:25.241675 #49036] DEBUG -- : [Redis] call_time=2.68 ms resque-scheduler: [INFO] 2023-12-27T16:17:25-08:00: Processing Delayed Items D, [2023-12-27T16:17:25.245827 #49036] DEBUG -- : [Redis] command=INFO args= D, [2023-12-27T16:17:25.246626 #49036] DEBUG -- : [Redis] call_time=0.69 ms D, [2023-12-27T16:17:25.255685 #49036] DEBUG -- : [Redis] command=SCRIPT args="load" "if redis.call('SETNX', KEYS[1], ARGV[1]) == 1\nthen\n redis.call('EXPIRE', KEYS[1], 180)\n return 1\nelse\n return 0\nend\n" D, [2023-12-27T16:17:25.256571 #49036] DEBUG -- : [Redis] call_time=0.74 ms D, [2023-12-27T16:17:25.256740 #49036] DEBUG -- : [Redis] command=EVALSHA args="8f4a4e422bfa8f9a0931e350d6e30b1c3ce97c33" 1 "resque:resque:resque_scheduler_master_lock" "localhost:49036" D, [2023-12-27T16:17:25.257156 #49036] DEBUG -- : [Redis] call_time=0.38 ms D, [2023-12-27T16:17:25.257338 #49036] DEBUG -- : [Redis] command=LRANGE args="resque:delayed:1703722645" 0 99 D, [2023-12-27T16:17:25.257677 #49036] DEBUG -- : [Redis] call_time=0.30 ms D, [2023-12-27T16:17:25.257812 #49036] DEBUG -- : [Redis] command=WATCH args="resque:delayed:1703722645" D, [2023-12-27T16:17:25.258041 #49036] DEBUG -- : [Redis] call_time=0.20 ms Redis#srem will always return an Integer in Redis 5.0.0. Use Redis#srem? instead.(called from: /Users/alex.vondrak/.local/share/rtx/installs/ruby/3.2.2/lib/ruby/gems/3.2.0/gems/redis-namespace-1.11.0/lib/redis/namespace.rb:564:in `wrapped_send') Pipelining commands on a Redis instance is deprecated and will be removed in Redis 5.0.0. redis.multi do redis.get("key") end should be replaced by redis.multi do |pipeline| pipeline.get("key") end (called from /Users/alex.vondrak/.local/share/rtx/installs/ruby/3.2.2/lib/ruby/gems/3.2.0/gems/redis-namespace-1.11.0/lib/redis/namespace.rb:564:in `wrapped_send'} D, [2023-12-27T16:17:25.262883 #49036] DEBUG -- : [Redis] command=MULTI args= D, [2023-12-27T16:17:25.262965 #49036] DEBUG -- : [Redis] command=SREM args="resque:timestamps:{\"class\":\"Job\",\"args\":[],\"queue\":\"jobs\"}" "delayed:1703722645" D, [2023-12-27T16:17:25.263009 #49036] DEBUG -- : [Redis] command=SADD args="resque:queues" ["jobs"] D, [2023-12-27T16:17:25.263042 #49036] DEBUG -- : [Redis] command=RPUSH args="resque:queue:jobs" "{\"class\":\"Job\",\"args\":[]}" D, [2023-12-27T16:17:25.263076 #49036] DEBUG -- : [Redis] command=LTRIM args="resque:delayed:1703722645" 100 -1 D, [2023-12-27T16:17:25.263101 #49036] DEBUG -- : [Redis] command=EXEC args= D, [2023-12-27T16:17:25.263974 #49036] DEBUG -- : [Redis] call_time=0.82 ms D, [2023-12-27T16:17:25.264210 #49036] DEBUG -- : [Redis] command=WATCH args="resque:delayed:1703722645" D, [2023-12-27T16:17:25.264548 #49036] DEBUG -- : [Redis] call_time=0.30 ms D, [2023-12-27T16:17:25.264617 #49036] DEBUG -- : [Redis] command=LLEN args="resque:delayed:1703722645" D, [2023-12-27T16:17:25.264860 #49036] DEBUG -- : [Redis] call_time=0.22 ms D, [2023-12-27T16:17:25.264995 #49036] DEBUG -- : [Redis] command=MULTI args= D, [2023-12-27T16:17:25.265023 #49036] DEBUG -- : [Redis] command=DEL args="resque:delayed:1703722645" D, [2023-12-27T16:17:25.265051 #49036] DEBUG -- : [Redis] command=ZREM args="resque:delayed_queue_schedule" 1703722645 D, [2023-12-27T16:17:25.265073 #49036] DEBUG -- : [Redis] command=EXEC args= D, [2023-12-27T16:17:25.265405 #49036] DEBUG -- : [Redis] call_time=0.30 ms D, [2023-12-27T16:17:25.265527 #49036] DEBUG -- : [Redis] command=ZRANGEBYSCORE args="resque:delayed_queue_schedule" "-inf" 1703722645 "LIMIT" 0 1 D, [2023-12-27T16:17:25.265754 #49036] DEBUG -- : [Redis] call_time=0.20 ms ```
Redis 5 ```console $ REDIS_VERSION=5 ruby repro.rb [call] flushdb [call] rpush resque:delayed:1703722657 {"class":"Job","args":[],"queue":"jobs"} [call] sadd resque:timestamps:{"class":"Job","args":[],"queue":"jobs"} delayed:1703722657 [call] zadd resque:delayed_queue_schedule 1703722657 1703722657 [call] zrangebyscore resque:delayed_queue_schedule -inf 1703722657 LIMIT 0 1 resque-scheduler: [INFO] 2023-12-27T16:17:37-08:00: Processing Delayed Items [call] info [call] script load if redis.call('SETNX', KEYS[1], ARGV[1]) == 1 then redis.call('EXPIRE', KEYS[1], 180) return 1 else return 0 end [call] evalsha 8f4a4e422bfa8f9a0931e350d6e30b1c3ce97c33 1 resque:resque:resque_scheduler_master_lock localhost:49123 [call] lrange resque:delayed:1703722657 0 99 [call] watch resque:delayed:1703722657 [call_pipelined] sadd resque:queues jobs rpush resque:queue:jobs {"class":"Job","args":[]} [call_pipelined] MULTI srem resque:timestamps:{"class":"Job","args":[],"queue":"jobs"} delayed:1703722657 ltrim resque:delayed:1703722657 100 -1 EXEC [call] watch resque:delayed:1703722657 [call] llen resque:delayed:1703722657 [call_pipelined] MULTI del resque:delayed:1703722657 zrem resque:delayed_queue_schedule 1703722657 EXEC [call] zrangebyscore resque:delayed_queue_schedule -inf 1703722657 LIMIT 0 1 ```

Annoyingly, the log formats aren't the same between versions. But the salient point is that the Redis commands are actually executed a bit differently in Redis 4 vs Redis 5.

With Redis 4, the enqueue in the nested pipeline (from within the resque gem) gets added to the MULTI/EXEC already in progress that resque-scheduler is wrapping around each batch. Thus, the queue push happens inside of the transaction:

MULTI
SREM resque:timestamps:{"class":"Job","args":[],"queue":"jobs"} delayed:1703722645
SADD resque:queues jobs
RPUSH resque:queue:jobs {"class":"Job","args":[]}
LTRIM resque:delayed:1703722645 100 -1
EXEC

With Redis 5, the nested pipeline appears to fire eagerly, since it's effectively being called on the top-level like Resque.redis.pipelined { ... } instead of on the block argument like Resque.redis.multi { |m| m.pipelined { ... } }. Thus, the queue push happens outside of the transaction:

SADD resque:queues jobs
RPUSH resque:queue:jobs {"class":"Job","args":[]}
MULTI
SREM resque:timestamps:{"class":"Job","args":[],"queue":"jobs"} delayed:1703722657
LTRIM resque:delayed:1703722657 100 -1
EXEC

Note that the SADD + RPUSH are still sent as one unit in a pipelined command, but they're still apparently sent before the MULTI/EXEC block, meaning that the enqueue could still go through even if the schedule bookkeeping gets rolled back.

I'm guessing that this is not the desired behavior. I'm not totally sure of the implications, but it seems fairly serious. As such, I'll amend the title to reflect the broader nature of the issue.

murdoch commented 6 months ago

Is there any update on how to fix this issue? I've found that setting Redis.silence_deprecations = true does not resolve it.

PS - using the following:

alxckn commented 1 month ago

We've been observing job duplication on our app (with redis-rb 5.x), very probably related to this issue. As mentioned by @ajvondrak this seems like a serious problem indeed.

Fixing this issue would require patching both resque and resque-scheduler to pass down the transaction object, so quite invasive and complex... This would not solve issues for enqueue hooks that use redis though. Perhaps a revert of https://github.com/resque/resque-scheduler/pull/767 should be considered?