que-rb / que

A Ruby job queue that uses PostgreSQL's advisory locks for speed and reliability.
MIT License

Que jobs running concurrently have interference on Rails 5. #166

Closed bramski closed 7 years ago

bramski commented 8 years ago

Hello,

We're seeing a consistently repeatable problem over here that we think stems from a fundamental issue with the MVCC/advisory-lock model in use here.

I see this code, and it doesn't seem to be effective: https://github.com/chanks/que/blob/master/lib/que/job.rb#L88

Screenshot of our logs: (attached)

Our job runner looks like:

require 'que'

module ActiveJob
  module QueueAdapters
    class ImhrJobRunner < ::Que::Job
      # Places the que job in job_data["que_ref"]
      # which can be picked up by ApplicationJob
      def run(job_data)
        @job_data = job_data

        if que_job_done? # we only need to check because Que picks up destroyed jobs.
          log_rerun_issue_info
        else
          job_data["que_ref"] = self
          ApplicationJob.execute job_data
        end

        mark_que_job_as_done # if you reach this point it's done
      end

      public :destroy

      private

      attr_reader :job_data

      def log_rerun_issue_info
        Rails.logger.warn "#{self.class.name}: Que tried to rerun #{job_data["job_class"]}: " \
                          "#{job_data["job_id"]}"
      end

      def que_job_done?
        Rails.cache.exist?(que_job_cache_key)
      end

      def mark_que_job_as_done
        Rails.cache.write(que_job_cache_key, true, expires_in: 10.seconds)
      end

      def que_job_cache_key
        ["que-job-done", job_data["job_id"]].join("-")
      end
    end
  end
end

We've added a Rails cache check (backed by Redis) to ensure the jobs aren't run twice.

bramski commented 8 years ago

Does the "WARNING: you don't own a lock of type ExclusiveLock" have something to do with this?

chanks commented 8 years ago

That warning message would come up if you try to release an advisory lock that hasn't been taken by that connection (it might also come up in other cases). Are you using pgbouncer, or some other connection pool outside the app? That may cause that message. Or, are you doing anything unusual with your PG connections?

Are you sure the job is being destroyed after it is run? Que tries to destroy jobs on its own when you haven't explicitly called destroy, but it uses a @destroyed instance variable on the job instance itself to determine whether that's necessary; could your logic be writing to that variable? Or possibly you're calling destroy in a savepoint that's getting rolled back while the main transaction continues to completion? The fact that you've made destroy public is a little suspicious to me, but I don't know what your actual job logic looks like.
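
For illustration, the savepoint scenario described above would look roughly like this (a hypothetical sketch, not code from this app; job stands for the Que job instance): the DELETE issued by destroy is undone along with the savepoint while the outer transaction still commits, so the row survives and gets picked up again.

ActiveRecord::Base.transaction do                         # outer transaction
  ActiveRecord::Base.transaction(requires_new: true) do   # savepoint
    job.destroy                                           # DELETE runs inside the savepoint
    raise ActiveRecord::Rollback                          # rolls back only the savepoint
  end
  # the outer transaction continues and commits; the DELETE was undone,
  # so the job row still exists and Que can run it again
end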

bramski commented 8 years ago

It's just Que & AR. No pgbouncer. Destroy was made public so that we can put it inside the transaction explicitly and expose it to the AJ job. The transaction isn't cancelling or getting rolled back. In fact in the log you're seeing there is no transaction on the job which is repeating. Adding a transaction to that job made it less consistently repeatable, but still repeatable.

bramski commented 8 years ago

The job is definitely destroyed. We added an AR model check for its existence when the job runs, and it's typically not there (about 80% of the time), but it's still possible that the AR connection thinks the job is there. I'm guessing it's due to the same MVCC problem that the code comments mention. Redis-style locking manages to be consistent here.

joevandyk commented 8 years ago

Is this running in Rails production mode?

bramski commented 8 years ago

Seeing it in staging and production.

bramski commented 8 years ago

We've tuned the Que worker to be more responsive. We were seeing this rather intermittently; after tuning the worker wake interval down to 0.01 we started seeing it much more consistently.

chanks commented 8 years ago

That's not a ridiculous wake interval. I still suspect that there's something unusual about your setup, or perhaps your Postgres configuration? Have you changed the default transaction isolation, or are you somehow running the workers themselves inside of a transaction?

I doubt that there's anything fundamentally unexpected about advisory locks and how they interact with MVCC, or it would have come up sometime in the last three years. Of course, if you had a self-contained reproduction that demonstrated the issue, that would be helpful.

chanks commented 8 years ago

Also, I don't know what's causing those "you don't own a lock of type ExclusiveLock" warnings; they aren't a normal thing and are a sign that something is wrong.

bramski commented 8 years ago

Postgres is vanilla heroku. Version: PostgreSQL 9.5.2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit

We haven't done anything to the transaction isolation. Procfile:

web: bundle exec puma -C config/puma.rb
worker: bundle exec que --worker-count $WORKER_QUE_WORKERS --wake-interval $WORKER_QUE_WAKE_INTERVAL ./config/environment.rb

bramski commented 8 years ago

Postgres Settings:
postgres_settings.txt

It's actually a CSV, so rename it and open it; GH doesn't seem to like CSV files.

bramski commented 8 years ago

Looks like read committed is the default isolation level for the Heroku PG instances.

joevandyk commented 8 years ago

Is there anything in your code/job that could be starting new database connections? Forking processes for example? Or stopping processes?

That warning indicates that the advisory lock is being dropped and so it would make sense that Que would re-run the job.

bramski commented 8 years ago

You are looking at all the code pre-AJ, @joevandyk. Nothing fancy after that, just business logic.

bramski commented 8 years ago

We haven't changed the Que.mode, but we are running a separate que process.

bramski commented 8 years ago

Correction: Que.mode = :off in our staging environment.

bramski commented 8 years ago

I can repro this consistently locally on my own Postgres. With the concurrent worker count set to 1 it doesn't reproduce, so I suspect this is because the unlock is happening on a different connection. I'm on Rails 5.0.0 and using the AR adapter.
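
To illustrate why a cross-connection unlock produces exactly that warning, here is a standalone sketch using the pg gem directly (the database name and lock key are made up); advisory locks belong to the session that takes them:

require "pg"

conn_a = PG.connect(dbname: "que_test")   # hypothetical database
conn_b = PG.connect(dbname: "que_test")

conn_a.exec("SELECT pg_try_advisory_lock(123)")   # session A takes the lock
conn_b.exec("SELECT pg_advisory_unlock(123)")     # session B never took it, so Postgres emits:
# WARNING:  you don't own a lock of type ExclusiveLock
# the unlock returns false and session A still holds the lock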

joevandyk commented 8 years ago

Hm.. I haven't tried Rails 5 yet. Wonder if 5 is doing something different?

bramski commented 8 years ago

WRT the ActiveRecord adapter, it probably is.

bramski commented 8 years ago

It appears that work checks out a connection... but every internal call to execute checks out a new connection as well.

chanks commented 8 years ago

Sounds like our interface to the connection pool broke somehow under Rails 5.

bramski commented 8 years ago

Rails 5 has taken a more predictable approach than previous comments led me to think: every time you check out, it gives you a different connection.

bramski commented 8 years ago

And the pool checks out new connections for different threads, but then gives whichever connection is available to your thread on the next checkout. So the more threads you have, the less likely the next checkout will give you the same connection.
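
A quick sketch of that pool behavior (not code from this app; assumes a stock ActiveRecord connection pool):

pool = ActiveRecord::Base.connection_pool

conn_a = pool.checkout    # explicitly take a connection out of the pool
pool.checkin(conn_a)      # hand it back; any thread may grab it now
conn_b = pool.checkout    # may or may not be conn_a, depending on other threads
pool.checkin(conn_b)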

bramski commented 8 years ago

I'll do my best to refactor and issue a PR that makes sense.

chanks commented 8 years ago

I'm on my phone so I can't check, but the AR adapter uses with_connection, right? Does that not do the same thing anymore? Or is it no longer reentrant?

bramski commented 8 years ago

It appears to no longer be reentrant.

bramski commented 8 years ago

I've attempted it with ActiveRecord::Base.connection and that doesn't seem to yield the proper result either. The connection pool reaper will recycle that connection if your worker does blocking IO.

bramski commented 8 years ago

Ensuring that the worker loop uses the same connection seems paramount to the advisory locks actually working. Why not refactor the code in a way that guarantees that behavior?
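
For what that refactor would mean in practice, a rough sketch (not Que's actual worker; job_id and run_job are placeholders) is holding one connection for the whole lock/run/unlock cycle:

ActiveRecord::Base.connection_pool.with_connection do |conn|
  pg = conn.raw_connection                               # the underlying PG::Connection
  locked = pg.exec("SELECT pg_try_advisory_lock(#{job_id})").getvalue(0, 0) == "t"
  if locked
    begin
      run_job                                            # placeholder for executing the job
    ensure
      pg.exec("SELECT pg_advisory_unlock(#{job_id})")    # released on the same session that locked
    end
  end
end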

joevandyk commented 8 years ago

Might be worth setting up a concurrent integration test somehow.

bramski commented 8 years ago

https://github.com/chanks/que/pull/167 should fix this. Not my favorite fix, but it avoids a major refactor that would be needed to prevent all of the re-entrant checkouts currently done on the adapter.

bramski commented 8 years ago

There are other problems in master that prevent that fix from working well. Internal errors cause the connections not to be cleaned up, and then the pool is stuck without a way to check out a new connection. Please let me know whether you guys will have time to fix these problems, or else I will have to take this work up myself.

chanks commented 8 years ago

I've long since started moving apps off of ActiveRecord, and I'm not familiar with the changes that have gone into it for 5.0 (I've been relying on pull requests for Rails-specific functionality for a while now). I'm also not really willing to completely refactor the worker system to support it, when a reentrant method that yields a connection for the current thread is a much simpler API to rely on (and one that 1.0 relies even more heavily on, btw).

That said, there must be some way to get this behavior from ActiveRecord 5.0 without too much code.
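
As a sense of how small that could be, a minimal sketch (checkout_activerecord_connection is a hypothetical name, not Que's actual adapter method, and it leans on with_connection being reentrant, which is exactly the open question here):

def checkout_activerecord_connection
  ActiveRecord::Base.connection_pool.with_connection do |conn|
    yield conn.raw_connection   # hand the underlying PG::Connection to the caller
  end
end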

bramski commented 8 years ago

This issue just ran my Heroku PG instance out of memory, at about 3am. Not awesome. Any update, @chanks?

chanks commented 8 years ago

Haven't had time to look into this. Input from someone who knows ActiveRecord 5 internals would be appreciated.

ericboehs commented 8 years ago

I'm seeing this problem as well. I really enjoy Que + AR and hope this can be resolved. @bramski Have you had any luck?

bramski commented 8 years ago

Zero. We had to write a Redis workaround. We will switch to RabbitMQ. No solutions presented here for Rails 5.

9mm commented 8 years ago

Damn... well this is really sad. I was really looking forward to using this on a really important project.

@bramski so your PR doesn't actually help the issue then?

bramski commented 8 years ago

Nope. We moved to sidekiq.

9mm commented 8 years ago

Is there a way to do super reliable exactly-once processing with Sidekiq? Everything I read makes it sound amazing for performance and whatnot, but I have stuff that's absolutely critical to be run on time, reliably (dealing with payment processing). That's why I liked the ACID guarantees of Que (and Delayed Job is too slow).
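
For context, the transactional-enqueue pattern is what gives Que its ACID appeal for cases like this; a rough sketch (Payment and ChargeJob are made-up names, and it assumes Que is using its ActiveRecord adapter so the enqueue shares the transaction):

ActiveRecord::Base.transaction do
  payment = Payment.create!(amount_cents: 1000)
  ChargeJob.enqueue(payment.id)   # the job row commits or rolls back with the payment
end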

bramski commented 8 years ago

Not of importance to us.

chanks commented 8 years ago

Sidekiq Pro offers this guarantee. I've never dug that far into its implementation, but we use it in an app at my day job and there haven't been problems with it that I recall.

If you're interested in a job queue with a larger community that can offer better and easier Rails/ActiveRecord integration, and ACID guarantees aren't important to you, it's a great choice!

bgentry commented 8 years ago

The Rails docs still suggest that the with_connection behavior is unchanged. The only changes to that code I see are from @thedarkone, but I don't see any obvious problems that would have broken this behavior. https://github.com/rails/rails/commit/603fe20c0b8c05bc1cca8d01caadbd7060518141

thedarkone commented 8 years ago

Since I've been pinged: with_connection is reentrant.

ActiveRecord::Base.connection_pool.with_connection do |conn_outer|
  ActiveRecord::Base.connection_pool.with_connection do |conn_inner|
    conn_outer == conn_inner # => true

    ActiveRecord::Base.connection_pool.with_connection do |another_conn_inner|
      conn_outer == another_conn_inner # => true
      conn_outer == conn_inner # => true
      conn_inner == another_conn_inner # => true
    end
  end

  ActiveRecord::Base.connection_pool.with_connection do |repeated_conn_inner|
    conn_outer == repeated_conn_inner # => true
  end
end

Leaving the outer-most with_connection doesn't guarantee that next time the pool will provide the same connection (obviously):

ActiveRecord::Base.connection_pool.with_connection do |conn|
  @leaked_conn = conn
end

ActiveRecord::Base.connection_pool.with_connection do |conn|
  @leaked_conn == conn # => might be true or false, depends on the conn pool state/other threads etc.
end

Note that manually returning a conn to the pool breaks the with_connection contract:

ActiveRecord::Base.connection_pool.with_connection do |conn|
  # breaking with_connection contract:
  ActiveRecord::Base.connection_pool.checkin conn
  # see also pool.release_connection, pool.remove(conn), or ActiveRecord::Base.clear_active_connections!

  ActiveRecord::Base.connection_pool.with_connection do |conn_inner|
    conn_inner == conn # => might be true or false, depends on the conn pool state/other threads etc.
  end
end

My best guess at what is happening: third-party or app code is manually releasing/returning the connection to the pool.

chanks commented 8 years ago

Hmm, that's interesting. Thanks for the reassurance, @thedarkone.

This doesn't seem to be a Que issue, so I'm going to close it. If anyone has any ideas on how to (cleanly) prevent this kind of occurrence, I'm open to hearing them, but I think it's reasonable for us to assume that with_connection is going to be both reentrant and hold the connection for the length of the block, since it's designed that way. Anyone else's code that breaks that contract is probably behaving irresponsibly, I would say.

bramski commented 8 years ago

This is reproducible with vanilla Rails 5 and Que. Why would you close it? The issue is not fixed.

chanks commented 8 years ago

Has anyone put together a reproduction with vanilla Rails 5?

bramski commented 8 years ago

The original issue this was linked to was vanilla AR with Que.

chanks commented 8 years ago

And no other gems? It looks like there was something in your setup that was breaking the with_connection contract.

thedarkone commented 8 years ago

I know nothing about Que; @chanks, can you test it? Or if I wanted to poke at this, all I need is a Postgres-based vanilla Rails 5 app, and then I just follow the README.md's Installation/Usage sections, right?

chanks commented 8 years ago

@thedarkone yes, it should be pretty straightforward, though I haven't actually used Rails 5 myself, so it's possible something's missing.