Occasional deadlocks due to Rails convention for using mutexes

machty commented 4 years ago

Rails loads table/column info from the database, and this data is turn used to define dynamic methods backed by database columns. The code that performs this initial column load is wrapped in a mutex, which prevents thread safety bugs, but if you're using async-postgres, there's a nondeterministic chance you'll get bitten by a ThreadError: deadlock; recursive locking error in (at least) the following case:

Fiber 1 initializes/saves an ActiveRecord obj for the first time; the column metadata hasn't been loaded, so it locks the mutex, succeeds, and then goes on to load the column data from postgres, ultimately hitting async_exec, which async-postgres modifies to return control back to the reactor
Fiber 2 initializes/saves an ActiveRecord obj, the column metadata still isn't marked as loaded, so it attempts to lock the mutex, but it hits the deadlock because it's trying to lock the mutex from the same thread.

I'm not sure what the solution is here, wouldn't be surprised if this was biting falcon users too; note that I'm seeing this in a Sidekiq-ish style background worker that I wrote with the following libs:

gem "async", require: false
gem "async-http", require: false
gem 'async-redis', github: "socketry/async-redis", require: false
gem 'async-postgres', require: false

locked at the following versions

    async (1.19.0)
      console (~> 1.0)
      nio4r (~> 2.3)
      timers (~> 4.1)
    async-http (0.46.3)
      async (~> 1.19)
      async-io (~> 1.18)
      protocol-http (~> 0.8.0)
      protocol-http1 (~> 0.8.0)
      protocol-http2 (~> 0.9.0)
    async-io (1.23.3)
      async (~> 1.14)
    async-postgres (0.1.0)
      async (~> 1.3)
      pg

I don't really have a solution at this time, is there perhaps a way I can preload all of these column datas? I looked at the falcon repo but didn't see anyone else running into this particular deadlock.

machty commented 4 years ago

I can't verify this removes all deadlock conditions, but it's a start:

  if Rails.env.production?
    puts "Calling define_attribute_methods on all ActiveRecord models..."
    ActiveRecord::Base.descendants.each(&:define_attribute_methods)
    puts "Calling define_attribute_methods on all ActiveRecord models... DONE"
  end

I added this before starting the async engine. It adds about 4s to startup time on Heroku but seems to work.

ioquatix commented 4 years ago

ActiveRecord is not fiber safe unfortunately.

Using a mutex within fiber code can cause problems like this. @ko1 as we discussed, FYI. Maybe it's necessary to make Mutex fiber aware.

On the other hand, I would also say, that holding a mutex while issuing a database query could be considered a poor usage of mutex, because the duration is so long. I don't believe it's good idea to hold a mutex when doing I/O, although AR metadata load may be a valid use case.

One solution would be to detect when doing I/O within a mutex, and revert back to blocking behaviour. We also considered wrapping I/O operations in some kind of Fiber.exclusive method which can have the same effect.

As you know, Falcon is still experimental system. I appreciate you trying it out and reporting back.

eregon commented 4 years ago

IMHO this is a problem that the Mutex in MRI is currently held per Thread, but should be held per Fiber. Then Fiber 2 in your example would just wait for it, as it should.

The issue is then if those 2 Fibers are from the same Thread, Fiber 2 would block forever since that Mutex is taken by Fiber 1. To solve that I think we need to make Mutex fiber aware, or in the context of @ioquatix scheduler PR, call back to the scheduler if a Mutex cannot be acquired. Then I believe everything would work nicely for this case.

machty commented 4 years ago

What does it mean to "make Mutex fiber aware"? Does that mean changing Mutex in the Ruby standard lib?

eregon commented 4 years ago

I think ideally yes, and https://github.com/ruby/ruby/pull/1870 for example when introducing the scheduler could also call to the scheduler for Mutex#lock if the current Thread has a scheduler (otherwise just normal behavior).

Definitely needs a MRI change to have Mutex owned by Fibers and not Threads.

Not sure what async-postgres/io can do for this, @ioquatix ?

socketry / async-postgres

Occasional deadlocks due to Rails convention for using mutexes #12