openfoodfoundation / openfoodnetwork

Connect suppliers, distributors and consumers to trade local produce.
https://www.openfoodnetwork.org
GNU Affero General Public License v3.0
1.1k stars 714 forks source link

ActiveRecord::Deadlocked in api/v0/variants#create #12056

Open mkllnk opened 8 months ago

mkllnk commented 8 months ago

Description

Trying to create multiple variants simultaneously can fail. When importing and updating 800 products through n8n, it always failed at some point.

Expected Behavior

All requests succeed without failure.

Actual Behaviour

Deadlock error described below.

Steps to Reproduce

  1. Execute n8n workflow to create hundreds of products without batch size option. Example: https://n8n.openfoodnetwork.org.uk/workflow/128

Animated Gif/Screenshot

Workaround

Send requests in sequential order and wait for the result before sending the next request. This workaround is limited because you can't control other people's requests. You have to be the only one uploading for one enterprise at a time.

Severity

Your Environment

Possible Fix

We can lock required records before the execution. Or we could rework our caching logic to not need as many touch events.

Error in OpenFoodNetwork Aus

ActiveRecord::Deadlocked in api/v0/variants#create PG::TRDeadlockDetected: ERROR: deadlock detected DETAIL: Process 6375 waits for ShareLock on transaction 35670285; blocked by process 6378. Process 6378 waits for ShareLock on transaction 35670288; blocked by process 890. Process 890 waits for ExclusiveLock on tuple (474,7) of relation 18092 of database 18001; blocked by process 6375. HINT: See server log for query details. CONTEXT: while updating tuple (474,7) in relation "enterprises"

View on Bugsnag

Stacktrace

app/controllers/api/v0/variants_controller.rb:24 - create

View full stacktrace

Created by Maikel via Bugsnag

mkllnk commented 8 months ago

This issue has been linked to an error in Bugsnag ActiveRecord::Deadlocked in api/v0/variants#create

This is the other half of the deadlock:

  # app/models/enterprise.rb:586
  def touch_distributors
    Enterprise.distributing_products(supplied_products.select(:id)).
      where.not(enterprises: { id: }).
      update_all(updated_at: Time.zone.now)
  end
cyrillefr commented 6 months ago

Hello @mkllnk ,

I would like to help, even though I don't know if I can reproduce and work locally (I have no n8n & bugsnag account). Is it possible ?

mkllnk commented 6 months ago

Yes, you can help. The error appears when two API requests try to create a variant for the same product at the same time.

If you want to reproduce it properly, you can write a spec that sets a breakpoint and starts two threads. Here's an example: https://github.com/openfoodfoundation/openfoodnetwork/blob/af9f07f496052ff77b033bd8e6e49e8803e89f0f/spec/jobs/subscription_placement_job_spec.rb#L183-L227

cyrillefr commented 6 months ago

Hello @mkllnk ,

I have tried to reproduce it but did not succeed.

If I launch 1500 threads of a POST on v0/variant#create, I do have an error, but not this one :)

If I launch 1200 threads, the create succeeds, but I do have some missing variants though. And if I have in_batches in app/models/enterprise.rb in the touch_distributors method, I do get all my 1200 variants (though I don't know why some are missing if I don't batche, no error raised).

But I cannot replicate the deadlock. I don't understand how I can use a mutex to have 2 lines trying to get written at the exact same time in the DB and raise a Deadlock. (If I remove the mutexes from your example, test still passes).

From what I have read on stackoverflow, with an error like that, in_batches could protect from the deadlock.

mkllnk commented 6 months ago

Did you set the flag concurrency: true on the spec? Without that flag, RSpec runs every test example in a transaction and since the transaction is rolled back at the end of the spec, the data is never committed to the database and conflicts never appear. With that flag, we deactivate the transactions and the threads can run into race conditions. The trick is for both threads to pause within their own transaction and then release the lock.

In my example, modifying the spec doesn't necessarily make it fail. You need to change the code to make it fail:

diff --git a/app/services/place_proxy_order.rb b/app/services/place_proxy_order.rb
index aaa88f620d..fabcefa83b 100644
--- a/app/services/place_proxy_order.rb
+++ b/app/services/place_proxy_order.rb
@@ -35,11 +35,11 @@ class PlaceProxyOrder
   attr_accessor :order

   def place_order
-    proxy_order.with_lock do
+    #proxy_order.with_lock do
       return if proxy_order.placed_at.present?

       initialise_order
-    end
+    #end
   end

   def initialise_order

This spec just checks that only one order is placed though. It doesn't test for a deadlock. That would appear if you have many proxy orders and if they got loaded in a random order. The current code contains order(:id) though to guarantee deterministic execution and prevent deadlocks.

cyrillefr commented 6 months ago

Hello @mkllnk , thanks for the tip, I understand the why of your code.

But here, I have no clue of how to get to a race condition. A creation of a variant touches a product which touches an Enterprise. All this with callbacks. I guess these operations must be synchronized and at one moment it is not the case in production. But I don't know what to do with the final update_all in Enterprise. I don't know how to spot an inaccuracy. Would there be some more info in logs ?

mkllnk commented 4 months ago

Sorry for the long delay on this. We don't have any more information in the logs. I do know that an update_all can be problematic though. If you don't apply any ordering to the records and you update, for example, records 1, 2 and 3 then one update call may tray to update 1, 3, 2 and another may update 3, 1, 2. If that happens at the same time then one process locks row 1 while the other locks row 3. Then the first process wants to lock row 3 but needs to wait for the other process, already locked. The other process wants to lock row 1 but that's locked by the first process and we get a deadlock.

This can be resolved by first locking the rows in order. I don't know if you can apply order by in an update call, I guess not. So you first need to do a select with a lock for writing and then you can update.