sul-dlss / dor-services-app

A Rails application exposing Digital Object Registry functions as a RESTful HTTP API
https://sul-dlss.github.io/dor-services-app/

856/Folio: when releasing more than one druid linked to the same Folio HRID, only one job actually succeeds #4454

Closed andrewjbtw closed 1 year ago

andrewjbtw commented 1 year ago

The digitized serials workflow involves associating multiple druids with a single catalog record. When releasing more than one druid at the same time (via bulk action) to the same catalog record, only one of the jobs succeeds. The other hits an error.

Example: these two druids are associated with the same catalog record https://argo-qa.stanford.edu/catalog?f%5Bexploded_tag_ssim%5D%5B%5D=856+%3A+serial&f%5Btag_ssim%5D%5B%5D=Project+%3A+856

To release at the same time:

It looks like this is happening for both "release" and "unrelease". I'm guessing this is because only one edit can be made at a time.

mjgiarlo commented 1 year ago

Relevant HB alert: https://app.honeybadger.io/projects/50568/faults/94638749

andrewjbtw commented 1 year ago

This has stopped working. If I release the two items in the example in the ticket, only one succeeds. The other one works if I wait and then re-release. I see the same problem with un-releasing. Only one job succeeds.

mjgiarlo commented 1 year ago

@andrewjbtw Well, that wasn't supposed to happen. 😆

I'll re-open and check it out after lunch. Should I use the same test objects above?

mjgiarlo commented 1 year ago

@andrewjbtw I tried the release bulk action on the above two test objects and it seemed to work fine (releaseWF in both completed). Should I use diff test objects? Or run it n times in a row?

andrewjbtw commented 1 year ago

Let me try it with just those two. I was trying it with a larger batch of unrelated records and it never worked that way.

andrewjbtw commented 1 year ago

I just tried a few times, and only once did both druids get their 856s successfully created from the same bulk job. More commonly, only one druid got an 856 on release, or only one 856 was removed going the other direction. In between tests, I did individual druid releases to bring them back into sync: both released, or both not released.

The one that succeeded was the job that seemed to take the longest, measured by how long the druid had status "started" for update-marc. Maybe it needs more retries? Or a setting where it retries until it recognizes success?

mjgiarlo commented 1 year ago

@andrewjbtw OK, I'll keep using these test objects and see if I can get them into the bad state. Yeah, I was already thinking this might be in the "bump the timeout value" territory. Thanks for your testing.

andrewjbtw commented 1 year ago

Surprisingly, I never saw an error. The only sign was what happened to the 856.

mjgiarlo commented 1 year ago

Now that I think I understand what you're seeing, I wonder if this isn't about needing to change a timeout but about Folio locking not working as we'd expect.

mjgiarlo commented 1 year ago

How to Reproduce the Bug

Via Argo UI

  1. Search for Argo QA items that share a catalog record id (example: a666)
  2. If the items have already been released to SearchWorks, initiate the unrelease bulk action. If they're unreleased, initiate the release bulk action. See steps in PR description
  3. Watch the bulk action complete with no errors
  4. The items will now run through the release workflow. Watch that update live in the workflow grid view
  5. When the release workflow has completed for both items, look at which 856s made it into Folio. Notice they're not the ones you expected.

Via DSA Console

☝🏻 is a lot of clicking and waiting, so I worked on a DSA branch to make this a little quicker and easier to test. That's concurrent-856-edits#4454, which you can deploy to QA. Once deployed, SSH to any DSA QA box and jump into the Rails console. Once there, paste in this Ruby code to make it easier to tweak the two test objects (by flipping the SW release flag on and off and storing the changed Cocina):

# Cocina models are immutable, so .new returns a copy with the given attribute changed.
# This flips the release flag on the most recent release tag.
def flip_last_release_tag(release_tags)
  release_tags[0...-1] + [release_tags.last.new(release: !release_tags.last.release)]
end

# Flip the SW release flag on both test objects and persist the updated Cocina.
def release_unrelease!
  cocina_object1 = CocinaObjectStore.find('druid:py799sb2220')
  cocina_object2 = CocinaObjectStore.find('druid:dv133mk4688')

  CocinaObjectStore.store(
    cocina_object1.new(
      administrative: cocina_object1.administrative.new(
        releaseTags: flip_last_release_tag(cocina_object1.administrative.releaseTags)
      )
    ), skip_lock: false
  )

  CocinaObjectStore.store(
    cocina_object2.new(
      administrative: cocina_object2.administrative.new(
        releaseTags: flip_last_release_tag(cocina_object2.administrative.releaseTags)
      )
    ), skip_lock: false
  )

  puts "Setting release to: #{[CocinaObjectStore.find('druid:py799sb2220').administrative.releaseTags.last.release, CocinaObjectStore.find('druid:dv133mk4688').administrative.releaseTags.last.release]}"
end

And then paste in this code that flips the Cocina release tag and then reproduces the problem:

release_unrelease!

Parallel.map(
  [CocinaObjectStore.find('druid:py799sb2220'),
   CocinaObjectStore.find('druid:dv133mk4688')]
) do |cocina_object|
  Catalog::UpdateMarc856RecordService.update(cocina_object, thumbnail_service: ThumbnailService.new(cocina_object))
end

Assuming you deployed the concurrent-856-edits#4454 branch, you should get diagnostic output like the following, if unreleasing:

Setting release to: [false, false]

druid:dv133mk4688: writing to Folio
druid:py799sb2220: writing to Folio
druid:dv133mk4688: initial state has these 856s:
[]
druid:dv133mk4688: removing this 856
[]
druid:dv133mk4688: unreleasing to SW, so not appending
druid:dv133mk4688: new state has these 856s
[]
druid:py799sb2220: initial state has these 856s:
[]
druid:py799sb2220: removing this 856
[]
druid:py799sb2220: unreleasing to SW, so not appending
druid:py799sb2220: new state has these 856s
[]
druid:dv133mk4688: finished updating
Retrying Folio client operation for druid:py799sb2220 (1 tries)
druid:py799sb2220: exception: Net::ReadTimeout with #<TCPSocket:(closed)>
druid:py799sb2220: initial state has these 856s:
[]
druid:py799sb2220: removing this 856
[]
druid:py799sb2220: unreleasing to SW, so not appending
druid:py799sb2220: new state has these 856s
[]
druid:py799sb2220: finished updating

or like this if releasing:

Setting release to: [true, true]

druid:py799sb2220: writing to Folio
druid:dv133mk4688: writing to Folio
druid:dv133mk4688: initial state has these 856s:
[]
druid:dv133mk4688: removing this 856
[]
druid:dv133mk4688: releasing to SW, so appending {"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/dv133mk4688 $x SDR-PURL $x item $x file:dv133mk4688%2Fpage_0001.jp2 $x label:Volume 1, number 1, October 1972 $x sort:1.1 $x rights:group=stanford"}
druid:dv133mk4688: new state has these 856s
[{"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/dv133mk4688 $x SDR-PURL $x item $x file:dv133mk4688%2Fpage_0001.jp2 $x label:Volume 1, number 1, October 1972 $x sort:1.1 $x rights:group=stanford"}]
druid:py799sb2220: initial state has these 856s:
[]
druid:py799sb2220: removing this 856
[]
druid:py799sb2220: releasing to SW, so appending {"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford"}
druid:py799sb2220: new state has these 856s
[{"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford"}]
druid:py799sb2220: finished updating
Retrying Folio client operation for druid:dv133mk4688 (1 tries)
druid:dv133mk4688: exception: Net::ReadTimeout with #<TCPSocket:(closed)>
druid:dv133mk4688: initial state has these 856s:
[{"tag"=>"856", "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/dv133mk4688 $x SDR-PURL $x item $x file:dv133mk4688%2Fpage_0001.jp2 $x label:Volume 1, number 1, October 1972 $x sort:1.1 $x rights:group=stanford", "indicators"=>["4", "1"], "isProtected"=>false}]
druid:dv133mk4688: removing this 856
[{"tag"=>"856", "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/dv133mk4688 $x SDR-PURL $x item $x file:dv133mk4688%2Fpage_0001.jp2 $x label:Volume 1, number 1, October 1972 $x sort:1.1 $x rights:group=stanford", "indicators"=>["4", "1"], "isProtected"=>false}]
druid:dv133mk4688: releasing to SW, so appending {"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/dv133mk4688 $x SDR-PURL $x item $x file:dv133mk4688%2Fpage_0001.jp2 $x label:Volume 1, number 1, October 1972 $x sort:1.1 $x rights:group=stanford"}
druid:dv133mk4688: new state has these 856s
[{"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/dv133mk4688 $x SDR-PURL $x item $x file:dv133mk4688%2Fpage_0001.jp2 $x label:Volume 1, number 1, October 1972 $x sort:1.1 $x rights:group=stanford"}]
druid:dv133mk4688: finished updating
 => [["a666"], ["a666"]]

You can see the bug in the latter paste when releasing, specifically here:

druid:dv133mk4688: initial state has these 856s:
[{"tag"=>"856", "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/dv133mk4688 $x SDR-PURL $x item $x file:dv133mk4688%2Fpage_0001.jp2 $x label:Volume 1, number 1, October 1972 $x sort:1.1 $x rights:group=stanford", "indicators"=>["4", "1"], "isProtected"=>false}]

The Folio write for py799sb2220 succeeds, at which point the record will have one 856 (for py799sb2220). The write for dv133mk4688 fails[^1], possibly because the write for py799sb2220 changed the version of the in-memory record, i.e., locking. What's important to note here is that when the write for dv133mk4688 retries, it does get an incremented version of the record (verified by hand-jamming a FolioClient patch onto the server), but the 856s do not include the py799sb2220 856, so when the dv133mk4688 write succeeds on the next attempt, the record contains only the dv133mk4688 856.
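
To make the suspected failure mode concrete, here is a rough, hypothetical sketch of an update-and-retry flow. The helper names (fetch_marc_record, replace_856s, write_marc_record) are invented for illustration and are not the actual FolioWriter/FolioClient code; the point is only that a retry which re-reads the record and then re-applies a locally computed set of 856s will silently drop an 856 written by a concurrent job if that 856 isn't present in the re-read:

# Hypothetical sketch; fetch_marc_record, replace_856s, and write_marc_record are invented helpers.
def update_856_with_retry(hrid, desired_856, max_tries: 3)
  tries = 0
  begin
    tries += 1
    record = fetch_marc_record(hrid)              # read the current record and its version
    record = replace_856s(record, [desired_856])  # compute the new 856s from what we just read
    write_marc_record(hrid, record)               # PUT; fails if a concurrent write bumped the version
  rescue StandardError
    # On retry we do get a newer version number, but if the re-read record does not yet
    # contain the 856 written by the other job, this write overwrites it.
    retry if tries < max_tries
    raise
  end
end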

[^1]: The failure is usually a timeout error but sometimes it is this error documented in the FolioClient gem, and it's unclear to me if this is just how Folio record locking works.

lwrubel commented 1 year ago

Thanks for setting up this branch and providing the helpful code to make it faster to test, @mjgiarlo.

Taking our folio_client out of the picture, and using the notebook @justinlittman created to test the quick-marc API, it is possible to set up a sequence of operations where Folio will return a 409 and a message that indicates that optimistic locking is the issue. Example message text:

{"message":"ERROR: Cannot update record 5108040a-65bc-40ed-bd50-265958301ce4 because it has been changed (optimistic locking): Stored _version is 24, _version of request is 23 (23F09)","type":"-3","code":"INTERNAL_SERVER_ERROR"}

Also in the notebook, trying multiple edits of the same record using Parallel, I get the Net::ReadTimeout error that was also showing up in DSA. Is there something about what happens with Parallel that is leading to the timeouts?

I tried bumping up the timeout config on the Faraday connection, but that didn't make a difference.
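
For reference, a minimal sketch of where that setting lives on a Faraday connection (the base URL and path here are placeholders; the real connection is built inside the folio_client gem):

require 'faraday'

# Placeholder URL; folio_client constructs its own connection internally.
conn = Faraday.new(url: 'https://okapi.example.edu', request: { timeout: 120, open_timeout: 10 })
conn.get('/some/okapi/path') # requests on this connection now allow up to 120s before timing out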

I also poked around the FOLIO Slack a bit, and while people do report seeing the duplicate key value violates unique constraint \"idx_records_matched_id_gen\" error, it's in the context of using the data import module and loading duplicate records (so, different API, but also related to source-record-storage).

mjgiarlo commented 1 year ago

@lwrubel 💬

Thanks for setting up this branch and providing the helpful code to make it faster to test, @mjgiarlo.

🚀

Taking our folio_client out of the picture, and using the notebook @justinlittman created to test the quick-marc API, it is possible to set up a sequence of operations where Folio will return a 409 and a message that indicates that optimistic locking is the issue. Example message text:

{"message":"ERROR: Cannot update record 5108040a-65bc-40ed-bd50-265958301ce4 because it has been changed (optimistic locking): Stored _version is 24, _version of request is 23 (23F09)","type":"-3","code":"INTERNAL_SERVER_ERROR"}

Ah, that's good to see! I wonder if the Folio client code needs tweaking so it can see the same 409 error?

Also in the notebook, trying multiple edits of the same record using Parallel, I get the Net::ReadTimeout error that was also showing up in DSA. Is there something about what happens with Parallel that is leading to the timeouts?

Not that I can think of. The use of Parallel is to approximate what's happening when users invoke a bulk action like above. We could try taking Parallel out of the mix and invoking the two updates in rapid succession to see if any behavior or outcomes change? (E.g., joining the operations with a &&?)

I tried bumping up the timeout config on the Faraday connection, but that didn't make a difference.

Interesting. I'm curious if you bumped the timeout, the wait, or both?

I also poked around the FOLIO Slack a bit, and while people do report seeing the duplicate key value violates unique constraint \"idx_records_matched_id_gen\" error, it's in the context of using the data import module and loading duplicate records (so, different API, but also related to source-record-storage).

Thanks for poking around! 🤔

lwrubel commented 1 year ago

If you do the updates in rapid succession:

cocina_object1 = CocinaObjectStore.find('druid:py799sb2220')
cocina_object2 = CocinaObjectStore.find('druid:dv133mk4688')
Catalog::UpdateMarc856RecordService.update(cocina_object1, thumbnail_service: ThumbnailService.new(cocina_object1)) &&
  Catalog::UpdateMarc856RecordService.update(cocina_object2, thumbnail_service: ThumbnailService.new(cocina_object2))

then they release successfully.

Returning to the timeout error, I noticed that when I added request: { timeout: 120 } when instantiating a Faraday connection, the Parallel operation does not time out in the notebook. The record gets updated from multiple operations. request: { timeout: 60 } did not make a difference. I added this in a branch for the folio_client and deployed it to DSA QA. Now the diagnostic output does not show a timeout, but:

druid:py799sb2220: exception: The remote server returned an internal server error.

So still not a 409, but a 500?

druid:dv133mk4688: writing to Folio
druid:py799sb2220: writing to Folio
druid:py799sb2220: initial state has these 856s:
[{"tag"=>"856", "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford", "indicators"=>["4", "1"], "isProtected"=>false}]
druid:py799sb2220: removing this 856
[{"tag"=>"856", "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford", "indicators"=>["4", "1"], "isProtected"=>false}]
druid:py799sb2220: releasing to SW, so appending {"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford"}
druid:py799sb2220: new state has these 856s
[{"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford"}]
druid:dv133mk4688: initial state has these 856s:
[{"tag"=>"856", "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford", "indicators"=>["4", "1"], "isProtected"=>false}]
druid:dv133mk4688: removing this 856
[]
druid:dv133mk4688: releasing to SW, so appending {"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/dv133mk4688 $x SDR-PURL $x item $x file:dv133mk4688%2Fpage_0001.jp2 $x label:Volume 1, number 1, October 1972 $x sort:1.1 $x rights:group=stanford"}
druid:dv133mk4688: new state has these 856s
[{"tag"=>"856", "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford", "indicators"=>["4", "1"], "isProtected"=>false}, {"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/dv133mk4688 $x SDR-PURL $x item $x file:dv133mk4688%2Fpage_0001.jp2 $x label:Volume 1, number 1, October 1972 $x sort:1.1 $x rights:group=stanford"}]
druid:dv133mk4688: finished updating
Retrying Folio client operation for druid:py799sb2220 (1 tries)
druid:py799sb2220: exception: The remote server returned an internal server error.
druid:py799sb2220: initial state has these 856s:
[{"tag"=>"856", "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford", "indicators"=>["4", "1"], "isProtected"=>false}]
druid:py799sb2220: removing this 856
[{"tag"=>"856", "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford", "indicators"=>["4", "1"], "isProtected"=>false}]
druid:py799sb2220: releasing to SW, so appending {"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford"}
druid:py799sb2220: new state has these 856s
[{"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford"}]
druid:py799sb2220: finished updating
 => [["a666"], ["a666"]]

lwrubel commented 1 year ago

I thought this was what was happening, but now I'm not sure. Leaving it here while I keep investigating.

Each time the MARC source record is updated (via the PUT to /records-editor/records/{parsed_record_id}) the parsedRecordId for the source record changes. So if two operations are happening at the same time, it's possible that one of the operations will be trying to send updated MARC with an out-of-date parsedRecordId. And FOLIO responds with a 500 in that case.
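
As an illustrative timeline of that suspicion (the record id below is made up):

Both processes fetch the source record via quickMarc; its parsedRecordId is (say) aaa-111
Process A: PUT /records-editor/records/aaa-111  -> 202; the record's parsedRecordId changes
Process B: PUT /records-editor/records/aaa-111  -> 500, because aaa-111 is now out of date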

lwrubel commented 1 year ago

Possibly relevant: https://issues.folio.org/browse/MODQM-217 "Spike: Determine a plan to address persistent 500 errors upon editing a quickMARC record"

lwrubel commented 1 year ago

I have not been able to definitively figure out why we get a 500 back in quick-marc operations that run in parallel. Running PUT requests back-to-back will succeed or appropriately return a 409 when an optimistic locking situation arises. If these operations run at the same time, however, a 500 is returned in the situation in which a 409 was expected.

Looking at the Folio logs with @shelleydoljack did not yield any enlightening information, aside from a sense that Kafka is a bottleneck.

Bumped up the default connection timeout in https://github.com/sul-dlss/folio_client/pull/63

shelleydoljack commented 1 year ago

Can you try this on folio-dev? We are running okapi version 4.14.10 there. Just wondering if some bug fixes between the two versions (folio-test has okapi version 4.14.8) have any bearing on anything.

lwrubel commented 1 year ago

I'm headed out now, but can give it a try on folio-dev first thing on Monday and will let you know!

lwrubel commented 1 year ago

I'm getting a 500 response for one of the updates on folio-dev, so basically the same situation as folio-test. 😞

lwrubel commented 1 year ago

I asked in the Folio Slack about these 500 responses, and one of the developers at EBSCO recommended we use Data Import with an update job profile for this scenario instead of QuickMarc, "which can be slow".

lwrubel commented 1 year ago

Adding detail to the above, the recommendation from Kyle Banerjee and Ann-Marie Breaux is to:

"Use a load profile that updates records. If the URLs don't originate from MARC but need to be added to existing records, they would either need to be first exported from FOLIO as MARC or converted to MARC from a textual or JSON representation. If many records must be updated at once, this might be significantly faster as well as avoiding the locking issues. QM is convenient to automate, but it is also very slow.

OL won't block you if you go the DI route -- so you'll want things planned so MARC records you're enriching with 856s aren't receiving other types of updates while they're out. But this will be so much faster that unless you're really modifying a lot of records, the total turnaround time from start of export to updated records fully imported should be a few minutes once you have the MARC modification process dialed in."

andrewjbtw commented 1 year ago

If many records must be updated at once, this might be significantly faster as well as avoiding the locking issues.

Their comments seem to misunderstand this specific issue. It's not updating many records at once; it's updating one record many times.

lwrubel commented 1 year ago

To give an update on what we're currently observing:

  1. Druid 1 gets the source record via quickMarc, adds/deletes an 856, and sends a PUT with the current parsedRecordId (SRS id).
  2. The update to the record by Druid 1 returns a 202 response.
  3. Druid 2 is simultaneously getting the same source record, adding/deleting an 856, and sending a PUT with the same parsedRecordId.
  4. mod-source-record-storage logs show the following and a 500 (or 302) is returned:

ERROR uickMarcKafkaHandler Failed to handle QM_RECORD_UPDATED event
io.vertx.pgclient.PgException: ERROR: duplicate key value violates unique constraint "idx_records_matched_id_gen" (23505)

  5. Druid 2 sleeps 10 seconds and then starts a retry.
  6. Druid 2 gets the source record via quickMarc. The source record has a new parsedRecordId and recordState: ERROR. It does not include the 856 added by Druid 1.
  7. Druid 2 updates the record with its 856 and sends a PUT request. The response is 202.
  8. The record shows only the 856 from Druid 2.

shelleydoljack commented 1 year ago

                                                List of relations
            Schema             |            Name            | Type  |             Owner             |   Table
-------------------------------+----------------------------+-------+-------------------------------+------------
 sul_mod_source_record_storage | idx_records_matched_id_gen | index | sul_mod_source_record_storage | records_lb

select * from sul_mod_source_record_storage.records_lb where false;
id | snapshot_id | matched_id | generation | record_type | external_id | state | leader_record_status | order | suppress_discovery | created_by_user_id | created_date | updated_by_user_id | updated_date | external_hrid 
----+-------------+------------+------------+-------------+-------------+-------+----------------------+-------+--------------------+--------------------+--------------+--------------------+--------------+---------------

select * from sul_mod_source_record_storage.records_lb where matched_id = '6abdaa36-6880-580c-a4b6-0008894d02a8';
(see attachment)

I think the answer to the duplicate key value constraint is somewhere in this table data: update_856_troubleshooting.txt

Maybe you shouldn't be sending the same parsedRecordId when trying to do subsequent PUTs to the same record?

lwrubel commented 1 year ago

Thanks for the query results. That's interesting to see the table involved.

The PUTs are sent at roughly the same time by two different processes. In real life there would be even more simultaneous attempts on the same parsedRecordId.

In some other testing, we retried fetching the source record until the state was "ACTUAL" instead of "ERROR" and then added the 856. But sometimes the source record never changes to ACTUAL unless something else triggers an update, so that seems to be an unreliable way to tell whether the record is ready to be updated.

shelleydoljack commented 1 year ago

The records_lb table shows that parsedRecordId changes whenever the record is updated. So sending a PUT with the same parsedRecordId but different data is probably confusing the modules.

So if mod-source-record-manager is supposed to keep track of state, then maybe checking its endpoints for the state and the newest parsedRecordId, querying by externalId, is what you should do (apologies if you are already doing this)? e.g. ${OKAPI_URL}/change-manager/parsedRecords?externalId=e80e027c-baef-55f7-adae-ec1b7342d46d
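
In case it's useful, here is a rough console sketch of that check. The endpoint comes from the comment above; the connection setup and the response field names (recordState, parsedRecord) are assumptions about the mod-source-record-manager payload, not verified:

require 'faraday'
require 'json'

# okapi_url, tenant, token, and external_id are placeholders for the usual Okapi connection settings.
conn = Faraday.new(url: okapi_url, headers: { 'x-okapi-tenant' => tenant, 'x-okapi-token' => token })
response = conn.get('/change-manager/parsedRecords', { externalId: external_id })
payload = JSON.parse(response.body)
payload['recordState']   # assumed field name; e.g. "ACTUAL" or "ERROR"
payload['parsedRecord']  # assumed to carry the newest parsedRecordId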

lwrubel commented 1 year ago

Next step is to implement queue-locking in DSA and update common-accessioning to handle the situation where a workflow step is not in a queued state and needs a retry loop. Slides from discussion of options: https://docs.google.com/presentation/d/1O57U5jHeNQP927y_SxBvKfss6st1vQcjDM5v738JAdQ/edit#slide=id.g25f3d9136f9_0_72
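
As a very rough sketch of what that retry loop on the common-accessioning side could look like (fetch_step_status below is a made-up placeholder, not an actual workflow-client method):

# Hypothetical polling loop; fetch_step_status stands in for a workflow status lookup.
def wait_for_update_marc(druid, timeout: 300, interval: 10)
  deadline = Time.now + timeout
  loop do
    status = fetch_step_status(druid, 'releaseWF', 'update-marc')
    return status if %w[completed error].include?(status)
    raise "update-marc still #{status} for #{druid} after #{timeout}s" if Time.now > deadline

    sleep interval
  end
end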

jmartin-sul commented 1 year ago

To be explicit, and to build on @lwrubel's last comment: we decided last week that we weren't likely to figure out our issue with preventing inadvertent overwrites when concurrent updates are made to the same HRID's 856 in Folio. We are still not even sure whether the issue is a bug in Folio's optimistic locking functionality, an issue with our incorrect usage of it from the SDR side, or a configuration issue with our Folio setup.

So, after more discussion late last week around the exact approach to use for making sure that these marc updates happen serially, at least for a given HRID, we revisited the idea of single-threading the updates using solely Sidekiq configuration, instead of managing a mutex ourselves.

In all versions so far, our plan was to turn the update-marc logic in DSA into a Sidekiq job, instead of letting it block on the REST call that triggers it. We had (per Laura's comment) initially settled on using something like preservation_catalog's UniqueJob mechanism, which would've thrown an error if two updates to the same HRID were to be queued simultaneously, but when we realized that meant a messy extraction of the logic for finding HRIDs so we could do that at the job level, and a deviation from pres cat's well-tested checking of a single lock ID to something new that checks multiple keys, we briefly shifted to the idea of managing a mutex in FolioWriter using a new DSA database table. But then @jcoyne encouraged us to revisit serialization entirely through Sidekiq config, and we settled on the following plan:

  1. Instead of updating the MARC synchronously, ObjectsController#update_marc_record will be changed to queue a job (e.g. UpdateMarcJob, sent to a new update-marc queue in DSA), and will return a 2xx response code as long as the job is queued successfully. @lwrubel has some WIP for this part of the plan.
  2. Said job will have the responsibility of marking the workflow step complete. This is similar to what preservation_catalog and preservation_robots do when the latter hands off responsibility for completing the validate-moab step to the former.
  3. common-accessioning's releaseWF/update-marc robot will hit the same DSA endpoint, but will be changed to expect that a job is queued, and thus will avoid marking the workflow step complete itself, since the new DSA job will do that. See https://github.com/sul-dlss/common-accessioning/pull/1081/files
  4. Single-threading will be achieved by having one DSA worker box run one extra Sidekiq process. This one extra process will have a concurrency of 1 (i.e. one thread), and it will be the only DSA Sidekiq process/thread handling the new update-marc queue. Thus, we will actually be single-threading all SDR MARC updates, regardless of a given update's druid or HRID. We are hopeful that this won't be an untenable performance hit. The nice thing about it is that it means we don't have to do any mutex management, implementation, or testing, and we offload all of that complexity to Sidekiq config. That felt safer to us, especially given the short time frame we have for getting something reliable working. A puppet PR and a shared_configs PR address the Sidekiq config part of things, hopefully with comments that keep us from inadvertently breaking things. (A minimal sketch of this arrangement follows this list.)
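
For concreteness, here is a minimal sketch of what the job from items 1 and 4 could look like, written against the plain Sidekiq API (DSA may well use ActiveJob instead; UpdateMarcJob and the update-marc queue come from item 1, everything else is illustrative):

# Sketch only; the real job also needs to mark the releaseWF update-marc step complete (item 2).
class UpdateMarcJob
  include Sidekiq::Job

  # Dedicated queue, worked by a single Sidekiq process with concurrency 1,
  # so all SDR MARC updates run one at a time.
  sidekiq_options queue: 'update-marc'

  def perform(druid)
    cocina_object = CocinaObjectStore.find(druid)
    Catalog::UpdateMarc856RecordService.update(cocina_object, thumbnail_service: ThumbnailService.new(cocina_object))
    # ...then report the workflow step as completed.
  end
end

The single-threading itself would live entirely in the Sidekiq/puppet/shared_configs setup (one extra process, concurrency 1, working only the update-marc queue), not in the job code.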

I think the order for merging/deploying the shared_configs, puppet, common-accessioning, and DSA PRs is:

  1. shared_configs can go first; if no other parts are present, we'll just have an extra/unused sidekiq config file on the DSA boxes.
  2. Then puppet, common-accessioning, and DSA need to go together. The extra sidekiq process on one DSA worker box will need its new config file from shared_configs. DSA and common-accessioning will need to have matching expectations about the behavior of the DSA update_marc_record route. And the new DSA job code will need to be present for the successful startup of the new sidekiq worker process on the one worker box (I think, even if the queue that gets worked doesn't get created until its first use, since the worker processes will load as soon as sidekiq starts).

So, I think the easiest thing to do will be to get all 4 (?) PRs approved and mergeable across the affected codebases, and then we can do a coordinated merge/deploy (if we don't deploy proactively, I think the regular application of puppet changes would essentially deploy that one change in the absence of the other 3 parts).