Relevant HB alert: https://app.honeybadger.io/projects/50568/faults/94638749
This has stopped working. If I release the two items in the example in the ticket, only one succeeds. The other one works if I wait and then re-release. I see the same problem with un-releasing. Only one job succeeds.
@andrewjbtw Well, that wasn't supposed to happen. 😆
I'll re-open and check it out after lunch. Should I use the same test objects above?
@andrewjbtw I tried the release bulk action on the above two test objects and it seemed to work fine (releaseWF in both completed). Should I use diff test objects? Or run it n times in a row?
Let me try it with just those two. I was trying it with a larger batch of unrelated records and it never worked that way.
I just tried a few times and both druids had 856s successfully created only once from the same bulk job. More commonly, only one druid got an 856 for release, or only one 856 was removed going the other direction. I did individual druid releases to bring them into sync - both released, or both not released - in between tests.
The one that succeeded was the job that seemed to take the longest, measured by how long the druid had status "started" for update-marc. Maybe it needs more retries? Or a setting where it retries until it recognizes success?
@andrewjbtw OK, I'll keep using these test objects and see if I can get them into the bad state. Yeah, I was already thinking this might be in the "bump the timeout value" territory. Thanks for your testing.
Surprisingly, I never saw an error. The only sign was what happened to the 856.
Now that I think I understand what you're seeing, I wonder if this isn't about needing to change a timeout but about Folio locking not working as we'd expect it.
☝🏻 is a lot of clicking and waiting, so I worked on a DSA branch to make this a little quicker and easier to test: concurrent-856-edits#4454, which you can deploy to QA. Once deployed, SSH to any DSA QA box and jump into the Rails console. Once there, paste in this Ruby code to make it easier to tweak the two test objects (by flipping the SW release flag on and off and storing the changed Cocina):
def flip_last_release_tag(release_tags)
  release_tags[0...-1] + [release_tags.last.new(release: !release_tags.last.release)]
end

def release_unrelease!
  cocina_object1 = CocinaObjectStore.find('druid:py799sb2220')
  cocina_object2 = CocinaObjectStore.find('druid:dv133mk4688')
  CocinaObjectStore.store(
    cocina_object1.new(
      administrative: cocina_object1.administrative.new(
        releaseTags: flip_last_release_tag(cocina_object1.administrative.releaseTags)
      )
    ), skip_lock: false
  )
  CocinaObjectStore.store(
    cocina_object2.new(
      administrative: cocina_object2.administrative.new(
        releaseTags: flip_last_release_tag(cocina_object2.administrative.releaseTags)
      )
    ), skip_lock: false
  )
  puts "Setting release to: #{[CocinaObjectStore.find('druid:py799sb2220').administrative.releaseTags.last.release, CocinaObjectStore.find('druid:dv133mk4688').administrative.releaseTags.last.release]}"
end
And then paste in this code that flips the Cocina release tag and then reproduces the problem:
release_unrelease!
Parallel.map(
  [CocinaObjectStore.find('druid:py799sb2220'),
   CocinaObjectStore.find('druid:dv133mk4688')]
) do |cocina_object|
  Catalog::UpdateMarc856RecordService.update(cocina_object, thumbnail_service: ThumbnailService.new(cocina_object))
end
Assuming you deployed the concurrent-856-edits#4454 branch, you should get diagnostic output like the following if unreleasing:
Setting release to: [false, false]
druid:dv133mk4688: writing to Folio
druid:py799sb2220: writing to Folio
druid:dv133mk4688: initial state has these 856s:
[]
druid:dv133mk4688: removing this 856
[]
druid:dv133mk4688: unreleasing to SW, so not appending
druid:dv133mk4688: new state has these 856s
[]
druid:py799sb2220: initial state has these 856s:
[]
druid:py799sb2220: removing this 856
[]
druid:py799sb2220: unreleasing to SW, so not appending
druid:py799sb2220: new state has these 856s
[]
druid:dv133mk4688: finished updating
Retrying Folio client operation for druid:py799sb2220 (1 tries)
druid:py799sb2220: exception: Net::ReadTimeout with #<TCPSocket:(closed)>
druid:py799sb2220: initial state has these 856s:
[]
druid:py799sb2220: removing this 856
[]
druid:py799sb2220: unreleasing to SW, so not appending
druid:py799sb2220: new state has these 856s
[]
druid:py799sb2220: finished updating
or like this if releasing:
Setting release to: [true, true]
druid:py799sb2220: writing to Folio
druid:dv133mk4688: writing to Folio
druid:dv133mk4688: initial state has these 856s:
[]
druid:dv133mk4688: removing this 856
[]
druid:dv133mk4688: releasing to SW, so appending {"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/dv133mk4688 $x SDR-PURL $x item $x file:dv133mk4688%2Fpage_0001.jp2 $x label:Volume 1, number 1, October 1972 $x sort:1.1 $x rights:group=stanford"}
druid:dv133mk4688: new state has these 856s
[{"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/dv133mk4688 $x SDR-PURL $x item $x file:dv133mk4688%2Fpage_0001.jp2 $x label:Volume 1, number 1, October 1972 $x sort:1.1 $x rights:group=stanford"}]
druid:py799sb2220: initial state has these 856s:
[]
druid:py799sb2220: removing this 856
[]
druid:py799sb2220: releasing to SW, so appending {"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford"}
druid:py799sb2220: new state has these 856s
[{"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford"}]
druid:py799sb2220: finished updating
Retrying Folio client operation for druid:dv133mk4688 (1 tries)
druid:dv133mk4688: exception: Net::ReadTimeout with #<TCPSocket:(closed)>
druid:dv133mk4688: initial state has these 856s:
[{"tag"=>"856", "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/dv133mk4688 $x SDR-PURL $x item $x file:dv133mk4688%2Fpage_0001.jp2 $x label:Volume 1, number 1, October 1972 $x sort:1.1 $x rights:group=stanford", "indicators"=>["4", "1"], "isProtected"=>false}]
druid:dv133mk4688: removing this 856
[{"tag"=>"856", "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/dv133mk4688 $x SDR-PURL $x item $x file:dv133mk4688%2Fpage_0001.jp2 $x label:Volume 1, number 1, October 1972 $x sort:1.1 $x rights:group=stanford", "indicators"=>["4", "1"], "isProtected"=>false}]
druid:dv133mk4688: releasing to SW, so appending {"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/dv133mk4688 $x SDR-PURL $x item $x file:dv133mk4688%2Fpage_0001.jp2 $x label:Volume 1, number 1, October 1972 $x sort:1.1 $x rights:group=stanford"}
druid:dv133mk4688: new state has these 856s
[{"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/dv133mk4688 $x SDR-PURL $x item $x file:dv133mk4688%2Fpage_0001.jp2 $x label:Volume 1, number 1, October 1972 $x sort:1.1 $x rights:group=stanford"}]
druid:dv133mk4688: finished updating
=> [["a666"], ["a666"]]
You can see the bug in the latter paste when releasing, specifically here:
druid:dv133mk4688: initial state has these 856s:
[{"tag"=>"856", "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/dv133mk4688 $x SDR-PURL $x item $x file:dv133mk4688%2Fpage_0001.jp2 $x label:Volume 1, number 1, October 1972 $x sort:1.1 $x rights:group=stanford", "indicators"=>["4", "1"], "isProtected"=>false}]
The Folio write for py799sb2220 succeeds, at which point the record will have one 856 (for py799sb2220). The write for dv133mk4688 fails[^1], possibly because the write for py799sb2220 changed the version of the in-memory record, i.e., locking. What's important to note here is that when the write for dv133mk4688 retries, it does get an incremented version of the record (verified by hand-jamming a FolioClient patch onto the server), but the 856s do not include the py799sb2220 856, so when the dv133mk4688 write succeeds on retry, it only contains the dv133mk4688 856.
[^1]: The failure is usually a timeout error but sometimes it is this error documented in the FolioClient gem, and it's unclear to me if this is just how Folio record locking works.
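To make the suspected lost-update sequence concrete, here is a minimal sketch of the read-modify-write race described above. The fetch_marc_record/put_marc_record helpers are hypothetical stand-ins (not the real FolioClient API); only the ordering matters:

# Hypothetical helpers standing in for the quick-marc fetch/update calls; illustrative only.
# "A" is the py799sb2220 update, "B" is the dv133mk4688 update; both target the same HRID.
record_a = fetch_marc_record(hrid)   # version N, no SDR 856s yet
record_b = fetch_marc_record(hrid)   # also version N (read before A's write lands)

record_a['fields'] << py_856
put_marc_record(hrid, record_a)      # succeeds: record is now version N+1 and has the py 856

record_b['fields'] << dv_856
put_marc_record(hrid, record_b)      # fails (timeout/500/409) because the version moved underneath it

# On retry, B re-reads and does see the bumped version, but (per the diagnostics above)
# the copy it gets back is missing the py 856, so its successful write drops A's change:
record_b = fetch_marc_record(hrid)   # version N+1, yet without the py 856
record_b['fields'] << dv_856
put_marc_record(hrid, record_b)      # succeeds; only the dv 856 remains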
Thanks for setting up this branch and providing the helpful code to make it faster to test, @mjgiarlo.
Taking our folio_client out of the picture, and using the notebook @justinlittman created to test the quick-marc API, it is possible to set up a sequence of operations where Folio will return a 409 and a message that indicates that optimistic locking is the issue. Example message text:
{"message":"ERROR: Cannot update record 5108040a-65bc-40ed-bd50-265958301ce4 because it has been changed (optimistic locking): Stored _version is 24, _version of request is 23 (23F09)","type":"-3","code":"INTERNAL_SERVER_ERROR"}
Also in the notebook, trying multiple edits of the same record using Parallel, I get the Net::ReadTimeout error that was also showing up in DSA. Is there something about what happens with Parallel that is leading to the timeouts?
I tried bumping up the timeout config on the Faraday connection, but that didn't make a difference.
I also poked around the FOLIO Slack a bit, and while people do report seeing the duplicate key value violates unique constraint \"idx_records_matched_id_gen\" error, it's in the context of using the data import module and loading duplicate records (so, different API, but also related to source-record-storage).
@lwrubel 💬
> Thanks for setting up this branch and providing the helpful code to make it faster to test, @mjgiarlo.
🚀
> Taking our folio_client out of the picture, and using the notebook @justinlittman created to test the quick-marc API, it is possible to set up a sequence of operations where Folio will return a 409 and a message that indicates that optimistic locking is the issue. Example message text:
> {"message":"ERROR: Cannot update record 5108040a-65bc-40ed-bd50-265958301ce4 because it has been changed (optimistic locking): Stored _version is 24, _version of request is 23 (23F09)","type":"-3","code":"INTERNAL_SERVER_ERROR"}
Ah, that's good to see! I wonder if the Folio client code needs tweaking so it can see the same 409 error?
> Also in the notebook, trying multiple edits of the same record using Parallel, I get the Net::ReadTimeout error that was also showing up in DSA. Is there something about what happens with Parallel that is leading to the timeouts?
Not that I can think of. The use of Parallel is to approximate what's happening when users invoke a bulk action like above. We could try taking Parallel out of the mix and invoking the two updates in rapid succession to see if any behavior or outcomes change? (E.g., joining the operations with a &&?)
> I tried bumping up the timeout config on the Faraday connection, but that didn't make a difference.
Interesting. I'm curious if you bumped the timeout, the wait, or both?
> I also poked around the FOLIO Slack a bit, and while people do report seeing the duplicate key value violates unique constraint \"idx_records_matched_id_gen\" error, it's in the context of using the data import module and loading duplicate records (so, different API, but also related to source-record-storage).
Thanks for poking around! 🤔
If you do the updates in rapid succession:
cocina_object1 = CocinaObjectStore.find('druid:py799sb2220')
cocina_object2 = CocinaObjectStore.find('druid:dv133mk4688')
Catalog::UpdateMarc856RecordService.update(cocina_object1, thumbnail_service: ThumbnailService.new(cocina_object1)) && Catalog::UpdateMarc856RecordService.update(cocina_object2, thumbnail_service: ThumbnailService.new(cocina_object2))
then they release successfully.
Returning to the timeout error, I noticed that when I added request: { timeout: 120 } when instantiating a Faraday connection, the Parallel operation does not time out in the notebook. The record gets updated from multiple operations. request: { timeout: 60 } does not make a difference. I added this in a branch for the folio_client and deployed it to DSA QA. Now the diagnostic output does not show a timeout, but:
druid:py799sb2220: exception: The remote server returned an internal server error.
So still not a 409, but a 500?
druid:dv133mk4688: writing to Folio
druid:py799sb2220: writing to Folio
druid:py799sb2220: initial state has these 856s:
[{"tag"=>"856", "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford", "indicators"=>["4", "1"], "isProtected"=>false}]
druid:py799sb2220: removing this 856
[{"tag"=>"856", "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford", "indicators"=>["4", "1"], "isProtected"=>false}]
druid:py799sb2220: releasing to SW, so appending {"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford"}
druid:py799sb2220: new state has these 856s
[{"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford"}]
druid:dv133mk4688: initial state has these 856s:
[{"tag"=>"856", "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford", "indicators"=>["4", "1"], "isProtected"=>false}]
druid:dv133mk4688: removing this 856
[]
druid:dv133mk4688: releasing to SW, so appending {"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/dv133mk4688 $x SDR-PURL $x item $x file:dv133mk4688%2Fpage_0001.jp2 $x label:Volume 1, number 1, October 1972 $x sort:1.1 $x rights:group=stanford"}
druid:dv133mk4688: new state has these 856s
[{"tag"=>"856", "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford", "indicators"=>["4", "1"], "isProtected"=>false}, {"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/dv133mk4688 $x SDR-PURL $x item $x file:dv133mk4688%2Fpage_0001.jp2 $x label:Volume 1, number 1, October 1972 $x sort:1.1 $x rights:group=stanford"}]
druid:dv133mk4688: finished updating
Retrying Folio client operation for druid:py799sb2220 (1 tries)
druid:py799sb2220: exception: The remote server returned an internal server error.
druid:py799sb2220: initial state has these 856s:
[{"tag"=>"856", "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford", "indicators"=>["4", "1"], "isProtected"=>false}]
druid:py799sb2220: removing this 856
[{"tag"=>"856", "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford", "indicators"=>["4", "1"], "isProtected"=>false}]
druid:py799sb2220: releasing to SW, so appending {"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford"}
druid:py799sb2220: new state has these 856s
[{"tag"=>"856", "isProtected"=>false, "indicators"=>["4", "1"], "content"=>"$z Available to Stanford-affiliated users. $u https://sul-purl-stage.stanford.edu/py799sb2220 $x SDR-PURL $x item $x file:py799sb2220%2F1592001.jp2 $x label:Volume 1, number 2, October 1972 $x sort:1.2 $x rights:group=stanford"}]
druid:py799sb2220: finished updating
=> [["a666"], ["a666"]
I thought this is what was happening, but now I'm not sure. Leaving it here while still investigating.
Each time the MARC source record is updated (via the PUT to /records-editor/records/{parsed_record_id}), the parsedRecordId for the source record changes. So if two operations are happening at the same time, it's possible that one of the operations will be trying to send updated MARC with an out-of-date parsedRecordId. And FOLIO responds with a 500 in that case.
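As a rough sketch of that update path (the PUT path is the one named above; the GET query parameter and JSON field names are assumptions, and the request body is simplified), the stale-id problem looks like this:

require 'json'

# Fetch the current quick-marc representation, edit it, and PUT it back, reusing the
# Faraday connection from the sketch above. If another process updates the record in
# between, parsed_record_id is stale and FOLIO returns a 500.
response = connection.get('/records-editor/records', { externalId: instance_id })
marc_json = JSON.parse(response.body)
parsed_record_id = marc_json['parsedRecordId']   # changes on every successful update

marc_json['fields'] << new_856_field             # add (or remove) the SDR 856

connection.put("/records-editor/records/#{parsed_record_id}") do |req|
  req.headers['Content-Type'] = 'application/json'
  req.body = marc_json.to_json
end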
Possibly relevant: https://issues.folio.org/browse/MODQM-217 "Spike: Determine a plan to address persistent 500 errors upon editing a quickMARC record"
I have not been able to definitively figure out why we get a 500 back in quick-marc operations that run in parallel. Running PUT requests back-to-back will succeed or appropriately return a 409 when an optimistic locking situation arises. If these operations run at the same time, however, a 500 is returned in the situation where a 409 was expected.
Looking at the Folio logs with @shelleydoljack did not yield any enlightening information, aside from a sense that Kafka is a bottleneck.
Bumped up the default connection timeout in https://github.com/sul-dlss/folio_client/pull/63
Can you try this on folio-dev? We are running okapi version 4.14.10 there. Just wondering if some bug fixes between the two versions (folio-test has okapi version 4.14.8) have any bearing on anything.
I'm headed out now, but can give it a try on folio-dev first thing on Monday and will let you know!
I'm getting a 500 response for one of the updates on folio-dev, so basically the same situation as folio-test. 😞
I asked in the Folio Slack about these 500 responses, and one of the developers at EBSCO recommended we use Data Import with an update job profile for this scenario instead of QuickMarc, "which can be slow".
Adding detail to the above, the recommendation from Kyle Banerjee and Ann-Marie Breaux is to:
"Use a load profile that updates records. If the URLs don't originate from MARC but need to be added to existing records, they would either need to be first exported from FOLIO as MARC or converted to MARC from a textual or JSON representation. If many records must be updated at once, this might be significantly faster as well as avoiding the locking issues. QM is convenient to automate, but it is also very slow.
OL won't block you if you go the DI route -- so you'll want things planned so MARC records you're enriching with 856s aren't receiving other types of updates while they're out. But this will be so much faster that unless you're really modifying a lot of records, the total turnaround time from start of export to updated records fully imported should be a few minutes once you have the MARC modification process dialed in."
> If many records must be updated at once, this might be significantly faster as well as avoiding the locking issues.
Their comments seem to misunderstand this specific issue. It's not updating many records at once; it's updating one record many times.
To update what we're currently observing:
1) Druid 1 gets the source record via quickMarc, adds/deletes an 856, and sends a PUT with the current parsedRecordId (SRS id).
2) The update to the record by Druid 1 returns a 202 response.
3) Druid 2 is simultaneously getting the same source record, adds/deletes an 856, and sends a PUT with the same parsedRecordId.
4) The mod-source-record-storage logs show the following, and a 500 (or 302) is returned:
ERROR uickMarcKafkaHandler Failed to handle QM_RECORD_UPDATED event
io.vertx.pgclient.PgException: ERROR: duplicate key value violates unique constraint "idx_records_matched_id_gen" (23505)
5) Druid 2 sleeps 10 and then starts a retry.
6) Druid 2 gets the source record via quickMarc. The source record has a new parsedRecordId and recordState: ERROR. It does not include the 856 added by Druid 1.
7) Druid 2 updates the record with its 856 and sends a PUT request. The response is 202.
8) The record shows only the 856 from Druid 2.
List of relations
Schema | Name | Type | Owner | Table
-------------------------------+----------------------------+-------+-------------------------------+------------
sul_mod_source_record_storage | idx_records_matched_id_gen | index | sul_mod_source_record_storage | records_lb
select * from sul_mod_source_record_storage.records_lb where false;
id | snapshot_id | matched_id | generation | record_type | external_id | state | leader_record_status | order | suppress_discovery | created_by_user_id | created_date | updated_by_user_id | updated_date | external_hrid
----+-------------+------------+------------+-------------+-------------+-------+----------------------+-------+--------------------+--------------------+--------------+--------------------+--------------+---------------
select * from sul_mod_source_record_storage.records_lb where matched_id = '6abdaa36-6880-580c-a4b6-0008894d02a8';
(see attachment)
I think the answer to the duplicate key value constraint is somewhere in this table data. update_856_troubleshooting.txt
Maybe you shouldn't be sending the same parsedRecordId when trying to do subsequent PUTs to the same record?
Thanks for the query results. That's interesting to see the table involved.
The PUTs are sent at roughly the same time by two different processes. IRL there would be even more simultaneous attempts on the same parsedRecordId.
In some other testing, we retried fetching the source record until the state was "ACTUAL" instead of "ERROR" and then added the 856. But sometimes the source record never gets changed to ACTUAL unless something else triggers an update, so that seems to be an unreliable way to tell if the record is ready to be updated.
The records_lb table shows that parsedRecordId changes whenever the record is updated. So sending the same parsedRecordId PUT but with different data is probably confusing the modules.
So if mod-source-record-manager is supposed to keep track of state, then maybe checking its endpoints for the state and the newest parsedRecordId, querying by externalId, is what you should do (apologies if you are already doing this)? e.g. ${OKAPI_URL}/change-manager/parsedRecords?externalId=e80e027c-baef-55f7-adae-ec1b7342d46d
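A rough sketch of that check, reusing a Faraday connection like the one sketched earlier (the response field names here are assumptions about mod-source-record-manager's payload, so verify against the actual response):

# Ask mod-source-record-manager for the freshest parsed record for this instance
# before attempting the quick-marc PUT, instead of reusing a previously fetched parsedRecordId.
response = connection.get('/change-manager/parsedRecords', { externalId: external_id })
parsed = JSON.parse(response.body)

latest_parsed_record_id = parsed['id']    # assumed field name for the current parsedRecordId
record_state = parsed['recordState']      # assumed field name; "ACTUAL" vs "ERROR" per the notes above

# Only proceed when the record is in a usable state, and PUT against latest_parsed_record_id.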
Next step is to implement queue-locking in DSA and update common-accessioning to handle the situation where a workflow step is not in a queued state and needs a retry loop. Slides from discussion of options: https://docs.google.com/presentation/d/1O57U5jHeNQP927y_SxBvKfss6st1vQcjDM5v738JAdQ/edit#slide=id.g25f3d9136f9_0_72
To be explicit, and to build on @lwrubel's last comment: we decided last week that we weren't likely to figure out our issue with preventing inadvertent overwrites when concurrent updates are made to the same HRID's 856 in Folio. We are still not even sure whether the issue is a bug in Folio's optimistic locking functionality, an issue with incorrect usage of it on the SDR side, or a configuration issue with our Folio setup.
So, after more discussion late last week around the exact approach to use for making sure that these MARC updates happen serially, at least for a given HRID, we revisited the idea of single-threading the updates solely through Sidekiq configuration, instead of managing a mutex ourselves.
In all versions so far, our plan was to turn the update-marc logic in DSA into a Sidekiq job, instead of letting it block on the REST call that triggers it. We had (per Laura's comment) initially settled on using something like preservation_catalog's UniqueJob mechanism, which would've thrown an error if two updates to the same HRID were queued simultaneously. But when we realized that meant a messy extraction of the logic for finding HRIDs so we could do that at the job level, and a deviation from pres cat's well-tested checking of a single lock ID to something new that checks multiple keys, we briefly shifted to the idea of managing a mutex in FolioWriter using a new DSA database table. Then @jcoyne encouraged us to revisit serializing entirely through Sidekiq config, and we settled on the following plan:
- ObjectsController#update_marc_record will be changed to queue a job (e.g. UpdateMarcJob, sent to a new update-marc queue in DSA), and will return a 2xx response code as long as the job is queued successfully. @lwrubel has some WIP for this part of the plan.
- validate-moab step to the former.
- The releaseWF/update-marc robot will hit the same DSA endpoint, but will be changed to expect that a job is queued, and thus will avoid marking the workflow step complete itself, since the new DSA job will do that. See https://github.com/sul-dlss/common-accessioning/pull/1081/files
- A new Sidekiq worker process will have a concurrency of 1 (i.e. one thread), and it will be the only DSA Sidekiq process/thread handling the new update-marc queue. Thus, we will actually be single-threading all SDR MARC updates, regardless of a given update's druid or HRID. We are hopeful that this won't be an untenable performance hit. The nice thing about it is that it means we don't have to do any mutex management, implementation, or testing, and we offload all of that complexity to Sidekiq config. That felt safer to us, especially given the short time frame we have for getting something reliable working. A puppet PR and a shared_configs PR address the Sidekiq config part of things, hopefully with comments that keep us from inadvertently breaking things.

I think the order for merging/deploying the shared_configs, puppet, common-accessioning, and DSA PRs is:
- update_marc_record route, and the new DSA job code will need to be present for the successful startup of the new Sidekiq worker process on the one worker box (I think, even if the queue that gets worked doesn't get created till its first use, since the worker processes will load as soon as Sidekiq starts).

So, I think the easiest thing to do will be to get all 4 (?) PRs approved and mergeable across the affected codebases, and then we can do a coordinated merge/deploy (if we don't deploy proactively, I think the regular application of puppet changes would essentially deploy that one change in the absence of the other 3 parts).
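For illustration, here is a sketch of what the DSA job side of this plan could look like. The class name and queue come from the comment above, but the actual implementation may differ, and the concurrency-1 worker dedicated to the update-marc queue is configured via the puppet/shared_configs PRs rather than in this code:

# Sketch only: a Sidekiq job that performs the MARC 856 update for a single druid.
class UpdateMarcJob
  include Sidekiq::Job

  sidekiq_options queue: 'update-marc'   # the new dedicated queue; its worker runs with concurrency 1

  def perform(druid)
    cocina_object = CocinaObjectStore.find(druid)
    Catalog::UpdateMarc856RecordService.update(
      cocina_object,
      thumbnail_service: ThumbnailService.new(cocina_object)
    )
    # plus marking the releaseWF update-marc step complete, since the robot no longer does that itself
  end
end

ObjectsController#update_marc_record would then enqueue via UpdateMarcJob.perform_async(druid) and return 2xx once the job is queued.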
The digitized serials workflow involves associating multiple druids with a single catalog record. When releasing more than one druid at the same time (via bulk action) to the same catalog record, only one of the jobs succeeds. The other hits an error.
Example: these two druids are associated with the same catalog record https://argo-qa.stanford.edu/catalog?f%5Bexploded_tag_ssim%5D%5B%5D=856+%3A+serial&f%5Btag_ssim%5D%5B%5D=Project+%3A+856
To release at the same time:
Click on the "bulk action" button to the right above the search result list
Choose "new bulk action"
Choose "manage release" (this is the top of the list, so should be selected already
Keep "release this object" selected
Choose "Populate from previous search" to select both druids
Submit the job
Go back to the search
Wait for a while
Reload the search results
See that the second item has a 500 error
See HB errors as well
It looks like this is happening for both "release" and "unrelease". I'm guessing this is because only one edit can be made at a time.