sul-dlss / dor-services-app

A Rails application exposing Digital Object Registry functions as a RESTful HTTP API
https://sul-dlss.github.io/dor-services-app/

Improve handling of publish/shelve for deposits with 15,000+ files #5187

Open andrewjbtw opened 5 days ago

andrewjbtw commented 5 days ago

The issue

Deposits with 15,000+ files are taking a long time to publish, long enough that we had to change a timeout this week to get one deposit through. We have a second deposit currently moving slowly through the publish step in accessionWF.

Production items that have had this problem recently:

There are multiple HB errors associated with these large deposits:

Observing the progress of these deposits, I think this is what's happening:

The result is that transfers that should take a few hours are taking at least a day, not counting the possible need for us to intervene in order to ensure that they complete the publish step.

To reproduce

I've created a stage object that has gotten stuck in the same way: https://argo-stage.stanford.edu/view/druid:gm804vr8973

It's currently caught in retries because of the "error acquiring lock" issue. The total size of that item is under 1 MB, but it has 16,828 files in it, so I think the problem is about the number of files and not the size of the deposit.

Additional background

We don't get many deposits over 15,000 files - and we ask people not to exceed 25k files in an H2 deposit - but we have been supporting deposits in that range since we launched the H2-Globus deposit integration. We did not see this constellation of issues before we redid the publish/shelve approach during the recent versioning work.

andrewjbtw commented 5 days ago

Some additional observations:

I noticed today that files continue to be written to Stacks for tm782sf2963 even when Sidekiq says there's no publish job running. A few thousand files have been written since about 5 PM today (it's now midnight) even though I killed the publish job in the Sidekiq dashboard. That makes me think jobs continue to run after an error occurs, and it seems likely that multiple retries have been writing files at the same time.

I confirmed that there are a lot of duplicates in /access-transfers. I made a list of ~35,000 files under 1 MB in size, then generated MD5 checksums for them. Many files have been written 5+ times. These are the most frequently occurring hashes (each representing a duplicated file) in my sample:

      7 e0f4840994c4c9d61471167d9c08d17c
      7 e693e2e76c85e331c5adef362464fbfb
      7 eb363fb32762ae24766b175794a3f858
      7 ec02fcacee1d8f14ea2a1bef1b2adb3b
      7 ee9006556d9e52bb11afa42d3bade517
      7 f047ad6338d8781affeaad92e9ac9e6f
      7 f104e16c7054ff0d87870d31d8862640
      7 f2b60bf782507ed6caa5db7807539275
      7 f32376261c00309395f9e4d2c1bdb336
      7 f32c5b47051a1b01b7679ca31ddd2055
      7 f385cf3217c97da56fd39a356186e057
      7 f4be3e869db7e50513ebff1527ff9323
      7 f5fea83e6ca111e189dd94c170271965
      7 fc5e1796a511d542ba28f2559465484c
      7 fc65fa6e23e715875357c356228811c9
      7 fca07daac869f4029e78100468e1685c
      7 fccb6254959b36a6f64c5f6e87eba8b0
      7 fd4b6d4c5aaddb606f744f25456a6aec
      8 019249d843b38662c5267491fea866d0
      8 01ae3b389207afbce9452023bb720e19
      8 0ce8c834154420f8577726405465c410
      8 0f8687440780e694a6c1a6bc69dce77c
      8 1b313cdfdd44b8687c893742427157b9
      8 274f98e8cf959ae4a3013845eefabdf6
      8 32ac8c39805ef474c16a31a833b51f78
      8 45e90a76fea53ca343472250fbeb8ec4
      8 4d7df8a1df670059661637faf560209e
      8 4f7ecfb977801cf6013ac01596ead441
      8 6b52b6a47ef5af678c92f52491b21910
      8 6c98060fa71cf54b439b21a34be59225
      8 72b5b40cfc7280de740810786ff046cd
      8 749f43c9f6bc20c9aacae04d3b80d8b6
      8 8096a323ab1b5acb8510a419f5fef8d9
      8 8149eb4f6a7940e422c68923392eb89f
      8 855f5e2a607a046ee43618565266c656
      8 89b32fd2a4e06ebc6c444a81d99ee8d6
      8 8e948776d7744629e068eec645d4e30b
      8 a8e270344d441b021c24dc3329b35449
      8 ac44af4b9095c2b90a08ea6e077d5fdf
      8 ac7b04f1427813b67e0210f45e26d4d6
      8 ae8e350b47bcab50e4e760b6e9624e02
      8 bb9c5c46c9f26b722ad5264e72799ca1
      8 c1f21e9e996cf8b69eed2390754bcde0
      8 c71390d1c7e9f9ce945fcbd47524f985
      8 cd67f1bc593dd57e48e3f665b63b997b
      8 d2f56c3d7a11c0c44ec18630b8fc4c3e
      8 ea5ca469a6d144015e273754d879d43f
      9 2c90d38043382c3cfca399b3149d55b7
      9 a07c8962e02fb5b0b58395b3339e8373
     37 d41d8cd98f00b204e9800998ecf8427e
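
For reference, a tally like the one above can be produced with a short Ruby sketch along these lines; the directory path and the 1 MB cutoff are illustrative assumptions, not the exact commands that were run:

```ruby
# Minimal sketch: count duplicate files by MD5 under the transfer directory.
# "/access-transfers" and the 1 MB cutoff are stand-ins for illustration.
require "digest"
require "find"

counts = Hash.new(0)

Find.find("/access-transfers") do |path|
  next unless File.file?(path) && File.size(path) < 1_000_000

  counts[Digest::MD5.file(path).hexdigest] += 1
end

# Print the most frequent hashes in the same "count hash" shape as the list above.
counts.sort_by { |_md5, count| count }.last(50).each do |md5, count|
  puts format("%7d %s", count, md5)
end
```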
justinlittman commented 5 days ago

The core problem is that DSA is making a synchronous call to purl-fetcher for publish/shelving, during which it waits for purl-fetcher to complete shelving. Since shelving for large objects involves copying a large number of files, this synchronous call is prone to timing out. That can leave behind a lock file (which blocks republishing attempts) and extra files in the /access-transfer file system (which are cleaned up automatically after a week).

While the timeout on the call from DSA can be increased, that is a fragile solution. The most straightforward fix is to have purl-fetcher perform the publish/shelve work in an asynchronous job. DSA can then either poll for the status or be notified by a webhook when the work completes.
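
As a rough sketch of that asynchronous shape (all of the class, route, and helper names below are invented for illustration, not the actual purl-fetcher or DSA code):

```ruby
# purl-fetcher side (hypothetical): accept the request, enqueue the work, return at once.
class PublishController < ApplicationController
  def create
    job = PublishShelveJob.perform_later(druid: params[:druid])
    # 202 Accepted plus a status URL the caller can poll.
    render json: { job_url: job_status_url(job.job_id) }, status: :accepted
  end
end

class PublishShelveJob < ApplicationJob
  def perform(druid:)
    JobStatus.start(druid)
    Shelver.shelve(druid)            # the long-running file copying
    JobStatus.complete(druid)
  rescue StandardError => e
    JobStatus.fail(druid, error: e.message)
    raise
  end
end

# DSA side (hypothetical): poll the status instead of blocking on one long HTTP request.
loop do
  status = JSON.parse(Faraday.get(job_url).body).fetch("status")
  break if status == "complete"
  raise "publish/shelve failed" if status == "failed"

  sleep 30
end
```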

edsu commented 5 days ago

I haven't confirmed this by examining the logs, but I suspect that purl-fetcher continues its work even after DSA's request has timed out. When DSA's request times out, it turns around and tries again, and so keeps encountering the .lock problem until purl-fetcher has finished and removed the .lock.

edsu commented 5 days ago

@justinlittman to prevent unnecessary recopying of files to /access-transfer, would it be feasible for ShelvableFilesStager or TransferStager to know whether the files have already been copied, and turn the copy into a no-op?

https://github.com/sul-dlss/dor-services-app/blob/main/app/services/publish/metadata_transfer_service.rb#L74-L77
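
Purely as an illustration of that idea, a check like the following could make restaging a no-op when an identical file is already in place; the method and checksum plumbing are invented, not the real ShelvableFilesStager/TransferStager API, and the next comment explains why the current setup doesn't support this:

```ruby
require "digest"
require "fileutils"

# Hypothetical idempotent stager: skip the copy when the destination already
# holds a file with the expected checksum.
def stage_file(source_path, destination_path, expected_md5)
  already_staged = File.exist?(destination_path) &&
                   Digest::MD5.file(destination_path).hexdigest == expected_md5
  return if already_staged # no-op on retry

  FileUtils.mkdir_p(File.dirname(destination_path))
  FileUtils.cp(source_path, destination_path)
end
```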

jcoyne commented 5 days ago

@edsu with the current technology in place (the ActiveStorage disk driver), that is not possible. It makes a unique token for each anticipated upload.
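
For context, a small illustration (in a Rails console with ActiveStorage loaded) of the behavior being described; the exact calls purl-fetcher makes may differ:

```ruby
# Each anticipated upload gets a fresh random key...
key1 = ActiveStorage::Blob.generate_unique_secure_token
key2 = ActiveStorage::Blob.generate_unique_secure_token
key1 == key2 # => false, even for the same logical file

# ...and the disk service derives the destination path from the key, roughly
# <storage root>/<key[0..1]>/<key[2..3]>/<key>, so a retried transfer writes to a
# new location rather than detecting the copy made by the earlier attempt.
```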