Open andrewjbtw opened 5 days ago
Some additional observations:
I noticed today that files continue to be written to Stacks for tm782sf2963 even when sidekiq says there's no publish job running. A few thousand files have been written since about 5 PM today (it's now midnight), even though I killed the publish job in the sidekiq dashboard. That makes me think jobs continue to run after an error occurs, and it seems likely that multiple retries have been writing files at the same time.
I confirmed that there are a lot of duplicates in /access-transfers. I made a list of ~35,000 files under 1 MB in size, then generated md5 checksums for them. Many files have been written 5+ times. These are the most frequently occurring hashes (each representing a duplicated file) in my sample:
7 e0f4840994c4c9d61471167d9c08d17c
7 e693e2e76c85e331c5adef362464fbfb
7 eb363fb32762ae24766b175794a3f858
7 ec02fcacee1d8f14ea2a1bef1b2adb3b
7 ee9006556d9e52bb11afa42d3bade517
7 f047ad6338d8781affeaad92e9ac9e6f
7 f104e16c7054ff0d87870d31d8862640
7 f2b60bf782507ed6caa5db7807539275
7 f32376261c00309395f9e4d2c1bdb336
7 f32c5b47051a1b01b7679ca31ddd2055
7 f385cf3217c97da56fd39a356186e057
7 f4be3e869db7e50513ebff1527ff9323
7 f5fea83e6ca111e189dd94c170271965
7 fc5e1796a511d542ba28f2559465484c
7 fc65fa6e23e715875357c356228811c9
7 fca07daac869f4029e78100468e1685c
7 fccb6254959b36a6f64c5f6e87eba8b0
7 fd4b6d4c5aaddb606f744f25456a6aec
8 019249d843b38662c5267491fea866d0
8 01ae3b389207afbce9452023bb720e19
8 0ce8c834154420f8577726405465c410
8 0f8687440780e694a6c1a6bc69dce77c
8 1b313cdfdd44b8687c893742427157b9
8 274f98e8cf959ae4a3013845eefabdf6
8 32ac8c39805ef474c16a31a833b51f78
8 45e90a76fea53ca343472250fbeb8ec4
8 4d7df8a1df670059661637faf560209e
8 4f7ecfb977801cf6013ac01596ead441
8 6b52b6a47ef5af678c92f52491b21910
8 6c98060fa71cf54b439b21a34be59225
8 72b5b40cfc7280de740810786ff046cd
8 749f43c9f6bc20c9aacae04d3b80d8b6
8 8096a323ab1b5acb8510a419f5fef8d9
8 8149eb4f6a7940e422c68923392eb89f
8 855f5e2a607a046ee43618565266c656
8 89b32fd2a4e06ebc6c444a81d99ee8d6
8 8e948776d7744629e068eec645d4e30b
8 a8e270344d441b021c24dc3329b35449
8 ac44af4b9095c2b90a08ea6e077d5fdf
8 ac7b04f1427813b67e0210f45e26d4d6
8 ae8e350b47bcab50e4e760b6e9624e02
8 bb9c5c46c9f26b722ad5264e72799ca1
8 c1f21e9e996cf8b69eed2390754bcde0
8 c71390d1c7e9f9ce945fcbd47524f985
8 cd67f1bc593dd57e48e3f665b63b997b
8 d2f56c3d7a11c0c44ec18630b8fc4c3e
8 ea5ca469a6d144015e273754d879d43f
9 2c90d38043382c3cfca399b3149d55b7
9 a07c8962e02fb5b0b58395b3339e8373
37 d41d8cd98f00b204e9800998ecf8427e
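The duplicate counts above can be reproduced with a short script. This is a sketch under assumptions: the files live under /access-transfers, and the size cutoff matches the ~1 MB sample described above.

```ruby
require 'digest'

# Count occurrences of each MD5 among small files under a directory.
# The root path and size cutoff are assumptions chosen to match the
# sample described above.
def duplicate_counts(root, max_bytes: 1_000_000)
  counts = Hash.new(0)
  Dir.glob(File.join(root, '**', '*')).each do |path|
    next unless File.file?(path) && File.size(path) < max_bytes
    counts[Digest::MD5.file(path).hexdigest] += 1
  end
  counts
end

# Print the most frequently duplicated hashes, like the list above.
duplicate_counts('/access-transfers')
  .sort_by { |_, n| n }
  .last(50)
  .each { |md5, n| puts "#{n} #{md5}" }
```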
The core problem is that DSA makes a synchronous call to purl-fetcher for publish/shelving, during which it waits for purl-fetcher to complete shelving. Since shelving a large object involves copying a large number of files, this synchronous call is prone to timeout. That may leave behind a lock file (which blocks republishing attempts) and extra files on the /access-transfer file system (which are cleaned up automatically after a week).
While the timeout on the call from DSA can be increased, that is a fragile solution. The most straightforward fix is to move purl-fetcher's publish/shelve work into an asynchronous job. DSA can either poll for the status or be notified by a webhook when it completes.
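Under the polling variant, DSA's side reduces to a simple loop against a job-status endpoint. This is a sketch, not purl-fetcher's actual API: `fetch_status` stands in for an HTTP GET of a hypothetical status URL, and the status strings are assumptions.

```ruby
# Sketch of DSA polling an async publish/shelve job. fetch_status is a
# callable standing in for an HTTP GET of a hypothetical status endpoint
# that returns e.g. "queued", "running", "complete", or "failed".
def await_publish(fetch_status, interval: 5, max_polls: 720)
  max_polls.times do
    status = fetch_status.call
    return status if %w[complete failed].include?(status)
    sleep interval
  end
  raise 'publish job did not finish within the polling window'
end
```

The cap on polls keeps a stuck job from producing the unbounded-retry behavior described above.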
I haven't confirmed this by examining the logs, but I suspect that purl-fetcher continues its work even after DSA's request has timed out. When DSA's request times out, it turns around and tries again, encountering the .lock problem until purl-fetcher has finished and removed the .lock file.
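That interaction can be modeled in a few lines. This is a toy model, not purl-fetcher's actual locking code: while the first request's .lock exists, every overlapping retry fails with the same "error acquiring lock" message.

```ruby
require 'fileutils'

# Toy model of the .lock behavior described above: the lock file is
# created at the start of a publish, removed only when that publish
# finishes, and any overlapping attempt raises immediately.
def with_publish_lock(dir)
  lock = File.join(dir, '.lock')
  raise "error acquiring lock: #{lock}" if File.exist?(lock)
  FileUtils.touch(lock)
  begin
    yield
  ensure
    # Clean up only the lock we created, so a failed acquisition
    # never deletes another process's lock.
    FileUtils.rm_f(lock)
  end
end
```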
@justinlittman To prevent unnecessary recopying of files to /access-transfer, would it be feasible for ShelvableFilesStager or TransferStager to know whether the files have already been copied, and turn the copy into a no-op?
@edsu With the current technology in place (the ActiveStorage disk driver), that is not possible: it creates a unique token for each anticipated upload.
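For reference, the no-op being asked about would amount to a checksum-guarded copy. This sketch shows what a stager would need to do; it is not something the ActiveStorage disk driver supports today, and the function name is hypothetical.

```ruby
require 'digest'
require 'fileutils'

# Hypothetical idempotent staging copy: skip the write when the
# destination already exists with matching content. ActiveStorage's
# per-upload tokens are what prevent this check in the current setup.
def stage_file(src, dest)
  if File.exist?(dest) &&
     File.size(dest) == File.size(src) &&
     Digest::MD5.file(dest) == Digest::MD5.file(src)
    return :skipped
  end
  FileUtils.mkdir_p(File.dirname(dest))
  FileUtils.cp(src, dest)
  :copied
end
```

With a check like this, timed-out retries would re-verify rather than rewrite the thousands of files already staged.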
The issue
Deposits with 15,000+ files are taking a long time to publish. Long enough that we had to change a timeout this week to get one deposit through. We have a second deposit currently moving slowly through the publish step in accessionWF.
Production items that have had this problem recently:
There are multiple HB errors associated with these large deposits:
Observing the progress of these deposits, I think this is what's happening:
Files keep being written to /access-transfer - many more than are contained in the original deposit. It's likely the retries are rewriting the same files repeatedly. The result is that transfers that should take a few hours are taking at least a day, not counting the possible need for us to intervene to ensure that they complete the publish step.
To reproduce
I've created a stage object that has gotten stuck in the same way: https://argo-stage.stanford.edu/view/druid:gm804vr8973
It's currently caught in retries because of the "error acquiring lock" issue. The total size of that item is under 1 MB, but it has 16,828 files in it, so I think the problem is the number of files, not the size of the deposit.
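A many-file, small-total-size fixture like that stage object can be generated in a few lines. The count mirrors the example above; the directory path and helper name are assumptions.

```ruby
require 'fileutils'

# Build a directory of many tiny files to reproduce the "number of
# files, not size of deposit" behavior; 16_828 matches the stage
# object described above.
def make_many_files(dir, count: 16_828)
  FileUtils.mkdir_p(dir)
  count.times do |i|
    File.write(File.join(dir, format('file_%05d.txt', i)), i.to_s)
  end
  count
end
```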
Additional background
We don't get many deposits over 15,000 files - and we ask people not to exceed 25k files in an H2 deposit - but we have been supporting deposits in that range since we launched the H2-Globus deposit integration. We did not see this constellation of issues before we redid the publish/shelve approach during the recent versioning work.