jmartin-sul opened 1 year ago
Here's the output from running `crontab -l` on each of the pres cat prod VMs (I ended up doing a loop in the shell instead of using the handy `remote_execute` cap task, because the latter is a bit noisy when there's stderr output, which we get when trying to list crontab entries on any VM with no crontab for `pres`):
```
% for host in 'preservation-catalog-web-prod-01.stanford.edu' 'preservation-catalog-web-prod-02.stanford.edu' 'preservation-catalog-prod-02.stanford.edu' 'preservation-catalog-prod-03.stanford.edu' 'preservation-catalog-prod-04.stanford.edu'; do echo "\n=== **$host**" && ssh pres@"$host" crontab -l; done
=== **preservation-catalog-web-prod-01.stanford.edu**
no crontab for pres
=== **preservation-catalog-web-prod-02.stanford.edu**
# Begin Whenever generated tasks for: preservation_catalog_prod_lb at: 2022-02-03 11:54:05 -0800
# End Whenever generated tasks for: preservation_catalog_prod_lb at: 2022-02-03 11:54:05 -0800
=== **preservation-catalog-prod-02.stanford.edu**
# Begin Whenever generated tasks for: preservation_catalog_prod at: 2023-07-10 16:11:51 -0700
0 11 1 * * /bin/bash -l -c 'cd /opt/app/pres/preservation_catalog/releases/20230710231051 && bin/rails runner -e production "MoabStorageRoot.find_each(&:m2c_check!)" >> /dev/null 2>> log/m2c-err.log && curl --silent https://api.honeybadger.io/v1/check_in/4dI5G4'
0 11 15 * * /bin/bash -l -c 'cd /opt/app/pres/preservation_catalog/releases/20230710231051 && bin/rails runner -e production "MoabStorageRoot.find_each(&:c2m_check!)" >> /dev/null 2>> log/c2m-err.log && curl --silent https://api.honeybadger.io/v1/check_in/LoIvVe'
0 20 * * * /bin/bash -l -c 'cd /opt/app/pres/preservation_catalog/releases/20230710231051 && bin/rails runner -e production "PreservedObject.order(last_archive_audit: :asc).limit(PreservedObject.daily_check_count).find_each(&:audit_moab_version_replication!)" >> /dev/null 2>> log/c2a-err.log'
0 22 * * * /bin/bash -l -c 'cd /opt/app/pres/preservation_catalog/releases/20230710231051 && bin/rails runner -e production "MoabRecord.order(last_checksum_validation: :asc).limit(MoabRecord.daily_check_count).find_each(&:validate_checksums!)" >> /dev/null 2>> log/cv-err.log'
0 0 * * 3 /bin/bash -l -c 'cd /opt/app/pres/preservation_catalog/releases/20230710231051 && bin/rails runner -e production "PreservedObject.archive_check_expired.find_each(&:audit_moab_version_replication!)" >> /dev/null 2>> log/c2a-err.log && curl --silent https://api.honeybadger.io/v1/check_in/NeI6v6'
0 1 * * 0 /bin/bash -l -c 'cd /opt/app/pres/preservation_catalog/releases/20230710231051 && bin/rails runner -e production "MoabStorageRoot.find_each(&:validate_expired_checksums!)" >> /dev/null 2>> log/cv-err.log && curl --silent https://api.honeybadger.io/v1/check_in/ZbI4Zw'
# End Whenever generated tasks for: preservation_catalog_prod at: 2023-07-10 16:11:51 -0700
=== **preservation-catalog-prod-03.stanford.edu**
no crontab for pres
=== **preservation-catalog-prod-04.stanford.edu**
# Begin Whenever generated tasks for: preservation_catalog_prod at: 2023-07-10 16:11:51 -0700
15 1 * * * /bin/bash -l -c 'cd /opt/app/pres/preservation_catalog/releases/20230710231051 && RAILS_ENV=production bundle exec rake prescat:cache_cleaner:empty_directories --silent >> /var/log/preservation_catalog/zip_cache_cleanup.log'
0 * * * * /bin/bash -l -c 'cd /opt/app/pres/preservation_catalog/releases/20230710231051 && RAILS_ENV=production bundle exec rake prescat:cache_cleaner:stale_files --silent >> /var/log/preservation_catalog/zip_cache_cleanup.log'
# End Whenever generated tasks for: preservation_catalog_prod at: 2023-07-10 16:11:51 -0700
```
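For orientation, here's a rough stand-in for what the two `cache_cleaner` tasks presumably boil down to. The real logic lives in `lib/tasks/prescat.rake`; the path, directory layout, and `find` flags below are illustrative assumptions, not copied from the task definitions:

```shell
#!/bin/sh
# Illustrative stand-in for the prescat:cache_cleaner rake tasks.
# ZIP_CACHE is a throwaway directory, not the real zip temp space.
ZIP_CACHE="$(mktemp -d)"
mkdir -p "$ZIP_CACHE/bj/102/hs/9687"          # hypothetical druid-tree dir
: > "$ZIP_CACHE/bj/102/hs/9687/example.zip"   # hypothetical stale zip

# stale_files: remove old zip/checksum files (the age predicate, e.g. -mtime,
# is elided here so the example is deterministic)
find "$ZIP_CACHE" -mindepth 1 -type f -delete

# empty_directories: prune the directories the file cleanup left empty
# (-delete implies -depth, so a whole now-empty chain goes in one pass)
find "$ZIP_CACHE" -mindepth 1 -type d -empty -delete
```

Because `-delete` implies depth-first traversal, the second `find` removes a nested chain of newly empty directories in a single run.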
Other observations: `find <dir> <other options> -delete` doesn't multithread by default. But I'm definitely not an expert on `find`, so grain of salt there.
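(If deletion throughput ever became the concern, one common workaround — sketched below, and not something prescat currently does — is to have `find` print matches and fan the unlinks out to parallel `xargs` workers instead of using the serial `-delete`:)

```shell
#!/bin/sh
# Sketch: parallel deletion as an alternative to single-threaded `find -delete`.
SCRATCH="$(mktemp -d)"
for i in 1 2 3 4; do : > "$SCRATCH/file_$i"; done

# -print0 / -0 keep unusual filenames safe; -P 4 runs up to four rm
# processes concurrently; -r skips rm entirely when there are no matches
find "$SCRATCH" -mindepth 1 -type f -print0 | xargs -0 -r -P 4 rm -f
```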
**Describe the bug**
The `find` command that's executed by the rake task that cleans up the zip temp space (in which zips for cloud archiving are created and stored) is running into errors like the following:
**User Impact**
No serious deleterious user impact, as far as I can tell; i.e., it doesn't seem like we're deleting things that shouldn't be deleted, or, e.g., failing to send content to cloud archive endpoints.
But since we maintain many services, it's good to minimize noise in our error reporting, so that we don't accidentally overlook real problems.
**Object(s) In This Error State**
Nothing really stays in this error state, as far as I can tell: the issue appears to be extra attempts to delete something that should indeed be deleted, but already has been.
**Steps to Reproduce, if Known**
🤷 it'll probably just keep happening intermittently, based on the pattern these last few weeks?
**Expected behavior**
Zip temp space cleanup works without error.
**Screenshots**
N/A, text above is better
**Additional context**
From a cursory investigation, it's unclear what's causing this. Based on when I started seeing the emails from the cron daemon, it seems to have (re-)started on 2023-05-01.
The number of errors from the `find` command can vary greatly from day to day, and (from a quick glance back at my email) seems to range anywhere from a couple errored deletion attempts to a few hundred (just under 1000 in what looked like the largest error listings when I re-skimmed those emails just now).

This looks very similar to #1181. That bug was filed in 2019 and closed in 2022. Nothing on that ticket indicates this, but I have a vague recollection that we saw a similar problem within the first couple years of pres cat going into production, and that the underlying issue was more than one VM running the cleanup cron job, leading to a race condition. That doesn't seem to be the case here, as I only see one instance of each type of cleanup cron job across all of the pres cat prod VMs (there's a cleanup job for the actual zip files and their corresponding checksum files, and then a cleanup job that comes along to get rid of empty directories after the file cleanup happens).
The `find` commands for cleanup are packaged as Rake tasks defined in `lib/tasks/prescat.rake`. They're added to the crontab for VMs with the `cache_cleaner` role (currently only `preservation-catalog-prod-04`) by `config/schedule.rb` (using the Whenever gem, as usual for us). Listing the crontab for each VM, I didn't see any erroneous-looking entries (see comment below).
I'm a bit stumped as to what the problem is. My best guess right now, since this looks very much like a race condition, is that the cleanup sometimes takes more than 24 hours to run, so that one night's cleanup job finds and attempts to delete files/dirs that the prior night's job had also found, but hadn't gotten to yet.
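If the overlapping-runs hypothesis pans out, a cheap guard would be to wrap the cron'd command with util-linux `flock(1)` so a second run bails out while the previous one is still going. This is only a sketch: the lock path is an assumption, and nothing like this currently exists in `config/schedule.rb`:

```shell
#!/bin/sh
# Sketch: serialize cleanup runs so an overlapping cron firing skips
# instead of racing the previous run's deletions.
LOCK="$(mktemp)"   # stand-in; a crontab entry would use a fixed path

# A crontab line could wrap the rake invocation roughly like:
#   flock -n /var/tmp/zip_cache_cleanup.lock -c 'bundle exec rake prescat:cache_cleaner:stale_files'

# Demonstration: hold the lock briefly, then try to take it again with -n
# (-n = fail immediately rather than queueing behind the holder).
flock "$LOCK" -c 'sleep 3' &
sleep 1
if flock -n "$LOCK" -c true; then
  SECOND_RUN=started
else
  SECOND_RUN=skipped   # expected: the first "run" still holds the lock
fi
wait
```

The tradeoff is that a skipped run just waits for the next cron firing, which seems fine for a cleanup job; `flock` without `-n` would instead queue runs up, which could compound the backlog.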