jmartin-sul opened 1 year ago
Here's the output from running `crontab -l` on each of the pres cat prod VMs (I ended up doing a loop in the shell instead of using the handy `remote_execute` cap task, because the latter is a bit noisy when there's stderr output, which we get when trying to list crontab entries on any VM with no crontab for `pres`):
```
% for host in 'preservation-catalog-web-prod-01.stanford.edu' 'preservation-catalog-web-prod-02.stanford.edu' 'preservation-catalog-prod-02.stanford.edu' 'preservation-catalog-prod-03.stanford.edu' 'preservation-catalog-prod-04.stanford.edu'; do echo "\n=== **$host**" && ssh pres@"$host" crontab -l; done
=== **preservation-catalog-web-prod-01.stanford.edu**
no crontab for pres
=== **preservation-catalog-web-prod-02.stanford.edu**
# Begin Whenever generated tasks for: preservation_catalog_prod_lb at: 2022-02-03 11:54:05 -0800
# End Whenever generated tasks for: preservation_catalog_prod_lb at: 2022-02-03 11:54:05 -0800
=== **preservation-catalog-prod-02.stanford.edu**
# Begin Whenever generated tasks for: preservation_catalog_prod at: 2023-07-10 16:11:51 -0700
0 11 1 * * /bin/bash -l -c 'cd /opt/app/pres/preservation_catalog/releases/20230710231051 && bin/rails runner -e production "MoabStorageRoot.find_each(&:m2c_check!)" >> /dev/null 2>> log/m2c-err.log && curl --silent https://api.honeybadger.io/v1/check_in/4dI5G4'
0 11 15 * * /bin/bash -l -c 'cd /opt/app/pres/preservation_catalog/releases/20230710231051 && bin/rails runner -e production "MoabStorageRoot.find_each(&:c2m_check!)" >> /dev/null 2>> log/c2m-err.log && curl --silent https://api.honeybadger.io/v1/check_in/LoIvVe'
0 20 * * * /bin/bash -l -c 'cd /opt/app/pres/preservation_catalog/releases/20230710231051 && bin/rails runner -e production "PreservedObject.order(last_archive_audit: :asc).limit(PreservedObject.daily_check_count).find_each(&:audit_moab_version_replication!)" >> /dev/null 2>> log/c2a-err.log'
0 22 * * * /bin/bash -l -c 'cd /opt/app/pres/preservation_catalog/releases/20230710231051 && bin/rails runner -e production "MoabRecord.order(last_checksum_validation: :asc).limit(MoabRecord.daily_check_count).find_each(&:validate_checksums!)" >> /dev/null 2>> log/cv-err.log'
0 0 * * 3 /bin/bash -l -c 'cd /opt/app/pres/preservation_catalog/releases/20230710231051 && bin/rails runner -e production "PreservedObject.archive_check_expired.find_each(&:audit_moab_version_replication!)" >> /dev/null 2>> log/c2a-err.log && curl --silent https://api.honeybadger.io/v1/check_in/NeI6v6'
0 1 * * 0 /bin/bash -l -c 'cd /opt/app/pres/preservation_catalog/releases/20230710231051 && bin/rails runner -e production "MoabStorageRoot.find_each(&:validate_expired_checksums!)" >> /dev/null 2>> log/cv-err.log && curl --silent https://api.honeybadger.io/v1/check_in/ZbI4Zw'
# End Whenever generated tasks for: preservation_catalog_prod at: 2023-07-10 16:11:51 -0700
=== **preservation-catalog-prod-03.stanford.edu**
no crontab for pres
=== **preservation-catalog-prod-04.stanford.edu**
# Begin Whenever generated tasks for: preservation_catalog_prod at: 2023-07-10 16:11:51 -0700
15 1 * * * /bin/bash -l -c 'cd /opt/app/pres/preservation_catalog/releases/20230710231051 && RAILS_ENV=production bundle exec rake prescat:cache_cleaner:empty_directories --silent >> /var/log/preservation_catalog/zip_cache_cleanup.log'
0 * * * * /bin/bash -l -c 'cd /opt/app/pres/preservation_catalog/releases/20230710231051 && RAILS_ENV=production bundle exec rake prescat:cache_cleaner:stale_files --silent >> /var/log/preservation_catalog/zip_cache_cleanup.log'
# End Whenever generated tasks for: preservation_catalog_prod at: 2023-07-10 16:11:51 -0700
```
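For orientation, here's a rough stand-in for what the two `cache_cleaner` tasks presumably boil down to. The real logic lives in `lib/tasks/prescat.rake`; the path, directory layout, and `find` flags below are illustrative assumptions, not copied from the task definitions:

```shell
#!/bin/sh
# Illustrative stand-in for the prescat:cache_cleaner rake tasks.
# ZIP_CACHE is a throwaway directory, not the real zip temp space.
ZIP_CACHE="$(mktemp -d)"
mkdir -p "$ZIP_CACHE/bj/102/hs/9687"          # hypothetical druid-tree dir
: > "$ZIP_CACHE/bj/102/hs/9687/example.zip"   # hypothetical stale zip

# stale_files: remove old zip/checksum files (the age predicate, e.g. -mtime,
# is elided here so the example is deterministic)
find "$ZIP_CACHE" -mindepth 1 -type f -delete

# empty_directories: prune the directories the file cleanup left empty
# (-delete implies -depth, so a whole now-empty chain goes in one pass)
find "$ZIP_CACHE" -mindepth 1 -type d -empty -delete
```

Because `-delete` implies depth-first traversal, the second `find` removes a nested chain of newly empty directories in a single run.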
Other observations: `find <dir> <other options> -delete` doesn't multithread by default. But I'm definitely not an expert on `find`, so grain of salt there.
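(If deletion throughput ever became the concern, one common workaround — sketched below, and not something prescat currently does — is to have `find` print matches and fan the unlinks out to parallel `xargs` workers instead of using the serial `-delete`:)

```shell
#!/bin/sh
# Sketch: parallel deletion as an alternative to single-threaded `find -delete`.
SCRATCH="$(mktemp -d)"
for i in 1 2 3 4; do : > "$SCRATCH/file_$i"; done

# -print0 / -0 keep unusual filenames safe; -P 4 runs up to four rm
# processes concurrently; -r skips rm entirely when there are no matches
find "$SCRATCH" -mindepth 1 -type f -print0 | xargs -0 -r -P 4 rm -f
```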
**Describe the bug**
The `find` command that's executed by the rake task that cleans up the zip temp space (in which zips for cloud archiving are created and stored) is running into errors like the following:
**User Impact**
No serious deleterious user impact, as far as I can tell; i.e., it doesn't seem like we're deleting things that shouldn't be deleted, or, e.g., failing to send content to cloud archive endpoints.
But since we maintain many services, it's good to minimize noise in our error reporting, so that we don't accidentally overlook real problems.
**Object(s) In This Error State**
Nothing really stays in this error state, as far as I can tell: the issue appears to be extra attempts to delete something that should indeed be deleted, but already has been.
**Steps to Reproduce, if Known**
🤷 it'll probably just keep happening intermittently, based on the pattern these last few weeks?
**Expected behavior**
Zip temp space cleanup works without error.
**Screenshots**
N/A, text above is better
**Additional context**
From a cursory investigation, it's unclear what's causing this. Based on when I started seeing the emails from the cron daemon, it seems to have (re-)started on 2023-05-01.
The number of errors from the `find` command can vary greatly from day to day, and (from a quick glance back at my email) seems to range anywhere from a couple errored deletion attempts to a few hundred (just under 1000 in what looked like the largest error listings when I re-skimmed those emails just now).

This looks very similar to #1181. That bug was filed in 2019 and closed in 2022. Nothing on that ticket indicates this, but I have a vague recollection that we saw a similar problem within the first couple years of pres cat going into production, and that the underlying issue was more than one VM running the cleanup cron job, leading to a race condition. That doesn't seem to be the case here, as I only see one instance of each type of cleanup cron job across all of the pres cat prod VMs (there's a cleanup job for the actual zip files and their corresponding checksum files, and then a cleanup job that comes along to get rid of empty directories after the file cleanup happens).
The `find` commands for cleanup are packaged as Rake tasks defined in `lib/tasks/prescat.rake`. They're added to the crontab for VMs with the `cache_cleaner` role (currently only `preservation-catalog-prod-04`) by `config/schedule.rb` (using the Whenever gem, as usual for us). Listing the crontab for each VM, I didn't see any erroneous-looking entries (see comment below).
I'm a bit stumped as to what the problem is. My best guess right now, since this looks very much like a race condition, is that the cleanup sometimes takes more than 24 hours to run, so that one night's cleanup job finds and attempts to delete files/dirs that the prior night's job had also found, but hadn't gotten to yet.
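If the overlapping-runs hypothesis pans out, a cheap guard would be to wrap the cron'd command with util-linux `flock(1)` so a second run bails out while the previous one is still going. This is only a sketch: the lock path is an assumption, and nothing like this currently exists in `config/schedule.rb`:

```shell
#!/bin/sh
# Sketch: serialize cleanup runs so an overlapping cron firing skips
# instead of racing the previous run's deletions.
LOCK="$(mktemp)"   # stand-in; a crontab entry would use a fixed path

# A crontab line could wrap the rake invocation roughly like:
#   flock -n /var/tmp/zip_cache_cleanup.lock -c 'bundle exec rake prescat:cache_cleaner:stale_files'

# Demonstration: hold the lock briefly, then try to take it again with -n
# (-n = fail immediately rather than queueing behind the holder).
flock "$LOCK" -c 'sleep 3' &
sleep 1
if flock -n "$LOCK" -c true; then
  SECOND_RUN=started
else
  SECOND_RUN=skipped   # expected: the first "run" still holds the lock
fi
wait
```

The tradeoff is that a skipped run just waits for the next cron firing, which seems fine for a cleanup job; `flock` without `-n` would instead queue runs up, which could compound the backlog.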