spedas / bleeding_edge

IDL-based Space Physics Environment Data Analysis Software (bleeding edge)
http://www.spedas.org

Overzealous cleanup in SOC processing scripts #112

Closed jameswilburlewis closed 10 months ago

jameswilburlewis commented 11 months ago

I just had a very strange failure in a cron job -- partway through running an IDL script, it failed because it couldn't find file_touch.pro. The ksh script I was running unpacks the latest bleeding edge zip file in its working directory, under /mydisks/home/thmsoc on cronus. Sure enough, file_touch.pro was missing from the unpacked directory... along with nearly all of the other .pro files! There were only a handful left. From the modification timestamps on the directories that had been emptied out, it looks like it happened at 7:30 PM local time. The same thing had happened in the working directory for the previous execution, at the same time. It turns out there's another cron job that runs on cronus nightly at 7:30 PM, and it contains this line:

maccs_ascii2all_cleanup.ksh:    find ${TMPDIR} -name \* -mtime +32 -exec /bin/rm -rf {} \;

TMPDIR and LOCAL are both set to /mydisks/home/thmsoc in the soc_it_to_me config script.

So in my script's working directory /mydisks/home/thmsoc/mms_unit_tests_all, all the IDL files with modification times more than 32 days in the past will get clobbered every night at 7:30 PM. (unzip preserves the original modification dates, so even a fresh copy will have "old" files)
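The unzip/mtime interaction can be demonstrated directly (and suggests a possible stopgap). This is an illustrative sketch, not one of the actual SOC scripts; it assumes GNU `touch` with the `-d` date-string option:

```shell
# Work in a throwaway directory
WORKDIR=$(mktemp -d)
cd "$WORKDIR"

# Simulate a freshly unpacked file that carries an old archive timestamp,
# just as unzip produces:
touch -d "2022-01-01" old_file.pro

# The nightly cleanup's test matches it, even though it was "created" now:
find . -name '*.pro' -mtime +32 | grep -q old_file.pro && echo "would be deleted"

# Stopgap: refresh timestamps right after unpacking, taking the files
# out of the mtime-based cleanup's reach:
find . -type f -exec touch {} +
find . -name '*.pro' -mtime +32 | grep -q old_file.pro || echo "now safe"
```

This only papers over the symptom, of course; the real fix is to stop cleanup jobs from sweeping other jobs' directories at all.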

It turns out there are several scripts that assume that anything under $TMPDIR is fair game to overwrite or delete. If two of them happen to run on the same host, both using the same /mydisks/home/thmsoc, they are likely to interfere with each other. Imagine the possibilities for a previously working cron job failing mysteriously, if it's moved to a different host, or even run at a different time of day...

The best solution is for each job using $TMPDIR, $LOCAL, or /mydisks/home/thmsoc to create a subdirectory, and confine all its file creation and deletion to that subdirectory. If there is any possibility that two or more copies of the same script can run at the same time (e.g. a reprocessing happening alongside the standard daily processing), the working directory name should probably get a suffix with a date+time, process ID, or other more-or-less unique tag.
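A minimal sketch of that pattern (the job name is illustrative, not one of the actual SOC scripts): each job carves out its own subdirectory under $TMPDIR, tagged with date+time and PID so concurrent runs cannot collide, and its cleanup touches only that tree:

```shell
TMPDIR=${TMPDIR:-/tmp}
JOBNAME=my_soc_job                    # hypothetical job name

# Unique per-run working directory: date+time plus process ID
WORKDIR="${TMPDIR}/${JOBNAME}.$(date +%Y%m%d_%H%M%S).$$"
mkdir -p "$WORKDIR"
cd "$WORKDIR"

# ... all file creation and deletion happens here, never in $TMPDIR itself ...

# Cleanup removes only this run's own tree:
cd /
rm -rf "$WORKDIR"
```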

Here's a (partial?) list of scripts with potentially dangerous 'rm' calls. The most destructive, which should be fixed first, are the ones with "find ... -exec rm" style commands:

green_ascii2all_cleanup.ksh:    find ${TMPDIR} -name \* -mtime +14 -exec /bin/rm -rf {} \;
maccs_ascii2all_cleanup.ksh:    find ${TMPDIR} -name \* -mtime +32 -exec /bin/rm -rf {} \;
make_ae_index_cleanup.ksh:  find ${LOCAL} -name AEIndex\* -mtime +7 -exec /bin/rm -rf {} \;
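A safer shape for these cleanup passes might look like the following sketch (the subdirectory name is an assumption, not the scripts' actual layout): scope the find to the job's own subdirectory rather than all of $TMPDIR, and match only plain files so other jobs' trees survive.

```shell
TMPDIR=${TMPDIR:-/tmp}
JOBDIR="${TMPDIR}/maccs_ascii2all"    # assumed per-job subdirectory
mkdir -p "$JOBDIR"

# Delete only this job's stale files; -type f avoids recursively
# removing directories, so no -rf is needed:
find "$JOBDIR" -type f -mtime +32 -exec /bin/rm -f {} \;
```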

Less serious, but still potential for conflicts:

green_ascii2all_new.ksh:    rm -f ${TMPDIR}/*.bm
green_ascii2all_new.ksh:    rm -f ${TMPDIR}/*.txt
green_ascii2all_new.ksh:    rm -f ${TMPDIR}/*.cdf
make_asi_orbits.ksh:        rm -rf ${TMPDIR}/mms
make_earthlunar_orbits_backprocess.ksh:        rm -rf ${TMPDIR}/mms
make_earthlunar_orbits_backprocess.ksh:        rm -rf ${TMPDIR}/ergsc
make_earthlunar_orbits.ksh:        rm -rf ${TMPDIR}/mms
make_earthlunar_orbits.ksh:        rm -rf ${TMPDIR}/ergsc
make_earth_orbits_backprocess.ksh:        rm -rf ${TMPDIR}/mms
make_earth_orbits_backprocess.ksh:        rm -rf ${TMPDIR}/ergsc
make_earth_orbits.ksh:        rm -rf ${TMPDIR}/mms
make_earth_orbits.ksh:        rm -rf ${TMPDIR}/ergsc
make_lunar_orbits.ksh:        rm -rf ${TMPDIR}/mms
step_ascii2all.ksh:    rm -f ${TMPDIR}/*.bm
step_ascii2all.ksh:    rm -f ${TMPDIR}/*.txt
step_ascii2all.ksh:    rm -f ${TMPDIR}/*.cdf
test_lunar_orbits.ksh:        rm -rf ${TMPDIR}/mms
ucla_flat2cdf1.ksh: rm -rf ${TMPDIR}/mastercdfs
ucla_flat2cdf.ksh:  rm -rf ${TMPDIR}/mastercdfs

create_probe_l1tol2_15days.ksh: rm -rf ${LOCAL}/l2_proc
create_probe_l1tol2_5days.ksh:  rm -rf ${LOCAL}/l2_proc
create_probe_l2cdf.ksh: rm -rf ${LOCAL}/l2_proc
create_probe_l2cdf_n.ksh:   rm -rf ${LOCAL}/l2_proc
create_probe_l2cdf_nostate.ksh: rm -rf ${LOCAL}/l2_proc
create_probe_l2cdf_vm2011.ksh:  rm -rf ${LOCAL}/l2_proc
clrussell90404 commented 11 months ago

Hi Jim,

I started to look into the issues you mentioned above, but I am unable to access my desktop at UCLA. I use MobaXterm to access cronus at SSL to do any THEMIS SOC processing work, and I haven't set anything up on my Mac since all my work is on my desktop at UCLA. Now it seems that their network (or something else) is down. I've run into this problem before, but Ethan and/or Austin were able to resolve it within a day or so, and since I haven't had anything super critical in SOC processing, it hasn't been a problem. With the issues you note above, though, I think it's critical that I work on this as soon as possible.

Do you have time to help me set up something on my mac? I'm around all day just let me know.

Thanks!

Cindy


From: Jim Lewis. Sent: Saturday, September 30, 2023, 11:45 AM. To: spedas/bleeding_edge. Subject: [spedas/bleeding_edge] Overzealous cleanup in SOC processing scripts (Issue #112)

I just had a very strange failure in a cron job -- partway through running an IDL script, it failed because it couldn't find file_touch.pro. The ksh script I was running unpacks the latest bleeding edge zip file in its working directory, under /mydisks/home/thmsoc on cronus. Sure enough, file_touch.pro was missing from the unpacked directory....along with nearly all of the other .pro files! There were only a handful left. From looking at modification timestamp on the directories that had been emptied out, it looks like it happened at 7:30 PM local time. The same thing had happened in the working directory for the previous execution, at the same time. It turns out there's another cron job that runs on cronus nightly at 7:30, and it contains this line:

maccs_ascii2all_cleanup.ksh: find ${TMPDIR} -name * -mtime +32 -exec /bin/rm -rf {} ;

TMPDIR and LOCAL are both set to /mydisks/home/thmsoc in the soc_it_to_me config script.

So in my script's working directory /mydisks/home/thmsoc/mms_unit_tests_all, all the IDL files with modification times more than 32 days in the past will get clobbered every night at 7:30 PM. (unzip preserves the original modification dates, so even a fresh copy will have "old" files)

It turns out there are several scripts that assume that anything under $TMPDIR is fair game to overwrite or delete. If two of them happen to run on the same host, both using the same /mydisks/home/thmsoc, they are likely to interfere with each other. Imagine the possibilities for a previously working cron job failing mysteriously, if it's moved to a different host, or even run at a different time of day...

The best solution is for each job using TMPDIR or /mydisks/home/thmsoc to create a subdirectory, and confine all its file creation and deletion to that subdirectory. If there is any possibility that two or more copies of the same script can run at the same time (e.g. a reprocessing happening at the same time as the standard daily processing), the working directory name should probably get a suffix with a date+time, process ID, or other more-or-less unique tag.

Here's a (partial?) list of scripts with potentially dangerous 'rm' calls. The most potentially destructive scripts that should be fixed first would be the ones with "find ... -exec rm" style commands:

green_ascii2all_cleanup.ksh: find ${TMPDIR} -name -mtime +14 -exec /bin/rm -rf {} ; maccs_ascii2all_cleanup.ksh: find ${TMPDIR} -name -mtime +32 -exec /bin/rm -rf {} ; make_ae_index_cleanup.ksh: find ${LOCAL} -name AEIndex* -mtime +7 -exec /bin/rm -rf {} ;

Less serious, but still potential for conflicts:

green_ascii2all_new.ksh: rm -f ${TMPDIR}/.bm green_ascii2all_new.ksh: rm -f ${TMPDIR}/.txt green_ascii2all_new.ksh: rm -f ${TMPDIR}/.cdf make_asi_orbits.ksh: rm -rf ${TMPDIR}/mms make_earthlunar_orbits_backprocess.ksh: rm -rf ${TMPDIR}/mms make_earthlunar_orbits_backprocess.ksh: rm -rf ${TMPDIR}/ergsc make_earthlunar_orbits.ksh: rm -rf ${TMPDIR}/mms make_earthlunar_orbits.ksh: rm -rf ${TMPDIR}/ergsc make_earth_orbits_backprocess.ksh: rm -rf ${TMPDIR}/mms make_earth_orbits_backprocess.ksh: rm -rf ${TMPDIR}/ergsc make_earth_orbits.ksh: rm -rf ${TMPDIR}/mms make_earth_orbits.ksh: rm -rf ${TMPDIR}/ergsc make_lunar_orbits.ksh: rm -rf ${TMPDIR}/mms step_ascii2all.ksh: rm -f ${TMPDIR}/.bm step_ascii2all.ksh: rm -f ${TMPDIR}/.txt step_ascii2all.ksh: rm -f ${TMPDIR}/.cdf test_lunar_orbits.ksh: rm -rf ${TMPDIR}/mms ucla_flat2cdf1.ksh: rm -rf ${TMPDIR}/mastercdfs ucla_flat2cdf.ksh: rm -rf ${TMPDIR}/mastercdfs

create_probe_l1tol2_15days.ksh: rm -rf ${LOCAL}/l2_proc create_probe_l1tol2_5days.ksh: rm -rf ${LOCAL}/l2_proc create_probe_l2cdf.ksh: rm -rf ${LOCAL}/l2_proc create_probe_l2cdf_n.ksh: rm -rf ${LOCAL}/l2_proc create_probe_l2cdf_nostate.ksh: rm -rf ${LOCAL}/l2_proc create_probe_l2cdf_vm2011.ksh: rm -rf ${LOCAL}/l2_proc

— Reply to this email directly, view it on GitHubhttps://github.com/spedas/bleeding_edge/issues/112, or unsubscribehttps://github.com/notifications/unsubscribe-auth/A5NNFAQMWVQDAYW4KM7VNODX5BLD7ANCNFSM6AAAAAA5NWJKIA. You are receiving this because you were assigned.Message ID: @.***>