spacetelescope / jwst

Python library for science observations from the James Webb Space Telescope
https://jwst-pipeline.readthedocs.io/en/latest/

level_3 cal collisions causing missing intermediate files #8729

Open stscijgbot-jp opened 2 months ago

stscijgbot-jp commented 2 months ago

Issue JP-3717 was created on JIRA by Hien Tran:

Ops has seen evidence that concurrent level 3 pipeline processes for associations with common input members can step on each other, causing missing intermediate files (i.e., *outlier_id2.fits) and a crash.

A recent example is jw01568-c1000_20240819t100727_image3_00001 and jw01568-c1004_20240819t100727_image3_00001. The c1000 asn consists of observations o001 and o002, while the c1004 asn contains o001, o002, and o003, so ALL of the members in c1000 are also in c1004. Both processes therefore produce the same intermediate files for the shared members. When the c1000 process cleaned up its intermediate files afterwards, it also removed the matching files produced by the c1004 process, so they were unavailable when the second process needed them.

The ALOG.out logs for the two processes are attached, along with an sdiff between the listings of the *outlier_id2.fits files generated in the alog for the failed c1004 and those available on disk. Note that all the missing files are for o001 and o002, exactly those that got wiped out by the c1000 process.
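
The collision described above can be reproduced in miniature. The following is only a sketch, not pipeline code: the file names are simplified stand-ins (real intermediate names are longer), `write_intermediates` is a hypothetical helper, and the two "processes" are run one after the other in a shared directory so the outcome is deterministic.

```python
import tempfile
from pathlib import Path


def write_intermediates(outdir, observations):
    """Hypothetical helper: write one placeholder intermediate file per observation."""
    paths = [outdir / f"{obs}_outlier_id2.fits" for obs in observations]
    for path in paths:
        path.write_bytes(b"")  # stand-in for real intermediate data
    return paths


with tempfile.TemporaryDirectory() as shared:
    outdir = Path(shared)

    # Both associations write into the same output directory.
    c1000_files = write_intermediates(outdir, ["o001", "o002"])
    c1004_files = write_intermediates(outdir, ["o001", "o002", "o003"])

    # The c1000 process finishes first and cleans up "its" files.
    for path in c1000_files:
        path.unlink()

    # The c1004 process now looks for the files it wrote earlier.
    missing = sorted(p.name for p in c1004_files if not p.exists())

print(missing)
```

Under these assumptions, the only files missing for c1004 are the ones for o001 and o002, matching the pattern seen in the sdiff.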

stscijgbot-jp commented 2 months ago

Comment by Tyler Pauly on JIRA:

One solution to the issue could be to alter the intermediate filenames to include an association or product name string, such that an exposure residing in multiple level 3 associations would have unique intermediate filenames if multiple associations are being processed simultaneously.
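
As a rough illustration of this idea (not the jwst implementation), the sketch below builds an intermediate filename that embeds a product name. The helper `intermediate_filename`, the example exposure filename, and the choice to drop the trailing `_cal`-style suffix are all assumptions made for the example.

```python
from pathlib import Path


def intermediate_filename(exposure_path, product_name, suffix="outlier_id2"):
    """Hypothetical helper: name an intermediate file per exposure AND per product.

    The exposure's trailing suffix (e.g. "cal") is dropped, then the product
    name and the intermediate suffix are appended.
    """
    stem = Path(exposure_path).stem       # filename without ".fits"
    base = stem.rsplit("_", 1)[0]         # drop the last "_xxx" component, if any
    return f"{base}_{product_name}_{suffix}.fits"


# Hypothetical exposure that is a member of both associations.
exposure = "jw01568001001_02101_00001_nrca1_cal.fits"

name_c1000 = intermediate_filename(exposure, "jw01568-c1000")
name_c1004 = intermediate_filename(exposure, "jw01568-c1004")
```

With names built this way, cleanup in one process would only match files carrying its own product name, at the cost of longer filenames and duplicated intermediate data on disk.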

stscijgbot-jp commented 2 months ago

Comment by Brett Graham on JIRA:

What version of jwst was used for these runs?

stscijgbot-jp commented 2 months ago

Comment by Melanie Clarke on JIRA:

Another possible solution, discussed elsewhere, might be to save the necessary intermediate data to temp files instead of to named files in the output directory.
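
A minimal sketch of the temp-file approach, using Python's standard `tempfile` module. The function `run_with_private_scratch` and its arguments are hypothetical and only illustrate the isolation; how the pipeline would actually manage its intermediate data is not specified in this thread.

```python
import tempfile
from pathlib import Path


def run_with_private_scratch(exposure_names, process):
    """Hypothetical sketch: keep intermediates in a per-run temporary directory.

    Each call gets its own scratch directory, which is removed automatically
    when the run ends, so one run's cleanup cannot touch another run's files.
    """
    with tempfile.TemporaryDirectory(prefix="level3_") as scratch:
        scratch_dir = Path(scratch)
        paths = []
        for name in exposure_names:
            path = scratch_dir / f"{name}_outlier_id2.fits"
            path.write_bytes(b"")  # stand-in for real intermediate data
            paths.append(path)
        return process(paths)


def record_paths(paths):
    """Example callback: confirm the files exist, then report where they were."""
    assert all(p.exists() for p in paths)
    return [str(p) for p in paths]


run_a = run_with_private_scratch(["o001", "o002"], record_paths)
run_b = run_with_private_scratch(["o001", "o002", "o003"], record_paths)
```

Here the shared members o001 and o002 end up at different paths in each run, and nothing is left behind once a run finishes.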