Inconsistent assumed file locations in association files

stscijgbot-jp commented 3 years ago

Issue JP-2038 was created on JIRA by Bryan Hilbert:

I'm developing a set of notebooks for the JWebbinars that show how to run the various stages of the pipeline when using imaging data. I've noticed what seems to be an inconsistency between level 2 and level 3 in the assumed file locations when association files list relative paths.

For calwebb_image2, it seems that relative paths within the association file are interpreted as being relative to the directory that the pipeline is running in, while for calwebb_image3 the relative paths in the association file are interpreted as being relative to where the association file is sitting.

Here are the details of my setup:

I'm working in a directory called 'imaging_mode'. In this directory are my notebooks for running the pipelines. In a subdirectory called 'Stage1' I have rate files and a level 2 association file. In a subdirectory called 'Stage2' I have _cal.fits files and a level 3 association file.

When running my stage 2 notebook in the 'imaging_mode' directory, I call calwebb_image2 and provide it with the association file 'Stage1/level2_lw_asn.json'. Within that association file, I have the member files listed as (e.g.) 'Stage1/my_file_rate.fits'. In this configuration, calwebb_image2 runs successfully. If I change the members in the association file by removing the 'Stage1', then the pipeline fails because it doesn't find the rate files.

However, when running calwebb_image3 in a similar way, I encounter problems. Again, I run my notebook in the 'imaging_mode' directory. I call the pipeline and give it my association file, which is in the Stage2 subdirectory: 'Stage2/level3_lw_asn.json'. Within the association file, if I list the members as (e.g.) 'Stage2/my_file_cal.fits' (the equivalent of the case that worked for calwebb_image2), the pipeline fails, reporting that it is looking for 'imaging_mode/Stage2/Stage2/my_file_cal.fits'. If I change the association file to remove the subdirectory from each member, then calwebb_image3 looks in the correct directory and runs successfully.
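To make the two interpretations concrete, here is a minimal sketch of how the same member string resolves under each rule. The helper names are hypothetical and this is not the pipeline's actual code; it only reproduces the path arithmetic described above, using the directory and file names from this report.

```python
from pathlib import PurePosixPath


def resolve_relative_to_cwd(cwd, member):
    """Interpret the member path relative to the working directory
    (the behavior observed with calwebb_image2)."""
    return PurePosixPath(cwd) / member


def resolve_relative_to_asn(cwd, asn_path, member):
    """Interpret the member path relative to the directory holding the
    association file (the behavior observed with calwebb_image3)."""
    return (PurePosixPath(cwd) / asn_path).parent / member


cwd = "imaging_mode"
asn_path = "Stage2/level3_lw_asn.json"
member = "Stage2/my_file_cal.fits"

print(resolve_relative_to_cwd(cwd, member))
# imaging_mode/Stage2/my_file_cal.fits
print(resolve_relative_to_asn(cwd, asn_path, member))
# imaging_mode/Stage2/Stage2/my_file_cal.fits
```

Under the second rule the subdirectory is counted twice, which matches the doubled 'Stage2/Stage2' path in the error.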

jdavies-st commented 3 years ago

Thanks for the bug report. We'll investigate.

A solution (the solution) for this is to never have paths in associations, only file names, and to run everything in the same directory. This is especially important when doing WFSS, as the spec2 pipeline depends on the image3 pipeline having already run and uses its outputs.

This is how all our testing is done, and this is how SDP does it too. Filenames are unique, so there's no need to break the different stages up into directories.

Associations should never have paths - only filenames. Paths make them not portable. And you'll never see an association generated by SDP that does this. Maybe we should be very strict about associations not having paths?
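As a rough illustration of the "file names only" convention, the sketch below strips directory components from the members of an association that has already been loaded into a dict. It assumes the usual products/members/expname layout; `strip_member_paths` and the sample product name are hypothetical and not part of the jwst package.

```python
import copy
import os


def strip_member_paths(asn):
    """Return a copy of the association dict in which every member's
    expname is reduced to a bare file name (hypothetical helper)."""
    cleaned = copy.deepcopy(asn)
    for product in cleaned.get("products", []):
        for member in product.get("members", []):
            member["expname"] = os.path.basename(member["expname"])
    return cleaned


# Illustrative association content, using the file name from this report.
asn = {
    "asn_type": "image3",
    "products": [
        {
            "name": "my_product",
            "members": [
                {"expname": "Stage2/my_file_cal.fits", "exptype": "science"}
            ],
        }
    ],
}

cleaned = strip_member_paths(asn)
print(cleaned["products"][0]["members"][0]["expname"])
# my_file_cal.fits
```

The cleaned association would then need to sit in, and be run from, the same directory as the files it lists.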

kmacdonald-stsci commented 2 years ago

After looking at this issue and talking to some people, it is clear there is a bug due to ambiguity about the root directory for relative paths. As the issue is described, the subdirectory "Stage1" contains all input files needed to process stage 1, so there is no need to put relative paths in the JSON file. The option "--input_dir" can be passed "Stage1" as an argument, and the pipeline will then run without needing relative path names in the JSON file. The same applies to stage 2 processing.

Because of this and the feedback from James Davies, all input files should be in a single directory and the JSON files should contain no path information, only file names. I will update the documentation to make this clearer.

Also, I will change the associations class to detect member names that contain path information, rather than bare file names, and to raise an exception when it finds one. This will eliminate the ambiguity bug Bryan Hilbert noticed in this issue.
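A check along these lines could look like the sketch below. It is only an illustration of the idea, not the change that will go into the associations class: `MemberPathError` and `check_member_names` are hypothetical names, and the association is assumed to be a plain dict with the usual products/members/expname layout.

```python
import os


class MemberPathError(ValueError):
    """Hypothetical exception for member names that include a path."""


def check_member_names(asn):
    """Raise MemberPathError if any member's expname is not a bare file name."""
    for product in asn.get("products", []):
        for member in product.get("members", []):
            name = member["expname"]
            # A bare file name is unchanged by basename(); anything with a
            # directory component is not.
            if os.path.basename(name) != name:
                raise MemberPathError(
                    f"Association member {name!r} contains path information; "
                    "only file names are allowed."
                )


good = {"products": [{"members": [{"expname": "my_file_cal.fits"}]}]}
bad = {"products": [{"members": [{"expname": "Stage2/my_file_cal.fits"}]}]}

check_member_names(good)  # passes silently
try:
    check_member_names(bad)
except MemberPathError as err:
    print(err)
```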