spacetelescope / jwst

Python library for science observations from the James Webb Space Telescope
https://jwst-pipeline.readthedocs.io/en/latest/
Other
564 stars 167 forks source link

Inconsistent assumed file locations in association files #5950

Closed stscijgbot-jp closed 2 years ago

stscijgbot-jp commented 3 years ago

Issue JP-2038 was created on JIRA by Bryan Hilbert:

I'm developing a set of notebooks for the JWebbinars that show how to run the various stages of the pipeline when using imaging data. I've noticed what seems to be an inconsistency in the assumed file locations when association files list relative paths between level 2 and level 3. 

For calwebb_image2, it seems that relative paths within the association file are interpreted as being relative to the directory that the pipeline is running in, while for calwebb_image3 the relative paths in the association file are interpreted as being relative to where the association file is sitting.

Here are the details of my set up:

I'm working in a directory called 'imaging_mode'. In this directory are my notebooks for running the pipelines. In a subdirectory called 'Stage1' I have rate files and a level 2 association file. In a subdirectory called 'Stage2' I have _cal.fits files and a level 3 association file. 

When running my stage 2 notebook in the 'imaging_mode' directory, I call calwebb_image2 and provide it with the association file 'Stage1/level2_lw_asn.json'. Within that association file, I have the member files listed as (e.g.) 'Stage1/my_file_rate.fits'. In this configuration, calwebb_image2 runs successfully. If I change the members in the association file by removing the 'Stage1', then the pipeline fails because it doesn't find the rate files.

However, when running calwebb_image3 in a similar way, I encounter problems. Again, I run my notebook in the 'imaging_mode' directory. I call the pipeline and give it my association file, which is in the Stage2 subdirectory: 'Stage2/level3_lw_asn.json'. Within the association file, if I list the members as (e.g.) 'Stage2/my_file_cal.fits', (which is the case that worked for calwebb_image2), the pipeline fails because it says it is looking for 'imaging_mode/Stage2/Stage2/my_file_cal.fits'. If I change the association file to remove the subdirectory from each member, then calwebb_image3 looks in the correct directory and runs successfully.

 

jdavies-st commented 3 years ago

Thanks for the bug report. We'll investigate.

A solution (the solution) for this is to never have paths in associations. Only file names. And run everything in the same directory. This is extra important when doing WFSS as the spec2 pipeline depends on the image3 pipeline having already run and uses its outputs.

This is how all our testing is done, and this is how SDP does it too. Filenames are unique, so there's no need to break the different stages up into directories.

Associations should never have paths - only filenames. Paths make them not portable. And you'll never see an association generated by SDP that does this. Maybe we should be very strict about associations not having paths?

kmacdonald-stsci commented 2 years ago

After looking at this issue and talking to some people, it is clear there is a bug due to ambiguity of the root directory for relative paths. As the issue is described the subdirectory "stage1" contains all input files needed to process stage 1, so there is no need to put relative paths in the JSON file. The option "--input_dir" can be passed "stage1" as an argument and it will process without the need for relative path names in the JSON file. Same with stage 2 processing.

Because of this and the feedback from James Davis all input files should be in a single directory and no path information should be in the JSON files, only file names. I will update the documentation to make this more clear.

Also, I will change the associations class to detect path information in names, instead of just file names, and raise an exception when attempting to do so. This will eliminate the ambiguity bug Bryan Hilbert noticed in this issue.

stscijgbot-jp commented 2 years ago

Comment by Rosa Diaz on JIRA:

Bryan Hilbert When I started creating some notebooks for the simulations, I separated the calibration products by directory (as you are doing) but moved the products to their corresponding directories once these were created. I do agree with David's statement. Let me also add that this was considered when designing association and discussed in the past. It was deemed too difficult to implement and decided to assume all the inputs and outputs for the calibration software were in the same directory.

Is your concern that users might not know what data they will get at each level? This might be addressed by pointing them to the documentation that provides with the list of products to expect at different stages.

stscijgbot-jp commented 2 years ago

Comment by Bryan Hilbert on JIRA:

Perhaps this can be solved by a documentation update. My thought is that for anyone who decides to reprocess manually, they may decide to keep products from different stages in different directories, or they may decide to modify their association files for whatever reason. In that case, I think there should be some very clear documentation somewhere that users should keep everything in the same directory. And perhaps a warning (or even a graceful exit?) when a pipeline is running if it encounters a path in the association entries. I'm mainly concerned that if I didn't know this, and I'm fairly familiar with the pipelines, that outside users will also not know about this.

stscijgbot-jp commented 2 years ago

Comment by Jonathan Eisenhamer on JIRA:

Documentation does need to be more clear. However, there are some relevant parts: There is this in the introduction, and buried in the depths, this one-liner about the input_dir parameter and this about associations.

stscijgbot-jp commented 2 years ago

Comment by David Law on JIRA:

Ran across this ticket via James' slack post; just adding a note to say that file paths in associations are an important part of my pipeline workflow and that which we've been showing for some pipeline modes in the JWebbinars.  Must say I'd overlooked the statements about this in the documentation.

Explicitly forcing a flat directory structure seems like a bad idea, even if that's what default processing assumes.  This causes such an explosion of files that the first thing I do when downloading or processing any data is to split it up into subfolders to make it easier to find/explain what I'm looking for.

stscijgbot-jp commented 2 years ago

Comment by James Davies [X] on JIRA:

Associations are produced by standard DMS processing and will be downloadable from MAST for users' data. They cannot have full or relative paths in them, as then the association assumes some file structure and is non-portable. They are designed to live with the input data they reference.

And non-portability is a real problem.

I.e. as soon as you decide to share your data (which should include the associations), it breaks processing.

stscijgbot-jp commented 2 years ago

Comment by David Law on JIRA:

Jumping in here- it looks like PR #7008 explicitly and deliberately breaks all ability to use paths within associations (as I just tested and confirmed with this branch).  That's a huge step backward in user friendliness, and merging this PR will break both my MRS pipeline notebook workflow and that for all of the MRS notebooks that are being used widely throughout the community.  Same for the MIRI imaging notebooks in widespread use.

Just because https://jira.stsci.edu/browse/JP-2038 found inconsistencies for some offline use cases doesn't suggest that we should disable it entirely.  Indeed, https://jira.stsci.edu/browse/JP-2799 just went the opposite direction in fixing such pathname bugs.

From the MIRI point of view I would strongly oppose merging this change.

stscijgbot-jp commented 2 years ago

Comment by Howard Bushouse on JIRA:

David Law What kinds of paths are folks typically using in ASN files with the expnames? Just relative paths or are some absolute?

In order for this to be handled cleanly, folks really should've been using the "input_dir" and "output_dir" params for situations like this, rather than embedding the path names in the ASN files themselves, but at the same time I can appreciate how much stuff might get broken if we change the behavior now. It's too bad so much user-based code was developed years ago already when the pipeline was still essentially in a prototype stage of development and now has problematic things baked into it. Similar situation with the .call() and .run() methods for running steps/pipelines. In hindsight, which is of course always 20/20, .run() was a bad idea and we'd love to deprecate it (but probably can't, for the same reasons as discussed here for paths).

stscijgbot-jp commented 2 years ago

Comment by Kenneth MacDonald on JIRA:

I left a comment on the PR for this, but it's probably good to leave the same sort of comment here.  I don't think effort should be made to fix any problems users may encounter when using path data in association files for expname because it was explicitly designed not to and this limitation has been documented, but as you note users do it anyway.  Because of the latter, instead of raising an exception and enforcing no path data in association files a warning disavowing the use of path data may be a better approach.  This will give information to the user that they are doing something not supported.

stscijgbot-jp commented 2 years ago

Comment by David Law on JIRA:

Howard Bushouse In the notebooks that I'm most familiar with (mine and Karl Gordon's), pipeline steps are used with the output_dir parameters to determine where files should be written to.  Association files (pretty much just for spec3 or image3) are constructed by using glob to find files in the previous output_dir and and pass those paths+names to the association file generator.

Raising a warning instead of an exception sounds like a good middle path to me, in that it makes it clear that this case is user-beware but doesn't forbid it.

stscijgbot-jp commented 2 years ago

Comment by Howard Bushouse on JIRA:

Fixed by https://github.com/spacetelescope/jwst/pull/7008/

stscijgbot-jp commented 8 months ago

Comment by David Law on JIRA:

I believe the original cause of this issue is understood (paths in asn files), and the general answer is that the pipeline works best with all I/O in the same directory.  Likewise, at this point workarounds exist for the notebooks in question as well.  Closing this ticket until such time as the issue is raised again.