spacetelescope / jwst

Python library for science observations from the James Webb Space Telescope
https://jwst-pipeline.readthedocs.io/en/latest/

Request feedback on how data products are bundled in MAST/Portal #3547

Closed stscijgbot closed 4 years ago

stscijgbot commented 5 years ago

Issue JP-739 was created by Alicia Canipe:

The May 21st DMS Associations meeting discussed some complications with the way data products are bundled in the MAST/Portal results table and in downloaded zipped file bundles. The most significant issue was the mismatch between the file directory structure of the unzipped downloads and the flat file organization currently expected by the CAL pipeline when users want to re-process data off-line. There are a few other issues as well. See the [Association Meeting Notes](https://innerspace.stsci.edu/display/jwstdms/2019-05-21+Associations+Meeting+Notes) for details.

They would like INS folks to offer their thoughts on this issue. It would be highly valuable to hear the "top 5" use cases for the circumstances in which users will want to re-process JWST data on their home machines. Some cases are easy to anticipate, such as spectral source extraction from MOS or slitless spectrograms, or subtracting PSFs from coronagraphic images. But knowing the most common cases could inform the choice of how best to package data.

stscijgbot commented 5 years ago

Comment by David Law: As discussed during the meeting yesterday, the mismatch between the archive structure (directories to organize data) and the structure assumed by the pipeline (no directories, everything flat) suggests a few possible solutions:

1) MAST changes to deliver everything bundled together in one directory.  That seems undesirable; with potentially thousands of files with long and complicated names it could be very challenging to tell what is what after downloading.

2) A script that converts the MAST download structure to a flat single directory for use with the pipeline.  This doesn't seem to provide any advantages over (1), and might suffer from copying large quantities of files.
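For what it's worth, option (2) need not actually copy anything: a flattening script could symlink the files instead, avoiding duplication. A minimal sketch of that idea (purely illustrative; no such script or directory layout exists in MAST deliverables):

```python
# Hypothetical sketch: flatten a nested MAST download into a single
# directory of symlinks, so the pipeline sees a flat layout without
# duplicating any data on disk. Directory names are illustrative only.
from pathlib import Path

def flatten(download_root, flat_dir):
    """Symlink every FITS file under download_root into flat_dir."""
    download_root = Path(download_root)
    flat_dir = Path(flat_dir)
    flat_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(download_root.rglob("*.fits")):
        link = flat_dir / src.name
        if not link.exists():
            link.symlink_to(src.resolve())
```

The symlinks take negligible space, and deleting the flat directory afterwards leaves the original download untouched.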

3) The JWST pipeline changes to allow for handling I/O in different directories.  This seems much more compelling, flexible, and in line with other major data pipelines (e.g., SDSS).

This probably needs to be considered together with the planned change to cfg files.  One option might be to have two top-level yaml-type files produced in the current working directory:

- a single config file that allows specification of overrides for any pipeline step (an example file that a user might modify could be produced by something like the collect_configs script), and
- a file list.  The file list could be provided by MAST with any download bundle, and would specify a root input directory (presumably ./ by default), a root output directory, and paths to input files relative to the root input directory.  Maybe with switches for whether to write exposure-level products (e.g., DET1, SPEC2) into individual directories like the input structure or collect them elsewhere.

Users could easily modify pipeline behaviour using these two files (instead of the current many cfg files) and have flexibility on where their data live.
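Purely as illustration, such a MAST-provided file list might look something like this (every key, value, and file name below is hypothetical; no such format exists yet):

```yaml
# Hypothetical file list delivered with a MAST download bundle
input_dir: ./
output_dir: ./reprocessed
preserve_input_structure: true   # write DET1/SPEC2 products into per-exposure dirs
files:
  - obs001/jw01234001001_01101_00001_nrs1_uncal.fits
  - obs002/jw01234002001_01101_00001_nrs1_uncal.fits
```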

Some potential use cases that would benefit from such a framework:

A) Instrument teams are reprocessing commissioning data across many team members, and frequently need to change pipeline parameters, override reference files, etc.  This will be an extraordinarily common use case during commissioning.  It would be great to download the uncal data from MAST once, with some sensible organization so that individual files can be inspected easily, and save them to a common location on central store so that all team members can work from the same raw data.  The pipeline can then pull inputs from the central store root directory and send outputs to a different directory in which I'm testing new reference file performance.  I want to easily modify 15+ different pipeline parameters (skip this step, override this file, save results here, tweak this parameter there) and have a record of how I modified things.

B) Science user with strong opinions about how to defringe MRS data designs a program to get reference star observations at every dither location for their science target.  They will download data from two different targets, run both through Spec2 with the fringing step turned off, and use their own tools to do something with the science and reference observations.  For ease of organization they group science/reference data at each position into position-based directory structures so that they can find the files easily.  When done, they want to run the Spec3 pipeline on the resulting changed files.

stscijgbot commented 5 years ago

Comment by Michael Regan: I would like to second David's points. The current requirement that all the input and output files be in the same directory will lead to a multiplication of copies of the same data. Also, if I want to compare the results from two different runs of the pipeline with different parameters, I'll need a full second copy of the uncal data.

It should not be too hard for option 3 to be implemented.

stscijgbot commented 5 years ago

Comment by Rosa Diaz: I like option 3 too.

jdavies-st commented 5 years ago

There is already functionality for steps or a pipeline to run with specified input and output directories. From the command line, `--input_dir` and `--output_dir` control this.

There are a couple of known bugs that we need to squash, but the general framework is ready. David's use case A above works well already; I work this way myself frequently using `--output_dir`.
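As a concrete sketch of working this way, one might drive the pipeline from a small script that keeps shared inputs and per-run outputs in separate directories. The `--output_dir` option is the one mentioned above; everything else (paths, file names, the helper itself) is invented for illustration:

```python
# Illustrative helper: build strun command lines that read uncal files
# from a shared input location and write each run's products to its own
# output directory. Paths and file names are hypothetical.
import shlex

def strun_cmd(uncal_path, output_dir, pipeline="calwebb_detector1"):
    """Return a strun command using --output_dir to separate I/O."""
    return "strun {} {} --output_dir={}".format(
        pipeline, shlex.quote(str(uncal_path)), shlex.quote(str(output_dir))
    )

# e.g. two runs from the same shared uncal data, outputs kept apart:
cmd_v1 = strun_cmd("/central/store/uncal/jw01234_uncal.fits", "./run_v1")
cmd_v2 = strun_cmd("/central/store/uncal/jw01234_uncal.fits", "./run_v2")
```

This avoids the second copy of the uncal data that Michael describes: both runs read the same input files and only the products diverge.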

stscijgbot commented 5 years ago

Comment by David Law: @jdavies-st good to know, thanks!

stscijgbot commented 5 years ago

Comment by Alicia Canipe: More use cases for MIRI: MRS users will want to be able to reprocess their data, especially for mosaics. TSO is another case where the users will want to reprocess.

stscijgbot commented 5 years ago

Comment by Alicia Canipe: Just to capture NIRSpec's feedback:

> The simple answer to the question, what are the top 5 cases for reprocessing data, is yes, NIRSpec MOS data will definitely be reprocessed by users and a flat directory structure would make that easier.
>
> But that doesn't necessarily mean we want to get rid of the organization by target/source option. Being able to download L3 products by target/source is going to make it a lot easier for MOS users to quicklook their data and use the DA Tools being developed. But then, things get confusing when you try to organize L2 products by target/source, since the level 2a and 2b products contain multiple sources, unless MAST ends up duplicating instances of those products. If target=pointing, it still won't work in general, since many programs will observe sources in different MSA configurations with different pointings (for example, a non-nod dither). NIRSpec would like to get more information from MAST on what their plans are; is there a contact person?

and Dick Shaw's response:

> To Cheryl's question, we've got as far as identifying the problem: i.e., a mismatch between the file organization within bundles as delivered by MAST, and the flat file organization needed to reprocess data with the CAL pipeline. We in DMS have a fair idea of the solution space, and the sense that it would be great if the user didn't actually have to worry about the organization of files on their local disk. The case of NIRSpec MOS data reprocessing might help us get to a solution that works for everyone.

stscijgbot commented 5 years ago

Comment by Ben Sargent: Use case for NIRSpec IFU: prior to the cube-building step, there may be a "trial and error" aspect to flagging/removing pixels in NIRSpec slope images affected by light from contaminating sources that leak through the MSA.  A user may spend significant time on their own computer flagging a set of pixels to ignore in cube-building, building the cube, unflagging already-flagged pixels and flagging new ones, then trying cube-building again (with potential further iterations of this procedure), to try to reduce/eliminate contamination from sources that leak through the MSA.
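That trial-and-error loop could be sketched roughly as follows. Everything here is hypothetical workflow pseudo-code, not the actual pipeline interface: the pixel set stands in for edits to a DQ array, and `build_cube` stands in for the real cube-building step.

```python
# Hypothetical sketch of the iterative flag/build/inspect workflow
# described above. Pixel coordinates are invented examples.
flagged = set()

def build_cube(flagged_pixels):
    # Placeholder for the real cube-building step, which would ignore
    # the flagged pixels when combining slope images into a cube.
    return {"ignored": frozenset(flagged_pixels)}

# First pass: flag pixels suspected of MSA leakage contamination.
flagged |= {(512, 40), (513, 40)}
cube = build_cube(flagged)

# After inspecting the cube: unflag a false positive, flag a new
# suspect, and rebuild. Further iterations proceed the same way.
flagged.discard((513, 40))
flagged.add((600, 41))
cube = build_cube(flagged)
```

Supporting this locally is exactly the kind of repeated partial re-run (same inputs, many output variants) that the directory-separation discussion above would make cheaper.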

stscijgbot commented 5 years ago

Comment by Ben Sargent: Use case for MIRI-MRS and/or NIRSpec IFU: a user may want to set user-selectable parameters for cube-building different from the defaults, and then re-build the cubes.

stscijgbot commented 5 years ago

Comment by Richard A. Shaw: Questions for the hive mind:

To what extent should we expect users to want to re-process starting from the beginning (L-1b for them), vs. reprocess from L-2b to L-3? I would guess the answer might be very different for instrument scientists vs. astronomers in the wild. The answer may also be different early in Cy 1, vs. subsequent cycles.

Apart from cases like extraction or background subtraction for MOS spectra; subtraction of a PSF from coronagraphic observations; or creating dither-combined mosaics from an incomplete imaging survey, what are the most likely cases where re-running the CAL pipeline would be advantageous for a general user?


stscijgbot commented 4 years ago

Comment by Alicia Canipe: Feedback was provided and submitted.