spacetelescope / romancal

Python library to process science observations from the Nancy Grace Roman Space Telescope
https://roman-pipeline.readthedocs.io/en/latest/
Other
28 stars 28 forks source link

`group_id` in romancal #1259

Open braingram opened 1 month ago

braingram commented 1 month ago

For some operations (tweakreg, skymatch, resample) models are grouped based on a group_id.

This is currently defined in ModelContainer.models_grouped: https://github.com/spacetelescope/romancal/blob/ed6187ffa790fc6a37dd96f7365ba1ef8261fa74/romancal/datamodels/container.py#L498-L504 which assigns a computed value to model.meta.group_id

models_grouped is used in:

Currently group id is always defined in models_grouped ignoring any group_id in the model metadata (and ignoring any group_id in the association, which is used for jwst).

Most of the above steps are part of the mosaic pipeline with the only exception being tweakreg which is part of the exposure level pipeline.

There is no test for the exposure level pipeline that doesn't skip tweakreg so it is difficult to guess what should be done here.

For the mosaic pipeline it is possible the mosaic will contain models from different groups which will result in:

Some questions I have:

nden commented 1 month ago

group_id is cheaper to compute from the pool file and add to the association than to compute it when it's needed in the L3 pipeline. All the information needed to compute it should be in the pool file. Adding it will remove a lot of opening and closing of files in the Level 3 pipeline. The fact that its use hasn't been formalized is a problem in both pipelines. JWST is moving towards formalizing its use and an example jwst association is here

https://github.com/spacetelescope/jwst/blob/master/jwst/datamodels/container.py#L102-L129

Is there a reason not to take advantage of this?

nden commented 1 month ago

@stscijgbot-rstdms

stscijgbot-rstdms commented 1 month ago

This issue is tracked on JIRA as RCAL-852.

schlafly commented 1 month ago

Thanks Brett. Broadly, I suspect that we should use this, with exposures defining the group; it would sound like that would mean adding group_id = L2 file name minus the suffix and the SCA name.

Some more specific comments:

braingram commented 4 weeks ago

Thanks! To check that I'm following the discussion so far:

Since there are 2 uses for grouping (outlier detection and resample) and the code currently expects the grouping to be done in the container it sound like it makes sense to keep the grouping API on the container. This also has the benefit that if another step wants to used the groupings it's already part of the common code (as it is now with ModelContainer).

One additional question is do we want to support having the group id in the association? As @nden mentioned this is much more efficient (currently models_grouped will open every model as that's the only way to determine the grouping). Allowing this would also make it easier to make the container part of the common jwst/romancal code (perhaps in stpipe or stcal) as jwst allows setting the group id in the association. Allowing group id to be defined in the association would mean that a user could modify the group id by changing the association (although the user could currently modify the group id by changing the files). I don't see this last point as a negative and given the other benefits do you think it's worth allowing the group id to be defined in the association?

braingram commented 4 weeks ago

Looking at the group id calculation the current ModelContainer generates a group id that doesn't quite match the l2 filenames (ignoring the SCA name and suffix).

in l2 filename in group id
program id ✔️
execution plan
pass
segment
visit number ✔️
visit group ✔️
sequence id ✔️
activity ✔️
exposure ✔️

Looking at the RAD schemas, would using meta.observation.obs_id make sense: https://github.com/spacetelescope/rad/blob/5ccb720777ed90faf13897ef92adf738693d8024/src/rad/resources/schemas/observation-1.0.0.yaml#L9 Is this used and kept up-to-date? If so it looks to be the same as the l2 filename minus the SCA and suffix.

schlafly commented 4 weeks ago
  • outlier detection could likely use grouping based on a pattern similar to the L2 file name
  • grouping is not needed in skymatch as the skys for SCAs will be matched separately. As it appears the group ids are already being ignored here I think this matches what is currently being done.
  • resample might be the other reason (aside from outlier detection) to keep the grouping API in the container (as it's currently used in this step)

Yes. Re sky-matching, we're not intending to fit a single offset per exposure at present, but there is a lot of uncertainty about what level of flexibility the sky matching will end up needing to have. So if it's very easy to have an option to either group by exposure or do individual SCAs, that would be nice. But it sounds like we're following our intended baseline.

One additional question is do we want to support having the group id in the association? As @nden mentioned this is much more efficient (currently models_grouped will open every model as that's the only way to determine the grouping). Allowing this would also make it easier to make the container part of the common jwst/romancal code (perhaps in stpipe or stcal) as jwst allows setting the group id in the association. Allowing group id to be defined in the association would mean that a user could modify the group id by changing the association (although the user could currently modify the group id by changing the files). I don't see this last point as a negative and given the other benefits do you think it's worth allowing the group id to be defined in the association?

That's fine; I don't feel strongly here. The only vague awkwardness about defining it in the file is that I really don't see many cases where we want the group to be something other than an exposure. So there's a little redundancy, but if that redundancy leads us to match Webb it's probably worth it.

I can confirm that recently all L1 files produced by romanisim started including obs_id, and that it looks to me to include all we need to identify an exposure. A potential implementation concern is that it would be easy to try to make some fake exposures by copying real exposures and changing the file names, believing that one was generating exposures that would be treated as separate exposures. But if we fail to update obs_id you could be surprised when they're all part of the same exposure. Or, relatedly, we keep the same information in the actual file name, in meta.filename, and in obsid, and it can be hard to keep these in sync. So I might prefer a default group based on the first N characters of the file name. But I don't feel strongly here.

braingram commented 4 weeks ago

Keeping things in sync is definitely a concern. Perhaps this is off topic but it looks like obs_id is not used in romancal: https://github.com/search?q=repo%3Aspacetelescope%2Fromancal+obs_id&type=code Since this is a combination of other metadata is there a reason it's a separate metadata entry (and not a dynamic property of the datamodel)? The same is true for visit_id and in some sense meta.filename could also be dynamic (perhaps a filename property of the datamodel). This might help to avoid issues like https://github.com/spacetelescope/romancal/issues/818

schlafly commented 4 weeks ago

Right, obs_id is set in romanisim during the creation of the L1 files and then not changed by romancal. Yes, I would be sympathetic to removing meta.filename, but I'm told stpipe cares about it and so that would be a bigger lift. I do think there's a use to having an overall exposure id (obs_id presently?) in the downstream archive, etc., so I'm not really militating for getting rid of obs_id.

braingram commented 4 weeks ago

Thanks! Good point about stpipe and meta.filename. Let's leave any changes to that to a separate issue. It's certainly confusing to me.