`group_id` in romancal - Githubissues

braingram commented 1 month ago

For some operations (tweakreg, skymatch, resample) models are grouped based on a group_id.

This is currently defined in ModelContainer.models_grouped: https://github.com/spacetelescope/romancal/blob/ed6187ffa790fc6a37dd96f7365ba1ef8261fa74/romancal/datamodels/container.py#L498-L504 which assigns a computed value to model.meta.group_id

models_grouped is used in:

resample many_to_many and many_to_one and update_exposure_times (some of which is used in outlier detection)
skymatch
tweakreg

Currently group id is always defined in models_grouped ignoring any group_id in the model metadata (and ignoring any group_id in the association, which is used for jwst).

Most of the above steps are part of the mosaic pipeline with the only exception being tweakreg which is part of the exposure level pipeline.

There is no test for the exposure level pipeline that doesn't skip tweakreg so it is difficult to guess what should be done here.

For the mosaic pipeline it is possible the mosaic will contain models from different groups which will result in:

outlier detection using the median across groups to detect outlier (however in practice this isn't occurring as outlier detection it taking a median with a N=1)
skymatch can group models with the same group id into SkyGroup instances (which have their sky adjusted together). However this appears to be partially supported where skygroups are never created and instead the sky for each model is adjusted separately
resample will call many_to_one (it appears the single argument is not passed to ResampleData so this is likely always the case). During many_to_one each model is drizzled separately however the grouping is used for getting the number of pointings and for updating the exposure times.

Some questions I have:

Should outlier detection be using many to many and using the groups?
Should skymatch be adjusting the sky of models from the same group together (like is done for jwst)?
Should resample be using the groupings to determine the exposure time and number of pointings?

nden commented 1 month ago

group_id is cheaper to compute from the pool file and add to the association than to compute it when it's needed in the L3 pipeline. All the information needed to compute it should be in the pool file. Adding it will remove a lot of opening and closing of files in the Level 3 pipeline. The fact that its use hasn't been formalized is a problem in both pipelines. JWST is moving towards formalizing its use and an example jwst association is here

https://github.com/spacetelescope/jwst/blob/master/jwst/datamodels/container.py#L102-L129

Is there a reason not to take advantage of this?

nden commented 1 month ago

@stscijgbot-rstdms

stscijgbot-rstdms commented 1 month ago

This issue is tracked on JIRA as RCAL-852.

schlafly commented 1 month ago

Thanks Brett. Broadly, I suspect that we should use this, with exposures defining the group; it would sound like that would mean adding group_id = L2 file name minus the suffix and the SCA name.

Some more specific comments:

in outlier detection, I suspect that in terms of the final output this doesn't matter, but practically using the groups will result in denser resampled images contributing to the median filtering (with the caveat, as you say, that the current median filtering isn't right)
Our goal for the sky matching was to match SCAs and not to match the whole array, so that "sounds right" to me for the moment. i.e., we wouldn't want to make SkyGroups that merge the whole array together. But I think I understand you to be saying that if we set these group ids that presently undesirable behavior wouldn't happen automatically.
For resample, it sounds like the exposure times will be fine either way---since the exposure times for the different SCAs of the same exposure are the same, we don't really care whether those max's and min's are grouped or ungrouped. The number of pointings would be more natural if we set the groups.

braingram commented 4 weeks ago

Thanks! To check that I'm following the discussion so far:

outlier detection could likely use grouping based on a pattern similar to the L2 file name
grouping is not needed in skymatch as the skys for SCAs will be matched separately. As it appears the group ids are already being ignored here I think this matches what is currently being done.
resample might be the other reason (aside from outlier detection) to keep the grouping API in the container (as it's currently used in this step)

Since there are 2 uses for grouping (outlier detection and resample) and the code currently expects the grouping to be done in the container it sound like it makes sense to keep the grouping API on the container. This also has the benefit that if another step wants to used the groupings it's already part of the common code (as it is now with ModelContainer).

One additional question is do we want to support having the group id in the association? As @nden mentioned this is much more efficient (currently models_grouped will open every model as that's the only way to determine the grouping). Allowing this would also make it easier to make the container part of the common jwst/romancal code (perhaps in stpipe or stcal) as jwst allows setting the group id in the association. Allowing group id to be defined in the association would mean that a user could modify the group id by changing the association (although the user could currently modify the group id by changing the files). I don't see this last point as a negative and given the other benefits do you think it's worth allowing the group id to be defined in the association?

braingram commented 4 weeks ago

Looking at the group id calculation the current ModelContainer generates a group id that doesn't quite match the l2 filenames (ignoring the SCA name and suffix).

in l2 filename	in group id
program id	✔️
execution plan	❌
pass	❌
segment	❌
visit number	✔️
visit group	✔️
sequence id	✔️
activity	✔️
exposure	✔️

Looking at the RAD schemas, would using meta.observation.obs_id make sense: https://github.com/spacetelescope/rad/blob/5ccb720777ed90faf13897ef92adf738693d8024/src/rad/resources/schemas/observation-1.0.0.yaml#L9 Is this used and kept up-to-date? If so it looks to be the same as the l2 filename minus the SCA and suffix.

schlafly commented 4 weeks ago

outlier detection could likely use grouping based on a pattern similar to the L2 file name

grouping is not needed in skymatch as the skys for SCAs will be matched separately. As it appears the group ids are already being ignored here I think this matches what is currently being done.

resample might be the other reason (aside from outlier detection) to keep the grouping API in the container (as it's currently used in this step)

Yes. Re sky-matching, we're not intending to fit a single offset per exposure at present, but there is a lot of uncertainty about what level of flexibility the sky matching will end up needing to have. So if it's very easy to have an option to either group by exposure or do individual SCAs, that would be nice. But it sounds like we're following our intended baseline.

One additional question is do we want to support having the group id in the association? As @nden mentioned this is much more efficient (currently models_grouped will open every model as that's the only way to determine the grouping). Allowing this would also make it easier to make the container part of the common jwst/romancal code (perhaps in stpipe or stcal) as jwst allows setting the group id in the association. Allowing group id to be defined in the association would mean that a user could modify the group id by changing the association (although the user could currently modify the group id by changing the files). I don't see this last point as a negative and given the other benefits do you think it's worth allowing the group id to be defined in the association?

That's fine; I don't feel strongly here. The only vague awkwardness about defining it in the file is that I really don't see many cases where we want the group to be something other than an exposure. So there's a little redundancy, but if that redundancy leads us to match Webb it's probably worth it.

I can confirm that recently all L1 files produced by romanisim started including obs_id, and that it looks to me to include all we need to identify an exposure. A potential implementation concern is that it would be easy to try to make some fake exposures by copying real exposures and changing the file names, believing that one was generating exposures that would be treated as separate exposures. But if we fail to update obs_id you could be surprised when they're all part of the same exposure. Or, relatedly, we keep the same information in the actual file name, in meta.filename, and in obsid, and it can be hard to keep these in sync. So I might prefer a default group based on the first N characters of the file name. But I don't feel strongly here.

braingram commented 4 weeks ago

Keeping things in sync is definitely a concern. Perhaps this is off topic but it looks like obs_id is not used in romancal: https://github.com/search?q=repo%3Aspacetelescope%2Fromancal+obs_id&type=code Since this is a combination of other metadata is there a reason it's a separate metadata entry (and not a dynamic property of the datamodel)? The same is true for visit_id and in some sense meta.filename could also be dynamic (perhaps a filename property of the datamodel). This might help to avoid issues like https://github.com/spacetelescope/romancal/issues/818

schlafly commented 4 weeks ago

Right, obs_id is set in romanisim during the creation of the L1 files and then not changed by romancal. Yes, I would be sympathetic to removing meta.filename, but I'm told stpipe cares about it and so that would be a bigger lift. I do think there's a use to having an overall exposure id (obs_id presently?) in the downstream archive, etc., so I'm not really militating for getting rid of obs_id.

braingram commented 4 weeks ago

Thanks! Good point about stpipe and meta.filename. Let's leave any changes to that to a separate issue. It's certainly confusing to me.

spacetelescope / romancal

`group_id` in romancal #1259