ell-bogat closed this 2 weeks ago
Okay I think this PR is ready for review! Here are the things that have changed since I opened the PR:
- `FWC_EM` → `FWC_EM_E`: full well capacity in the EM gain register, in electrons
- `FWC_PP` → `FWC_PP_E`: full well capacity in the image array, before EM gain, in electrons
- `SAT_FWC` → `SAT_DN`: saturation threshold used, in DN

Centralizing the DQ flag definitions and ways to interact with the DQ array will be in a separate PR :)
Let me know what you think!
@kjl0025 given this is a big port of II&T, you may want to look at this as well.
As for Ell's questions:
> Should we have a dictionary of the DQ flags stored centrally somewhere in corgidrp in case they change?

I think we should have a file called something like `dq_vals.yaml` in a folder called `util`, and the file would contain the numerical values (to avoid hardcoding numbers in the source code).
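For illustration, here is a minimal sketch of what centralized flag definitions could look like. It uses a plain dict so the sketch is self-contained; the proposed `dq_vals.yaml` would hold the same name-to-bit-value mapping. All flag names and bit assignments below are hypothetical, not the actual corgidrp values.

```python
# Hypothetical central DQ flag definitions (in the proposal these would live
# in a YAML file like corgidrp/util/dq_vals.yaml and be loaded at import).
DQ_FLAGS = {
    "bad_pixel": 1,    # bit 1
    "cosmic_ray": 2,   # bit 2
    "hot_pixel": 4,    # bit 3
    "saturated": 8,    # bit 4
}

def has_flag(dq_value, flag_name):
    """True if the given flag bit is set in a DQ value."""
    return bool(dq_value & DQ_FLAGS[flag_name])

def set_flag(dq_value, flag_name):
    """Return dq_value with the given flag bit set (idempotent)."""
    return dq_value | DQ_FLAGS[flag_name]
```

Because setting a bit is idempotent, helpers like these also sidestep the double-counting concern raised below.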
> Should we decode the incoming DQ array to see if any pixels have already been manually flagged for saturation/CRs, or assert that the incoming DQ array should be all zeros? This is the first official step in the pipeline where we touch the DQ array.

I think we should decode it to avoid double-counting, as you suggested in a comment in your code. If you want to do that in this branch, great. If not, you can make a new issue about that.
Adding a yaml file is also part of the source code, so I don't see the difference. In my opinion, adding more files just creates more possible ways we can have path error bugs, given the various ways python installs packages.
There will be default .yaml files in the source code, but a user can input different .yaml files from a personal directory (not from source code) to the function in the Docker image if different settings are desired. This was our standard practice with GSW.
For reproducibility purposes, that means we have to track an extra input in order to reproduce a given pipeline reduction. That seems less ideal to me. I don't see the need for redefining DQ values.
If we had different values like this, then I think we'd have to encode them in the headers, which I agree would be less than ideal. Kevin, do you mind describing a case where we might redefine what the specific bit flags mean? If there's a clear reason that we might do this then I'm game to consider it, but off the top of my head I don't quite see why.
> Should we decode the incoming DQ array to see if any pixels have already been manually flagged for saturation/CRs, or assert that the incoming DQ array should be all zeros? This is the first official step in the pipeline where we touch the DQ array. I think we should decode it to avoid double-counting, as you suggested in a comment in your code. If you want to do that in this branch, great. If not, you can make a new issue about that.
I think we should 100% do this, rather than assume that they are all zeros. This is also relevant for PR #93, @hsergi. I'll make a comment where it's relevant.
You would just need the date, which is already tracked in the ext_hdr. We should be able to add calibration files to the caldb that are dated, right? So the Walker picks out the relevant calibration files according to date, so I figured a similar thing could be done for other detector properties which may change over the course of years (FWC, k gain, charge traps, etc). Many will not, but I think a good standard practice is to have maximum flexibility and to hardcode no numbers. (As for the DQ values in particular, in practice, I agree that there's no reason why those should change, so they wouldn't necessarily need their config file. My bent, though, is to keep all numbers in config files because of the nature of Docker.)
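The date-keyed lookup described here might be sketched as follows. The property, dates, and values below are all made up for illustration; the real design would pull the observation date from `ext_hdr` and the entries from a tracked calibration product rather than a hardcoded dict.

```python
from datetime import date

# Hypothetical dated history for a slowly-changing detector property
# (k gain here); values and dates are invented for this sketch.
KGAIN_HISTORY = {
    date(2024, 1, 1): 8.7,   # e.g. a ground/TVAC value
    date(2026, 6, 1): 8.5,   # e.g. a later on-sky recalibration
}

def kgain_for_date(obs_date):
    """Return the k gain entry in effect on obs_date (most recent prior)."""
    valid = [d for d in KGAIN_HISTORY if d <= obs_date]
    if not valid:
        raise ValueError("no k gain calibration on or before this date")
    return KGAIN_HISTORY[max(valid)]
```

This is the same "pick the relevant entry by date" behavior attributed to the Walker for calibration files, just applied to a scalar detector property.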
To be honest, I'm not even quite sure what we're talking about here. Right now, the 3rd bit represents a hot pixel (according to the implementation doc). Are we suggesting that one day we might switch this to be, for example, the 4th bit?
> Should we decode the incoming DQ array to see if any pixels have already been manually flagged for saturation/CRs, or assert that the incoming DQ array should be all zeros? This is the first official step in the pipeline where we touch the DQ array. I think we should decode it to avoid double-counting, as you suggested in a comment in your code. If you want to do that in this branch, great. If not, you can make a new issue about that.

> I think we should 100% do this, rather than assume that they are all zeros. This is also relevant for PR #93, @hsergi. I'll make a comment where it's relevant.
I think this can be done pretty easily in only one or two lines with `numpy.packbits`, `numpy.unpackbits`, and `numpy.bitwise_or`.
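A quick sketch of that approach, folding newly detected flags into an existing DQ array without double-counting. The bit values here (2 for cosmic ray, 8 for saturated) are hypothetical placeholders, not the actual corgidrp assignments.

```python
import numpy as np

# Hypothetical bit values for this sketch.
CR_FLAG, SAT_FLAG = 2, 8

existing_dq = np.array([[0, 2], [8, 0]], dtype=np.uint8)  # prior flags
cr_mask = np.array([[1, 1], [0, 0]], dtype=bool)          # CRs found this step
sat_mask = np.array([[0, 0], [1, 1]], dtype=bool)         # saturation this step

# Bitwise OR is idempotent, so re-flagging an already-flagged pixel is
# harmless: no bit is ever counted twice.
dq = np.bitwise_or(existing_dq, cr_mask.astype(np.uint8) * CR_FLAG)
dq = np.bitwise_or(dq, sat_mask.astype(np.uint8) * SAT_FLAG)

# np.unpackbits decodes each uint8 DQ value into its 8 bits (MSB first),
# which is the "decode the incoming DQ array" step discussed above.
bits = np.unpackbits(dq.reshape(-1, 1), axis=1)
```

`np.packbits` would do the reverse if the flags were ever carried around as separate boolean planes.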
> To be honest, I'm not even quite sure what we're talking about here. Right now, the 3rd bit represents a hot pixel (according to the implementation doc). Are we suggesting that one day we might switch this to be, for example, the 4th bit?
When I raised the question I was thinking more that the DQ flags might evolve slightly just while we're developing the pipeline. (edit: sp)
I still don't quite understand what we mean by "evolve" here. That the saturation value might change? Yes, that may change. But changing the convention that bit 5 means saturated? Almost certainly not.
Gotcha, we can probably leave it hardcoded then.
> You would just need the date, which is already tracked in the ext_hdr. We should be able to add calibration files to the caldb that are dated, right? So the Walker picks out the relevant calibration files according to date, so I figured a similar thing could be done for other detector properties which may change over the course of years (FWC, k gain, charge traps, etc). Many will not, but I think a good standard practice is to have maximum flexibility and to hardcode no numbers. (As for the DQ values in particular, in practice, I agree that there's no reason why those should change, so they wouldn't necessarily need their own config file. My bent, though, is to keep all numbers in config files because of the nature of Docker.)
This feels like to me it's just creating extra overhead for not much increase in flexibility. It's basically making a second calibration tracking system and requires us to have config files in specific directories, which creates extra configuration issues (recall this pipeline is not just for CTC, it's also for general science users, and we'll need to support both). I think we should keep them hard-coded for now or use functions like we are doing with FWC and kgain.
I'm just imagining scenarios where having the ability to input custom parameters would be useful. Say the k gain is calibrated in space, and the value is slightly different from what we got in TVAC. If the functions only allow the k gain number to be drawn from source code, then a user could not use that updated number without a version update and re-release of the Docker image (assuming that would happen?).

Now that I think about this again, how about something like this: we allow a config .yaml file as an optional function input for any function requiring parameter inputs, so that a user could use, for example, the updated k gain number without waiting for a new release (if that would happen). If no .yaml file is specified in the input, the code just uses Ell's parameter retrieval functions, which could have numbers in them, one of which is selected based on the timestamp in the ext_hdr of the input dataset. @mygouf, what do you think about this? None of these parameters are export-controlled, and they are already exposed in the II&T code, which is open-source.
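The optional-config-file pattern proposed here might look roughly like this. JSON is used only to keep the sketch stdlib-only (the proposal was YAML; the pattern is identical), and the function name, keys, and default value are all invented for illustration.

```python
import json
import os

# Made-up built-in defaults; in the proposal these would instead come from
# Ell's parameter retrieval functions, keyed on the ext_hdr timestamp.
DEFAULT_PARAMS = {"k_gain": 8.7}

def get_detector_params(config_path=None):
    """Return detector parameters, letting a user-supplied file override.

    If config_path is given and exists, its entries override the defaults;
    otherwise the built-in defaults are used unchanged.
    """
    if config_path is not None and os.path.exists(config_path):
        with open(config_path) as f:
            user_params = json.load(f)
        return {**DEFAULT_PARAMS, **user_params}  # user keys win
    return dict(DEFAULT_PARAMS)
```

The key property is that a user running a frozen Docker image could still point the function at a file on their own machine, while the no-argument call reproduces the release defaults.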
This sounds like a case where we are still analyzing the data as a user, not doing the final processing of the data at the CTC. Any user can modify their version of the pipeline to change the value of the k gain (modifying a python file is the same as modifying a yaml file). When we are happy with a number, we would just do a pipeline re-release, and the new version would be deployed via Docker at the CTC, which will do the official processing. I don't see us changing the default CTC reduction on the fly -- that would be less than ideal, and prone to mistakes.
With .yaml input, the benefit is that a user can specify a custom file on their own computer as an input to the Docker code (whereas they can't change a source-code .yaml file or script once the code is in Docker). But it sounds like you're saying that everyday users will never be in the situation where they are stuck with a Docker image only? Maybe I am misunderstanding the purpose of the Docker release. So the CTC will be using Docker? And I guess the point is that we don't want the CTC to be able to change any parameters, since what they're given is supposed to be correct?
Yes users can just clone this repository!
The CTC has expressed some interest in Docker, but I personally am not assuming that we won't need to tweak things once we have real on-sky data....
(I may not have the same assumptions as CTC...)
Docker would only be for the CTC (if they choose to go down that route). Everyday users will run the DRP by cloning the GitHub repo or installing via pip/PyPI.
Max, Vanessa, Marie, and I chatted about the yaml file/python function decision. We think the best option may be a third option: use the calibration file architecture for these detector properties that may change slowly over the years. This is basically our version of the yaml files, but it doesn't require us to set up a different tracking system (detector properties will just be another calibration file we can generate as needed). The only con is that they are a bit more difficult to modify than yaml files, but the plan will be to write a function that generates these calibration files, so you don't have to format a FITS file each time. Let's do this on a separate branch/PR though.
@semaphoreP I think it's ready to merge!
This pull request adds a step that does both cosmic ray flagging and saturation flagging.
New code:
- `detect_cosmic_rays()` takes an `input_dataset` to process, the multiplication factors for the full-well capacity that determine the threshold for saturation (`sat_thresh`, defaults to 0.99) and for cosmic ray plateaus (`plat_thresh`, defaults to 0.85), and the minimum length of the cosmic ray plateau that it will detect (`cosm_filter`, defaults to 2).
- The full-well capacity used for thresholding is the minimum of [`fwc_em`, `fwc_pp * em_gain`]. Currently asserts that these values must be the same for all the frames in the dataset.
- Saves the saturation threshold (the FWC scaled by `sat_thresh`) as `'SAT_FWC'` in the frame header. Probably need to specify units somehow, or maybe this isn't important to save?
- Assumes `'EM_GAIN'`, `'FWC_EM'`, and `'FWC_PP'` will be available as keywords in the frame header.
- Uses `flag_cosmics()` from the II&T code to flag cosmic rays.
- Uses `find_plateaus()` from the II&T code to identify CR plateaus in individual rows.
- Adds `create_cr_dataset()`, which creates mock data with a specified number of CR hits distributed randomly in the data, modeled as a high plateau followed by an exponential decay tail. The user can also specify the EM gain, EM full well capacity, and "pp" full well capacity (i.e. the FWC before the EM gain), which will be stored in the header.

Example:
Mock data with CRs:
![image](https://github.com/roman-corgi/corgidrp/assets/74611610/aeb0a550-48bd-4b35-bca3-e28cb4e9fbc1)
Data after saturation and CR flagging:
*Note: the DQ values look lower than they should be in the image b/c of matplotlib interpolation, but they're correct in the array!
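The saturation-thresholding part of the step above can be sketched in a few lines, assuming the threshold is `sat_thresh` times the lesser of `fwc_em` and `fwc_pp * em_gain` as listed in the description (the plateau-finding logic lives in `flag_cosmics()`/`find_plateaus()` and is not reproduced here). The frame values and FWC numbers in the usage example are made up.

```python
import numpy as np

def saturation_threshold(fwc_em, fwc_pp, em_gain, sat_thresh=0.99):
    """Effective full well capacity scaled by the saturation factor."""
    return sat_thresh * min(fwc_em, fwc_pp * em_gain)

def flag_saturated(frame, fwc_em, fwc_pp, em_gain, sat_thresh=0.99):
    """Boolean mask of pixels at or above the saturation threshold."""
    return frame >= saturation_threshold(fwc_em, fwc_pp, em_gain, sat_thresh)

# Usage with invented numbers: min(100000, 90 * 1000) * 0.99 = 89100 DN,
# so only the 99000 DN pixel gets flagged.
frame = np.array([[100.0, 99000.0], [50.0, 200.0]])
mask = flag_saturated(frame, fwc_em=100000.0, fwc_pp=90.0, em_gain=1000.0)
```

Pixels flagged this way would then be OR'd into the DQ array rather than overwriting it.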
Questions:
- Is it safe to assume that `'EM_GAIN'`, `'FWC_EM'`, and `'FWC_PP'` will be provided in the frame header?