ell-bogat closed this 2 weeks ago
Okay I think this PR is ready for review! Here are the things that have changed since I opened the PR:
- `FWC_EM` → `FWC_EM_E`: full well capacity in the EM gain register, in electrons
- `FWC_PP` → `FWC_PP_E`: full well capacity in the image array, before EM gain, in electrons
- `SAT_FWC` → `SAT_DN`: saturation threshold used, in DN

Centralizing the DQ flag definitions and ways to interact with the DQ array will be in a separate PR :)
Let me know what you think!
@kjl0025 given this is a big port of II&T, you may want to look at this as well.
As for Ell's questions:
> Should we have a dictionary of the DQ flags stored centrally somewhere in corgidrp in case they change?

I think we should have a file called something like `dq_vals.yaml` in a folder called `util`, and the file would contain the numerical values (to avoid hardcoding numbers in the source code).
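For illustration, here is a minimal sketch of what centralized flag definitions could look like. It uses a plain dict so the sketch is self-contained; the proposed `dq_vals.yaml` would hold the same name-to-bit-value mapping. All flag names and bit assignments below are hypothetical, not the actual corgidrp values.

```python
# Hypothetical central DQ flag definitions (in the proposal these would live
# in a YAML file like corgidrp/util/dq_vals.yaml and be loaded at import).
DQ_FLAGS = {
    "bad_pixel": 1,    # bit 1
    "cosmic_ray": 2,   # bit 2
    "hot_pixel": 4,    # bit 3
    "saturated": 8,    # bit 4
}

def has_flag(dq_value, flag_name):
    """True if the given flag bit is set in a DQ value."""
    return bool(dq_value & DQ_FLAGS[flag_name])

def set_flag(dq_value, flag_name):
    """Return dq_value with the given flag bit set (idempotent)."""
    return dq_value | DQ_FLAGS[flag_name]
```

Because setting a bit is idempotent, helpers like these also sidestep the double-counting concern raised below.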
> Should we decode the incoming DQ array to see if any pixels have already been manually flagged for saturation/CRs, or assert that the incoming DQ array should be all zeros? This is the first official step in the pipeline where we touch the DQ array.

I think we should decode it to avoid double-counting, as you suggested in a comment in your code. If you want to do that in this branch, great. If not, you can make a new issue about that.
Adding a yaml file is also part of the source code, so I don't see the difference. In my opinion, adding more files just creates more possible ways we can have path error bugs, given the various ways python installs packages.
There will be default .yaml files in the source code, but a user can input different .yaml files from a personal directory (not from source code) to the function in the Docker image if different settings are desired. This was our standard practice with GSW.
For reproducibility purposes, that means we have to track an extra input in order to reproduce a given pipeline reduction. That seems less ideal to me. I don't see the need for redefining DQ values.
If we had different values like this, then I think we'd have to encode them in the headers, which I agree would be less than ideal. Kevin, do you mind describing a case where we might redefine what the specific bit flags mean? If there's a clear reason that we might do this then I'm game to consider it, but off the top of my head I don't quite see why.
> Should we decode the incoming DQ array to see if any pixels have already been manually flagged for saturation/CRs, or assert that the incoming DQ array should be all zeros? This is the first official step in the pipeline where we touch the DQ array. I think we should decode it to avoid double-counting, as you suggested in a comment in your code. If you want to do that in this branch, great. If not, you can make a new issue about that.
I think we should 100% do this, rather than assume that they are all zeros. This is also relevant for PR #93, @hsergi. I'll make a comment where it's relevant.
You would just need the date, which is already tracked in the ext_hdr. We should be able to add calibration files to the caldb that are dated, right? So the Walker picks out the relevant calibration files according to date, so I figured a similar thing could be done for other detector properties which may change over the course of years (FWC, k gain, charge traps, etc). Many will not, but I think a good standard practice is to have maximum flexibility and to hardcode no numbers. (As for the DQ values in particular, in practice, I agree that there's no reason why those should change, so they wouldn't necessarily need their config file. My bent, though, is to keep all numbers in config files because of the nature of Docker.)
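The date-keyed lookup described here might be sketched as follows. The property, dates, and values below are all made up for illustration; the real design would pull the observation date from `ext_hdr` and the entries from a tracked calibration product rather than a hardcoded dict.

```python
from datetime import date

# Hypothetical dated history for a slowly-changing detector property
# (k gain here); values and dates are invented for this sketch.
KGAIN_HISTORY = {
    date(2024, 1, 1): 8.7,   # e.g. a ground/TVAC value
    date(2026, 6, 1): 8.5,   # e.g. a later on-sky recalibration
}

def kgain_for_date(obs_date):
    """Return the k gain entry in effect on obs_date (most recent prior)."""
    valid = [d for d in KGAIN_HISTORY if d <= obs_date]
    if not valid:
        raise ValueError("no k gain calibration on or before this date")
    return KGAIN_HISTORY[max(valid)]
```

This is the same "pick the relevant entry by date" behavior attributed to the Walker for calibration files, just applied to a scalar detector property.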
To be honest, I'm not even quite sure what we're talking about here. Right now, the 3rd bit represents a hot pixel (according to the implementation doc). Are we suggesting that one day we might switch this to be, for example, the 4th bit?
> Should we decode the incoming DQ array to see if any pixels have already been manually flagged for saturation/CRs, or assert that the incoming DQ array should be all zeros? This is the first official step in the pipeline where we touch the DQ array. I think we should decode it to avoid double-counting, as you suggested in a comment in your code. If you want to do that in this branch, great. If not, you can make a new issue about that.

> I think we should 100% do this, rather than assume that they are all zeros. This is also relevant for PR #93, @hsergi. I'll make a comment where it's relevant.
I think this can be done pretty easily in only one or two lines with `numpy.packbits`, `numpy.unpackbits`, and `numpy.bitwise_or`.
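A quick sketch of that approach, folding newly detected flags into an existing DQ array without double-counting. The bit values here (2 for cosmic ray, 8 for saturated) are hypothetical placeholders, not the actual corgidrp assignments.

```python
import numpy as np

# Hypothetical bit values for this sketch.
CR_FLAG, SAT_FLAG = 2, 8

existing_dq = np.array([[0, 2], [8, 0]], dtype=np.uint8)  # prior flags
cr_mask = np.array([[1, 1], [0, 0]], dtype=bool)          # CRs found this step
sat_mask = np.array([[0, 0], [1, 1]], dtype=bool)         # saturation this step

# Bitwise OR is idempotent, so re-flagging an already-flagged pixel is
# harmless: no bit is ever counted twice.
dq = np.bitwise_or(existing_dq, cr_mask.astype(np.uint8) * CR_FLAG)
dq = np.bitwise_or(dq, sat_mask.astype(np.uint8) * SAT_FLAG)

# np.unpackbits decodes each uint8 DQ value into its 8 bits (MSB first),
# which is the "decode the incoming DQ array" step discussed above.
bits = np.unpackbits(dq.reshape(-1, 1), axis=1)
```

`np.packbits` would do the reverse if the flags were ever carried around as separate boolean planes.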
> To be honest, I'm not even quite sure what we're talking about here. Right now, the 3rd bit represents a hot pixel (according to the implementation doc). Are we suggesting that one day we might switch this to be, for example, the 4th bit?
When I raised the question I was thinking more that the DQ flags might evolve slightly just while we're developing the pipeline. (edit: sp)
I still don't quite understand what we mean by "evolve" here. That the saturation value might change? Yes, that may change. But changing the convention that bit 5 means saturated? Almost certainly not.
Gotcha, we can probably leave it hardcoded then.
> You would just need the date, which is already tracked in the ext_hdr. We should be able to add calibration files to the caldb that are dated, right? So the Walker picks out the relevant calibration files according to date, so I figured a similar thing could be done for other detector properties which may change over the course of years (FWC, k gain, charge traps, etc). Many will not, but I think a good standard practice is to have maximum flexibility and to hardcode no numbers. (As for the DQ values in particular, in practice, I agree that there's no reason why those should change, so they wouldn't necessarily need their own config file. My bent, though, is to keep all numbers in config files because of the nature of Docker.)
This feels like to me it's just creating extra overhead for not much increase in flexibility. It's basically making a second calibration tracking system and requires us to have config files in specific directories, which creates extra configuration issues (recall this pipeline is not just for CTC, it's also for general science users, and we'll need to support both). I think we should keep them hard-coded for now or use functions like we are doing with FWC and kgain.
I'm just imagining scenarios where having the ability to input custom parameters would be useful. Say the k gain is calibrated in space, and the value is slightly different from what we got in TVAC. If the functions only allow the k gain number to be drawn from source code, then a user could not use that updated number without a version update and re-release of the Docker image (assuming that would happen?).

Now that I think about this again, how about something like this: we allow a config .yaml file as an optional function input for any function requiring parameter inputs, so that a user could use, for example, the updated k gain number without waiting for a new release (if that would happen). If no .yaml file is specified in the input, the code just uses Ell's parameter retrieval functions, which could have numbers in them, one of which is selected based on the timestamp in the ext_hdr of the input dataset. @mygouf, what do you think about this? None of these parameters are export-controlled, and they are already exposed in the II&T code, which is open-source.
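The optional-config-file pattern proposed here might look roughly like this. JSON is used only to keep the sketch stdlib-only (the proposal was YAML; the pattern is identical), and the function name, keys, and default value are all invented for illustration.

```python
import json
import os

# Made-up built-in defaults; in the proposal these would instead come from
# Ell's parameter retrieval functions, keyed on the ext_hdr timestamp.
DEFAULT_PARAMS = {"k_gain": 8.7}

def get_detector_params(config_path=None):
    """Return detector parameters, letting a user-supplied file override.

    If config_path is given and exists, its entries override the defaults;
    otherwise the built-in defaults are used unchanged.
    """
    if config_path is not None and os.path.exists(config_path):
        with open(config_path) as f:
            user_params = json.load(f)
        return {**DEFAULT_PARAMS, **user_params}  # user keys win
    return dict(DEFAULT_PARAMS)
```

The key property is that a user running a frozen Docker image could still point the function at a file on their own machine, while the no-argument call reproduces the release defaults.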
This sounds like a case where we are still analyzing the data as a user, not doing the final processing of the data at the CTC. Any user can modify their version of the pipeline to change the value of the k gain (modifying a python file is the same as modifying a yaml file). When we are happy with a number, we would just do a pipeline re-release, and the new version would be deployed via Docker at the CTC, which will do the official processing. I don't see us changing the default CTC reduction on the fly -- that would be less than ideal, and prone to mistakes.
With .yaml input, the benefit is that a user can specify a custom file on their own computer as an input to the Docker code (whereas they can't change a source-code .yaml file or script once the code is in Docker). But it sounds like you're saying that everyday users will never be in the situation where they are stuck with a Docker image only? Maybe I am misunderstanding the purpose of the Docker release. So the CTC will be using Docker? And I guess the point is that we don't want the CTC to be able to change any parameters, since what they're given is supposed to be correct?
Yes users can just clone this repository!
The CTC has expressed some interest in Docker, but I personally am not assuming that we won't need to tweak things once we have real on-sky data....
(I may not have the same assumptions as CTC...)
Docker would only be for the CTC (if they choose to go down that route). Everyday users will run the DRP by cloning the GitHub repo or installing via pip/PyPI.
Max, Vanessa, Marie, and I chatted about the yaml file/python function decision. We think the best option may be a third option: use the calibration file architecture for these detector properties that may change slowly over the years. This is basically our version of the yaml files, but it doesn't require us to set up a different tracking system (detector properties will just be another calibration file we can generate as needed). The only con is that they are a bit more difficult to modify than yaml files, but the plan will be to write a function that generates these calibration files, so you don't have to format a FITS file each time. Let's do this on a separate branch/PR though.
@semaphoreP I think it's ready to merge!
This pull request adds a step that does both cosmic ray flagging and saturation flagging.
New code:
- `detect_cosmic_rays()` takes an `input_dataset` to process, the multiplication factors for the full-well capacity that determine the threshold for saturation (`sat_thresh`, defaults to 0.99) and for cosmic ray plateaus (`plat_thresh`, defaults to 0.85), and the minimum length of the cosmic ray plateau that it will detect (`cosm_filter`, defaults to 2).
- The full-well capacity used for thresholding is the minimum of [`fwc_em`, `fwc_pp * em_gain`]. Currently asserts that these values must be the same for all the frames in the dataset.
- Saves the saturation threshold (the FWC scaled by `sat_thresh`) as `'SAT_FWC'` in the frame header. Probably need to specify units somehow, or maybe this isn't important to save?
- Assumes `'EM_GAIN'`, `'FWC_EM'`, and `'FWC_PP'` will be available as keywords in the frame header.
- Uses `flag_cosmics()` from the II&T code to flag cosmic rays.
- Uses `find_plateaus()` from the II&T code to identify CR plateaus in individual rows.
- Adds `create_cr_dataset()`, which creates mock data with a specified number of CR hits distributed randomly in the data, modeled as a high plateau followed by an exponential decay tail. The user can also specify the EM gain, EM full well capacity, and "pp" full well capacity (i.e. the FWC before the EM gain), which will be stored in the header.

Example:
Mock data with CRs:
![image](https://github.com/roman-corgi/corgidrp/assets/74611610/aeb0a550-48bd-4b35-bca3-e28cb4e9fbc1)
Data after saturation and CR flagging:
*Note: the DQ values look lower than they should be in the image b/c of matplotlib interpolation, but they're correct in the array!
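The saturation-thresholding part of the step above can be sketched in a few lines, assuming the threshold is `sat_thresh` times the lesser of `fwc_em` and `fwc_pp * em_gain` as listed in the description (the plateau-finding logic lives in `flag_cosmics()`/`find_plateaus()` and is not reproduced here). The frame values and FWC numbers in the usage example are made up.

```python
import numpy as np

def saturation_threshold(fwc_em, fwc_pp, em_gain, sat_thresh=0.99):
    """Effective full well capacity scaled by the saturation factor."""
    return sat_thresh * min(fwc_em, fwc_pp * em_gain)

def flag_saturated(frame, fwc_em, fwc_pp, em_gain, sat_thresh=0.99):
    """Boolean mask of pixels at or above the saturation threshold."""
    return frame >= saturation_threshold(fwc_em, fwc_pp, em_gain, sat_thresh)

# Usage with invented numbers: min(100000, 90 * 1000) * 0.99 = 89100 DN,
# so only the 99000 DN pixel gets flagged.
frame = np.array([[100.0, 99000.0], [50.0, 200.0]])
mask = flag_saturated(frame, fwc_em=100000.0, fwc_pp=90.0, em_gain=1000.0)
```

Pixels flagged this way would then be OR'd into the DQ array rather than overwriting it.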
Questions:
- Is it safe to assume that `'EM_GAIN'`, `'FWC_EM'`, and `'FWC_PP'` will be provided in the frame header?