roman-corgi / corgidrp

Data Reduction Pipeline for the Roman Coronagraph Instrument
BSD 3-Clause "New" or "Revised" License

Kgain/NL #116

Closed. hsergi closed this 3 months ago

hsergi commented 5 months ago

Porting Kgain and NL calibration functions to CORGI DRP, together with unit tests.

I have used CORGI DRP native data types, including the new class KGain. The original (internal) calibration operations are performed with numpy arrays.

All unit tests pass.
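For orientation, here is a minimal sketch of how such a calibration product might be instantiated, following the pattern of other corgidrp data classes; the exact KGain constructor signature is an assumption here, not taken from this PR:

import numpy as np
from astropy.io import fits
from corgidrp.data import KGain

# Hypothetical sketch: wrap a measured k gain (DN -> electrons) in the new class
prihdr = fits.Header()
exthdr = fits.Header()
exthdr['BUNIT'] = 'detected electrons/DN'  # placeholder header content
kgain = KGain(np.array([[8.7]]), pri_hdr=prihdr, ext_hdr=exthdr)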

hsergi commented 5 months ago

It is necessary to add FluxMap1024.fits to tests/test_data in order to run the unit tests. It is a pupil image:

[Image: Flux1024_image]

Zipped FITS file:

Uploading FluxMap1024.fits.zip…

It is 8 MB.

JuergenSchreiber commented 4 months ago

It is necessary to add FluxMap1024.fits to tests/test_data in order to run the unit tests. It is a pupil image:

[Image: Flux1024_image]

Zipped FITS file:

Uploading FluxMap1024.fits.zip…

It is 8 MB.

Is it really necessary to upload a new file to test_data, or could the existing test data, such as example_L1_input.fits, be used instead?

hsergi commented 4 months ago

The file has been added (see Jason's changes). This particular conversation is closed.

hsergi commented 4 months ago
• The two main calibration functions should be reformatted to accept a single Dataset as input. This will require the function to sort the files into the previous stack_arr (and stack_arr2 where appropriate). This should mostly just be header keyword sorting.

@maxwellmb Is there any specific example? I have some difficulty understanding this point. In IIT the variable was a stack of stacks of ndarrays, and initially Kevin and I did a stack of stacks of Dataset (see make_frame and stack). What's undesirable about this option?

maxwellmb commented 4 months ago
• The two main calibration functions should be reformatted to accept a single Dataset as input. This will require the function to sort the files into the previous stack_arr (and stack_arr2 where appropriate). This should mostly just be header keyword sorting.

@maxwellmb Is there any specific example? I have some difficulty understanding this point. In IIT the variable was a stack of stacks of ndarrays, and initially Kevin and I did a stack of stacks of Dataset (see make_frame and stack). What's undesirable about this option?

We don't yet have an example of this because these calibration functions are the first place we're encountering it. It's undesirable because it requires the definition and maintenance of a new type of data class, when we'll ultimately have to have the code to sort the files somewhere anyway (so we might as well put it here). It also doesn't conform to the standard dataset-in -> dataset-out function structure, which makes it harder (impossible?) to write recipes for.
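To make the target structure concrete, here is a minimal skeleton of the dataset-in -> dataset-out convention being described; everything below is hypothetical, not the PR code:

import numpy as np
from corgidrp.data import KGain

def calibrate_kgain(input_dataset):
    # Hypothetical skeleton: one Dataset in, one calibration product out
    dataset = input_dataset.copy()  # never mutate the caller's data
    # 1) sort dataset.frames into stack_arr (and stack_arr2) via header keywords
    # 2) run the existing numpy-based calibration on those stacks
    kgain_value = 8.7  # placeholder result
    return KGain(np.array([[kgain_value]]),
                 pri_hdr=dataset.frames[0].pri_hdr.copy(),
                 ext_hdr=dataset.frames[0].ext_hdr.copy())

A recipe can then chain such steps, because every step consumes and produces the same kinds of objects.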

hsergi commented 4 months ago
• The two main calibration functions should be reformatted to accept a single Dataset as input. This will require the function to sort the files into the previous stack_arr (and stack_arr2 where appropriate). This should mostly just be header keyword sorting.

@maxwellmb Is there any specific example? I have some difficulty understanding this point. In IIT the variable was a stack of stacks of ndarrays, and initially Kevin and I did a stack of stacks of Dataset (see make_frame and stack). What's undesirable about this option?

We don't yet have an example of this because these calibration functions are the first place we're encountering it. It's undesirable because it requires the definition and maintenance of a new type of data class, when we'll ultimately have to have the code to sort the files somewhere anyway (so we might as well put it here). It also doesn't conform to the standard dataset-in -> dataset-out function structure, which makes it harder (impossible?) to write recipes for.

I would appreciate it if someone could provide a first example: either @maxwellmb, @kjl0025, or someone more versed than I am in corgidrp Dataset usage.

hsergi commented 4 months ago

@semaphoreP @maxwellmb "The tests should be reformatted into pytests and be a bit more organized so that it's easier to follow, in particular the main functionality testing." Is there any specific example in the pipeline that has been recently merged that can be used as guidance?

maxwellmb commented 4 months ago

@semaphoreP @maxwellmb "The tests should be reformatted into pytests and be a bit more organized so that it's easier to follow, in particular the main functionality testing." Is there any specific example in the pipeline that has been recently merged that can be used as guidance?

You could look at basically all the existing tests for guidance. In terms of converting over the unittests, perhaps @kjl0025 has gotten started on doing this for his PR?
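For what it's worth, the conversion is mostly mechanical; a generic sketch (not the actual kgain tests):

# Before: unittest style
import unittest

class TestKGainBasics(unittest.TestCase):
    def test_gain_is_positive(self):
        self.assertGreater(8.7, 0)

# After: pytest style -- a plain module-level function with a bare assert,
# which pytest's discovery picks up automatically (no class or runner needed)
def test_gain_is_positive():
    assert 8.7 > 0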

maxwellmb commented 4 months ago
• The two main calibration functions should be reformatted to accept a single Dataset as input. This will require the function to sort the files into the previous stack_arr (and stack_arr2 where appropriate). This should mostly just be header keyword sorting.

@maxwellmb Is there any specific example? I have some difficulty understanding this point. In IIT the variable was a stack of stacks of ndarrays, and initially Kevin and I did a stack of stacks of Dataset (see make_frame and stack). What's undesirable about this option?

We don't yet have an example of this because these calibration functions are the first place we're encountering it. It's undesirable because it requires the definition and maintenance of a new type of data class, when we'll ultimately have to have the code to sort the files somewhere anyway (so we might as well put it here). It also doesn't conform to the standard dataset-in -> dataset-out function structure, which makes it harder (impossible?) to write recipes for.

I would appreciate it if someone could provide a first example: either @maxwellmb, @kjl0025, or someone more versed than I am in corgidrp Dataset usage.

I'm not sure this will actually work, but here's something to give you inspiration. Let's say you needed to sort your files by exposure time:

import numpy as np

# Grab the exposure times (as an array, so the boolean masks below work)
exposure_times = np.array([frame[0].header['EXPTIME'] for frame in dataset.frames])
# Find the unique ones
unique_exposure_times = np.unique(exposure_times)
# Make a list of data, one stack per unique exposure time
stacked_array = [dataset.all_data[exposure_times == exp_time] for exp_time in unique_exposure_times]

You can use different logic to split up the dataset object how you please, but this is an example of how you might do that.
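For instance, a hypothetical extension of the same idea that also groups on the commanded EM gain (the keyword name 'CMDGAIN' is an assumption) to build something like stack_arr2:

import numpy as np

# Grab the exposure times and commanded gains ('CMDGAIN' is assumed)
exposure_times = np.array([frame[0].header['EXPTIME'] for frame in dataset.frames])
cmd_gains = np.array([frame[0].header['CMDGAIN'] for frame in dataset.frames])
# One sub-stack per (commanded gain, exposure time) combination
stacked_array2 = [dataset.all_data[(cmd_gains == g) & (exposure_times == t)]
                  for g in np.unique(cmd_gains)
                  for t in np.unique(exposure_times)]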

hsergi commented 4 months ago
• The two main calibration functions should be reformatted to accept a single Dataset as input. This will require the function to sort the files into the previous stack_arr (and stack_arr2 where appropriate). This should mostly just be header keyword sorting.

@maxwellmb Is there any specific example? I have some difficulty understanding this point. In IIT the variable was a stack of stacks of ndarrays, and initially Kevin and I did a stack of stacks of Dataset (see make_frame and stack). What's undesirable about this option?

We don't yet have an example of this because these calibration functions are the first place we're encountering it. It's undesirable because it requires the definition and maintenance of a new type of data class, when we'll ultimately have to have the code to sort the files somewhere anyway (so we might as well put it here). It also doesn't conform to the standard dataset-in -> dataset-out function structure, which makes it harder (impossible?) to write recipes for.

I would appreciate it if someone could provide a first example: either @maxwellmb, @kjl0025, or someone more versed than I am in corgidrp Dataset usage.

I'm not sure this will actually work, but here's something to give you inspiration. Let's say you needed to sort your files by exposure time:

import numpy as np

# Grab the exposure times (as an array, so the boolean masks below work)
exposure_times = np.array([frame[0].header['EXPTIME'] for frame in dataset.frames])
# Find the unique ones
unique_exposure_times = np.unique(exposure_times)
# Make a list of data, one stack per unique exposure time
stacked_array = [dataset.all_data[exposure_times == exp_time] for exp_time in unique_exposure_times]

You can use different logic to split up the dataset object how you please, but this is an example of how you might do that.

@semaphoreP: thanks for volunteering in today's meeting to provide template code that can be used to deal with stack_array and stack_array2 in both Kgain and Nonlinearity, as an alternative to the approach I took of defining a stack of (stacks of) Dataset Images.

hsergi commented 4 months ago

"The kgain_params and nonlin_params should be mostly kwargs to the calibration functions and if we think they'll be constants we can make kgain_params and nonlin_params in their respective modules (calibrate_kgain.py and calibrate_nonlin.py)."

@maxwellmb, @JuergenSchreiber This is done, tested, and committed for both kgain and nonlinearity.
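For the record, a sketch of the pattern (the parameter names and values below are placeholders, not the actual contents of kgain_params):

# In calibrate_kgain.py: module-level defaults that callers can override
kgain_params_default = {
    'rowroi1': 9,     # placeholder ROI bounds, not the real defaults
    'rowroi2': 1000,
}

def calibrate_kgain(input_dataset, **kgain_params):
    # Merge caller overrides onto the module-level constants
    params = {**kgain_params_default, **kgain_params}
    # ... the calibration then reads params['rowroi1'], etc. ...
    return params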

JuergenSchreiber commented 4 months ago

I have updated the docstrings to our flake conventions and converted the unittests to pytests. The kgain tests are not running, since calibrate_kgain calls a function that is never imported: count_contiguous_repeats(exp_time_loop)

hsergi commented 4 months ago

@mygouf, @maxwellmb, @semaphoreP: Unit tests are running (not all passing, but that's fine right now). I'd appreciate it if @JuergenSchreiber can deal with the output of calibrate_kgain.py, and if Jason writes the Dataset object for stack_array and stack_array2 for both calibration functions, calibrate_kgain and calibrate_nonlin (bear in mind, as Max detailed in some of the comments above, that there is a bunch of information, like exposure times, that will be retrieved from that Dataset). I think I can deal with the rest of the tasks.

I've also resolved the conversations in the PR that are done

semaphoreP commented 4 months ago

While I wrote the dataset helper function in the most recent PR (#131), I think it's most efficient if Sergi or Juergen implements the dataset stacks here. You can look at the test function I wrote in PR #131 for how to call it, but you two are probably much more familiar with what stacks this code needs to be broken up into.
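For anyone picking this up, a hypothetical illustration of the call (the helper name split_dataset and its signature are assumptions, not copied from PR #131; check that PR's test function for the real interface):

# Assumed interface: split a Dataset into sub-datasets that share a value
# of an extension-header keyword, then stack each sub-dataset's data
sub_datasets, unique_vals = dataset.split_dataset(exthdr_keywords=['EXPTIME'])
stack_arr = [sub.all_data for sub in sub_datasets]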

hsergi commented 4 months ago

EDITED: For mock data, I can use parameters freely. I do not need to keep track of them in the Dataset, except for those needed by the calibration functions (commanded EM gain and exposure times, for instance).

In any case, I was curious about the version control of header keywords, so that if new ones are defined, as will happen in the output of the calibration functions, they are tracked in some document or other record.

Is there a document/file that keeps track of newly defined keywords? (The default ones are in https://collaboration.ipac.caltech.edu/display/romancoronagraph/ExCam+image+header+keywords)

hsergi commented 4 months ago

[Comment got posted. I'm reviewing the 100+ posts for the comments about it. One minute]

@maxwellmb I have to implement the single-object return in calibrate_nonlin.py. I asked a couple of questions on Slack. There are many comments so far in this PR. I'd like to confirm the steps for the implementation:

calibrate_nonlin.py currently returns 4 variables:

• headers (np.array): 1-D array of headers used to build the CSV lines. Its length is equal to the number of columns in nonlin_arr and is one greater than the length of actual_gain_arr.
• nonlin_arr (np.array): 2-D array with nonlinearity values for input signal levels (DN) in rows and EM gain values in columns. The input signal in DN is the first column. Signal values start at min_write and run through max_write in steps of 20 DN.
• csv_lines (list): list of strings containing the contents of headers and nonlin_arr.
• means_min_max (np.array): minima and maxima of the mean values (in DN) used in the fit for each EM gain. The size of means_min_max is N x 2, where N is the length of actual_gain_arr.

Is the output nonlin_arr the same as nonlin_raw in the NonLinearity class?

Should I add headers, csv_lines and means_min_max as new data arrays?

I found this in the link above: "you should put those outputs into the KGain object header and, where appropriate (e.g. for an array of values), add an extension HDU for arrays, like the PTC (including useful header keywords about the content of the extension, and a name, per the developer guidelines)."

hsergi commented 4 months ago

Summary of the PR above (I may have missed something)

To-do

1/ Max’s comment in the PR: I think (but I'm not 100% sure) that nonlin_tableTVAC.txt might be able to replace the existing nonlin_sample.csv and associated FITS file. This is a nice-to-have, not a full requirement, in order to keep the number of test_data files as small as possible: https://github.com/roman-corgi/corgidrp/pull/116#pullrequestreview-2163979641

2/ Sergi. Return single object for Non-linearity. Confirmation/questions for Max: https://github.com/roman-corgi/corgidrp/pull/116#issuecomment-2243517984

3/ Sergi. Check plots/results

Future work?

1/ actual_gain_array: https://github.com/roman-corgi/corgidrp/pull/116#discussion_r1669132879

2/ ROI regions: A future PR (or possibly this one) could implement the use of detector_areas and slice_section, tools in detector.py, for that, instead of relying on the offset_roi parameters. (Kevin): https://github.com/roman-corgi/corgidrp/pull/116#discussion_r1674701793

3/ (Kgain error is now part of KGain. Not sure this is necessary any longer.) An error for the k gain could be obtained and put into an err of the NonLinearityCalibration class instantiation that makes the calibration product. That would require more than a mere port of this code, but I don't think it would be too hard to implement. I didn't know how precise we wanted to be with that. Perhaps that could be made a GitHub issue as a possible enhancement?: https://github.com/roman-corgi/corgidrp/pull/116#discussion_r1651599405

4/ Add readout noise estimation to KGain: https://github.com/roman-corgi/corgidrp/issues/135

semaphoreP commented 4 months ago

Hi Sergi, as mentioned on Slack, the input to NonLinearityCalibration is the np.vstack((header, nonlin_arr)).
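Based on that, a minimal sketch of assembling the product (only the np.vstack input is confirmed above; the constructor keywords follow the pattern of other corgidrp data classes and are otherwise an assumption):

import numpy as np
from corgidrp.data import NonLinearityCalibration

def build_nonlin_product(headers, nonlin_arr, pri_hdr, ext_hdr):
    # Confirmed above: the primary data is the headers stacked on the table
    nonlin_data = np.vstack((headers, nonlin_arr))
    # Extra outputs (e.g. means_min_max) would go into the object header or
    # a named extension HDU, per the guidance quoted earlier in this thread
    return NonLinearityCalibration(nonlin_data, pri_hdr=pri_hdr, ext_hdr=ext_hdr)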

hsergi commented 4 months ago

New comment from @maxwellmb: "One useful thing to do for both K-gain and NL would be to define in the docstring detailed info about the expectations on the input data. For example: what previous steps have been run? Is it L1 or L2a, or is it L2a but with a skipped step? You could take a look at Kevin's docstring for the NoiseMaps."
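A hypothetical sketch of what that docstring addition could look like (the data level and step descriptions below are placeholders, not a statement of the actual requirements):

def calibrate_nonlin(input_dataset):
    """Compute the non-linearity calibration.

    Args:
        input_dataset (corgidrp.data.Dataset): frames taken over a range of
            exposure times and commanded EM gains. Placeholder wording: the
            required data level (L1 vs. L2a) and any upstream steps that must
            have been run or skipped still need to be stated here.
    """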

maxwellmb commented 4 months ago

@kjl0025 I think you need to approve here too, since at some point you "Requested changes".

kjl0025 commented 3 months ago

And the nonlin code does not have an err output yet, but I believe Sergi made an issue for that for a future pull request (and it is also covered in Issue #124).