Flagging HAP sources in CR-contaminated images apparently is not working

stscijgbot-hstdp commented 7 months ago

Issue HLA-1213 was created on JIRA by Rick White:

Dithering in observations with the detectors affected by cosmic rays (e.g., ACS/WFC and WFC3/UVIS) often leaves strips around the edge of the image where only a single exposure exists. The HAP catalogs contain CRs in those regions.

The hla_nexp_flags function does some image processing using an image that contains the count of images contributing to each pixel. For sources in regions with a small number of contributing images (typically 0 or 1), a Flag bit of 64 is set to indicate that the sources are not reliable.

Apparently that flagging is not working correctly. There are many cases of HAP images with catalog sources that are concentrated around the edges of the images (or sometimes in the chip-gap region). Those sources ought to be flagged but are not. An example is hst_11359_20_acs_wfc_jb6o20. It has catalogs that look good in the middle of the image, but a large number of spurious CR sources are found on the edges. More details about this image are in the comment below.

-It is not obvious what the problem is.- But for some reason this visit (and all similar visits) are not having CRs flagged correctly.

See comment below for a solution that I think is very likely to fix the problem!

stscijgbot-hstdp commented 7 months ago

Comment by Rick White on JIRA:

The hla_nexp_flags function prints a message to the trailer that, indicating that it did in fact run. However, no sources are found to flag. For visit hst_11359_20_acs_wfc_jb6o20 it finds zero sources to flag in the two filters:

Flagging sources from regions observed with only a small number of exposures.
FLAGGING 0 OF 3696 SOURCES
FLAGGING 0 OF 1218 SOURCES

It is very obvious from looking at the catalogs and the exposures that there are cosmic rays on the edges. The attached figure shows shows the positions of the 3 exposures in each of the two filters (f658n and f814w). !hst_11359_20_acs_wfc_total_jb6o20-src.png|width=100%! Most of the segment catalog sources (red dots) are located on the image edges where there is a single exposure. An examination of the image shows that almost all of those sources are CRs.

stscijgbot-hstdp commented 7 months ago

Comment by Rick White on JIRA:

Here is a random idea for what could be causing the problem. The nexp image is created by counting the non-zero pixels in the various drizzled images. Here is the code line in hap_flag_filter.py.

Possibly those images have NaN values rather than zeros in the pixels that are off the edge of the image? Then this would also just have the total sum of the number of images rather than the number of images actually used.

I don't know enough about the state of the drizzle products at that point of the code, so apologies if this is completely off the track.

stscijgbot-hstdp commented 7 months ago

Comment by Rick White on JIRA:

Aha! I'm almost sure that is the problem! The drizzled images in the product directory have nan values in blank regions. And this line:

comp_drz_data = (fits.getdata(comp_drz_img) != 0).astype(numpy.int32)

... returns True for all nan values!

Changing that line to something like this will fix the problem:

imdata = fits.getdata(comp_drz_img)
comp_drz_data = (np.isfinite(imdata) & (imdata != 0)).astype(numpy.int32)

Possibly you will want to delete imdata after you use it to ensure the FITS file is closed.

stscijgbot-hstdp commented 7 months ago

Comment by Rick White on JIRA:

I found one more place in hla_flag_filters.py that makes this mistake of using drz_image != 0 to find blank pixels.

In make_mask_array() line 1635 it says:

mask = fits.open(drz_image)[1].data != 0

That is incorrect and should check for NaN values as well. Here's a suggested bit of code:

with fits.open(drz_image) as fh:
     mask = np.isfinite(fh[1].data) & (fh[1].data != 0)

There are other ways to code this -- the important thing is to add np.isfinite to the test.

This function gets called from hap_sequencer.py:run_source_list_flagging() on line 857. It gets passed to the hla_flag_filter.py:run_source_list_flagging() function as the hla_flag_msk parameter, and from there to hla_nexp_flags as the mask_data parameter. I don't think it actually changes the calculation but it is probably a good idea to fix the code anyway.

stscijgbot-hstdp commented 4 months ago

Comment by Rick White on JIRA:

Michele De La Pena

Seems like this particular ticket has been lost in the shuffle. This is pretty important and I think will be easy to fix. I could generate a pull request for the changes if you thought that would be useful.

stscijgbot-hstdp commented 3 months ago

Comment by Michele De La Pena on JIRA:

Rick White I implemented the suggested changes provided by Rick, and I thought they were not resolving the issue. However, this was incorrect. Rick's suggested changes did resolve the problem in that the final TOTAL detection catalogs have rows removed which corresponded to sources flagged with the value of 64 (False Detections Near Image Edge).

Here is what is happening:

No filtering is done for detecting sources on the total or detection image, so many cosmic rays can be falsely identified as sources, particularly when there are multiple constituent images which do not entirely overlap (i.e., there are regions where only one image is contributing to that portion of the footprint). It is only once the “measurements” have been made which are performed on the filtered drizzled images that quality checking is done and "sources" may be flagged with the value of 64 (False Detections Near Image Edge) in addition to other bit-encoded values. However, these sources are currently not removed from the filter catalogs, as one can see there are sources in the filter catalogs with flag values of 80 or 88.

However, these sources {}do get filtered from the “final” total catalogs{}. The rows where the Flags_ column(s) contain a value where the bit representing 64 is set, or where the value has been filled with -9999.9 are removed. This means the total catalogs can have fewer rows than the filter catalogs.

I am troubled that the identification stage can find so many faux sources particularly with the segmentation algorithm. For the dataset in this ticket, the segmentation catalog is using the RickerWavelet kernel in the source identification process. Ideally, not so many faux sources should be found in the first place, but this is the subject for a separate ticket.

If @⁣rlw is happy with the current result, we can push the software and proceed to close this ticket. However, I wonder if the output filter catalogs should be cleaned up, similarly to the total catalog. If there are multiple filter catalogs, should the source with the 64 flag be removed from ALL filter catalogs or just the filter catalog where it is flagged with 64 (if somehow a different filter catalog actually obtains a measurement)? I just do not want users to be confused.

stscijgbot-hstdp commented 3 months ago

Comment by Rick White on JIRA:

Michele De La Pena Thanks for the thorough summary. I agree that it is somewhat strange to have more sources in the filter catalogs than in the total catalog. But I'm not inclined to change the current approach for 2 reasons:

There could be scientific value in some of these measurements in the filter catalogs. In the regions where there is CR contamination, most of the sources are bad (= CRs) but some of the sources are actually OK. If users have some way to pick out the good sources (e.g., by matching to some other observation) then the measurements can still be useful. For most purposes we will simply discard them, but occasionally it will be good to have them available.
It will be more complicated to change it!

So I am happy with your current approach and think you should proceed to push the software and close the ticket.

stscijgbot-hstdp commented 3 months ago

Comment by Michele De La Pena on JIRA:

Closing this ticket as the code has been merged (PR#⁠1820). A new ticket has been opened as there were very many CRs found in the single image region. (HLA-1272).

spacetelescope / drizzlepac

Flagging HAP sources in CR-contaminated images apparently is not working #1744