Investigate strange blobby source distribution in some HAP point catalogs

spacetelescope / drizzlepac

AstroDrizzle for HST images.

https://drizzlepac.readthedocs.io

BSD 3-Clause "New" or "Revised" License

50 stars 39 forks source link

Investigate strange blobby source distribution in some HAP point catalogs #1773

Open stscijgbot-hstdp opened 5 months ago

stscijgbot-hstdp commented 5 months ago

Issue HLA-1240 was created on JIRA by Rick White:

There is a fairly rare failure mode where the HAP point catalog source detection goes crazy and generates big blobs with thousands of spurious sources scattered around the image. I don't know what triggers this, but my guess is that something has gone wrong with the background estimation. That could lead to this kind of structure.

Not a high priority to fix since it is rare, but if the problem were identified and fixed, it would probably lead to quality improvements in other catalogs that are not so obviously bad.

A sample image is appended below in the attachments and comments. An additional comment lists 16 datasets that are known to have the problem. The problem has only been found in ACS/WFC catalogs. It is likely that at least 125 ACS/WFC visits have the problem (about 0.8% of the ACS/WFC HAP-SVM visits).

stscijgbot-hstdp commented 5 months ago

Comment by Rick White on JIRA:

An example image hst_13057_12_as_wfc_jc3f12 is shown. The image itself looks fine: it is a crowded field with a lot of saturated stars, which may be a clue, but is not bad. The segment catalog looks good. The point catalog is crazy. The image below shows the HAP-SVM image with the point sources overplotted in red. Sources flagged as bad have been removed. The big patches of sources are clearly wrong.

!hst_13057_12_as_wfc_jc3f12-point.png|width=70%!

stscijgbot-hstdp commented 5 months ago

Comment by Rick White on JIRA:

I have identified multiple additional datasets with this problem. So far I have only seen this issue in ACS/WFC datasets. We do not use the other ACS detectors in the Hubble Source Catalog, but it is likely that if the problem occurred for WFC3/UVIS or WFC3/IR, it would have been discovered. So it is probable that this is an ACS-only problem. (No WFPC2 catalogs have been examined.)

I estimate there are at least 125 such datasets in the sample we are using for the HSC. There are a total of about 16,400 ACS/WFC visits in the sample, so approximately 0.8% of the HAP ACS/WFC catalogs probably have the issue.

The table lists 16 ACS visits that are known to have bad, blobby point catalogs. The table includes the number of good sources (according to the flags) in the HAP source lists and a link to the HLA interactive display that overlays the point catalog.

||datasetname||segmentCatNobj||pointCatNobj||display|| | hst_10775_19_acs_wfc_total_j9l919 | 4164 | 20361 | display | | hst_10775_47_acs_wfc_total_j9l947 | 3453 | 7951 | display | | hst_12586_17_acs_wfc_total_jbts17 | 2017 | 7378 | display | | hst_13057_10_acs_wfc_total_jc3f10 | 2406 | 11789 | display | | hst_13057_12_acs_wfc_total_jc3f12 | 3145 | 15025 | display | | hst_13057_21_acs_wfc_total_jc3f21 | 2399 | 7257 | display | | hst_13057_39_acs_wfc_total_jc3f39 | 2495 | 13203 | display | | hst_13057_a5_acs_wfc_total_jc3fa5 | 2588 | 14094 | display | | hst_13463_09_acs_wfc_total_jc8x09 | 2093 | 12420 | display | | hst_13463_11_acs_wfc_total_jc8x11 | 3791 | 26904 | display | | hst_13463_30_acs_wfc_total_jc8x30 | 1943 | 15342 | display | | hst_13463_40_acs_wfc_total_jc8x40 | 2687 | 21429 | display | | hst_13463_42_acs_wfc_total_jc8x42 | 3333 | 15304 | display | | hst_13463_59_acs_wfc_total_jc8x59 | 2321 | 10742 | display | | hst_9700_01_acs_wfc_total_j8kr01 | 164 | 321 | display | | hst_9750_19_acs_wfc_total_j8q619 | 3688 | 13247 | display |

stscijgbot-hstdp commented 1 month ago

Comment by Michele De La Pena on JIRA:

Confirmed this issue still exists even with all the improvements/changes which have been implemented thus far (3.7.1rc5). As stated in this ticket, this looks to be a Point catalog problem as this is not observed in the Segment catalog. However, both algorithms utilize the same background, though the Point algorithm may alter the backround or threshold above which signal could be considered sources.

stscijgbot-hstdp commented 1 month ago

Comment by Rick White on JIRA:

Michele De La Pena Thanks for the update! I suspect that the segment catalog is not affected by the bad background because it uses the RickerWavelet2D kernel. That does a local subtraction of the background (if there is a non-zero background after the sky subtraction). So the segment catalog is able to recover from whatever has gone wrong with the background.

If I'm right, you will see that the segment catalog for the bad field uses the RW2D kernel instead of the Gaussian kernel.

stscijgbot-hstdp commented 1 month ago

Comment by Michele De La Pena on JIRA:

Rick White This dataset, hst_13057_12, uses a background image as determined from the photutils.background.Background2D algorithm. The median background value is ~1.5 and the median RMS background is ~10.43. This background image is used for both the point and segmentation algorithms. It has been observed that the background image has structure which maps to the blobby regions of source over-detection for the point catalogs. The background image is scaled here to emphasize the different regions.

!background_image.png!

Why is the blobby source detection seen in the point catalog but not in the segmentation catalog as both algorithms use a background-subtracted image in the detection process?

The point catalog used the "psf" starfinder algorithm to detect sources which employs theoretical PSFs to identify point sources. This algorithm is used for all detectors, though there are two other options which can be set in the configuration file: "dao" and "iraf". In the "psf" case (1) drz/drc image is background-subtracted (2) background-subtracted image is scaled by a weight mask (Note: The weight masks, of which there is typically only one are supposed to provide a localized weighting scheme for the data.) (3) the (background_RMS * scale_factors) is subtracted from the result of item (2) At this stage the resultant image from (3) has only the voids in the original background image left for source detection!!! The algorithm sets all negative values to zero of which basically all the black regions are negative so only the "blobby" regions are remaining for source detection. See the attached images for Step (2) and Step (3).

!background-subtractedAndScaled.png!

!ImageUsedForPointSourceDetection.png!

In contrast, the segmentation catalog used the Gaussian filter kernel to detect its segments/sources. In this case: (1) a threshold image is computed based on (nsigma * background_RMS) (Note: In this case the image is a constant f ~16.5 for the illuminated portion of the drc/drz.) (2) drz/drc image is background-subtracted (3) the background-subtracted image is convolved with the Gaussian kernel (4) Setting the scaling of the convolved image to match that of the point images, one can see the imprint of the background image, but to a lesser extent than the corresponding image from the point algorithm (3). Negative values are not clipped to zero. The convolved image and threshold are used for source detection. See the image associated with Step (4).

!SegmentationConvolvedImage.png!

stscijgbot-hstdp commented 1 month ago

Comment by Rick White on JIRA:

Thanks for all the info, Michele De La Pena! I see that I was wrong about the RickerWavelet kernel being used for the segment catalog. I think perhaps the convolution by the Gaussian kernel reduces the effect of the bad background (as you suggest). But it may be that the new segment catalog, which ought to be deeper than the old one, also shows some additional sources in the low-background regions? It would be worth plotting the positions of the segment sources to check that. The old segment catalog had a very high detection threshold compared with the point catalog, and I think that might be why it looked relatively uniform.

It seems like regardless of the difference between the segment and point catalogs, there must be something wrong with the background. The blobs of point sources are found in the background holes, and the background peaks create empty regions. To fix this problem, we will need to fix the background.

I'm looking at the code where Background2D is called. It loops over values for the exclude_percentile. Unless an exception occurs (I think catching any Exception is bad -- probably it intends to catch this ValueError), it uses the first value of percentile = 10%. That exception occurs only when every box has more than 10% rejected pixels.

I see from the Percentile in use log message that that did not happen in the data in the ops cache. It said it always used percentile=10. I guess it is worth looking to see whether the same is true in the current version of the processing (probably it will be the same).

Possibly the box_size and win_size parameters could be to blame? According to the log file box_size=27 and win_size=3. That win_size (filter_size in Background2D) is the default value, but does seem small. Exploring all the values for these parameters (and the other Background2D parameters) looks difficult though.

stscijgbot-hstdp commented 1 month ago

Comment by Michele De La Pena on JIRA:

Rick White I agree the actual problem is in the background determination. We have seen these background rosettes previously and did not care for them very much! In some ways a good background determination seems to be the hardest part of this problem.

stscijgbot-hstdp commented 1 month ago

Comment by Michele De La Pena on JIRA:

For this type of image, the simple statistics background determination works better. The background2D technique is supposed to accommodate images which have varying level of background signal across the image. Forcing the simple statistics background to be used via a configuration variable, the sources are well-distributed. I have put the point and segmentation ecsv files in my home directory for you on the Linux systems: /home/mdelapena/ForRick.

The problem(s) seem to come down to being able to choose the appropriate background determination technique to use, and/or when the background2D is in use, tune the parameters appropriately so we do not get rosettes. This is easily said, harder to do.

stscijgbot-hstdp commented 1 month ago

Comment by Rick White on JIRA:

Michele De La Pena OK, that is interesting! I looked at these catalogs, and don't see any sign of the blobby distribution. Both the point and segment catalogs look excellent. Maybe you were looking at an older version of the catalogs?

Here are the full-frame images. I used the old version of the total_drc.fits file but overlaid the new catalogs. Here is a interactive display link that was used. The 3 plots show the image, the point catalog (red) and the segment catalog (blue). !hst_13057_12_acs_wfc_total-new.jpg|width=32%! !hst_13057_12_acs_wfc_total-new-point.jpg|width=32%! !hst_13057_12_acs_wfc_total-new-seg.jpg|width=32%! Those all look very smooth! And if you zoom in on the middle of the image and blink the catalogs with the image, you can see that the catalogs (despite being very crowded) are of excellent quality, with every object on a visible star: !hst_13057_12_acs_wfc_total-zoom.gif|width=100%!

So at least for this image, the results look very good now.

stscijgbot-hstdp commented 1 month ago

Comment by Michele De La Pena on JIRA:

@⁣rlw To clarify, once I changed the background algorithm used, I got the nice, new, well-distributed source candidate results. In fact, I think we agree the results are very good. This was an experiment. The issue comes down to what I did not say very well in my comment on 02 July 2024 at 4:13 pm.

There are essentially two background determination techniques available: (1) the so-called sigma-clipped statistics where a single value is determined and used to represent the entire 2D background, or (2) the background2D algorithm from Photutils which is an empirically determined 2D surface.

There are two problems which should be addressed: (a) improve the criteria for the code to determine which background technique to use, and (b) better "tune" the Photutils background2D so rosettes are not produced.

stscijgbot-hstdp commented 1 month ago

Comment by Rick White on JIRA:

Michele De La Pena Do you still have the versions of the catalogs that used the 2D backgrounds? I'd like to see those too to see whether the segment catalog also shows blobs.

stscijgbot-hstdp commented 1 month ago

Comment by Michele De La Pena on JIRA:

Rick White I have put the catalogs generated with the latest version of Drizzlepac (v3.7.1rc5) where the blobby nature of the point catalog is apparent, but the segmentation catalog does not show any blobs to my eyes! The catalogs are in /home/mdelapena/ForRick/BLOBBY. Just as a reminder the "statistics" catalogs are still in /home/mdelapena/ForRick.

stscijgbot-hstdp commented 1 month ago

Comment by Rick White on JIRA:

Michele De La Pena Thank you. I confirm that the segment catalog does not look blobby with the Background2D, while the point catalog does.

Can I get the filter ecsv catalogs for both versions too? I'd like to look at the backgrounds and other properties for individual sources, but the total catalogs don't include most of those extra columns.

But here's something notable: the segment catalog avoids being blobby by having many fewer sources everywhere (both inside and outside the blobs) compared with the single-background version. In contrast, the point blobby-vs-single catalogs have very similar sources inside the blobs and have many fewer sources outside the blobs. Here are some blinking images showing the two catalogs: !hst_13057_12_acs_wfc_total-point-blobby.gif|width=49%! !hst_13057_12_acs_wfc_total-seg-blobby.gif|width=49%!

This makes me think about the impact of Background2D differently: (1) For the point catalog, the Background2D version has holes with reduced source counts in regions where the background is high, compared with the single-background version. (2) For the segment catalog, the Background2D version has fewer sources everywhere compared with the single-background version.

I wonder whether the problem might not be the background itself, but could instead be due to the background RMS estimate? Maybe comparing the noise for individual sources will clarify what is going on.

stscijgbot-hstdp commented 1 month ago

Comment by Michele De La Pena on JIRA:

Rick White I have put StatsCat.tar in /home/mdelapena/ForRick. These are the catalogs based upon the sigma-clipped statistics. I have put BlobbyCat.tar in /home/mdelapena/ForRick/BLOBBY . These are the catalogs associated with the Photutils background2D where the sources are spatially bunched up for the point catalog.

stscijgbot-hstdp commented 1 month ago

Comment by Rick White on JIRA:

Michele De La Pena I matched the objects in the two segment catalogs (which I will refer to as the Single-bkg and Background2D catalogs). I did the comparisons for both filter f606w and f814w. The results were similar, so I'll show some plots for f606w.

For the f606w catalog, the Single-bkg segment catalog has 104,915 sources while the Background2D version has only 8,859 sources. When I match the catalogs using the xy positions, almost all the Background2D source match (8,486 of 8,859 = 95.7%). I ignore sources that have Flags>5 in either catalog, which leaves 7,906 matches. That is what I compare.

I looked at all the parameters in the catalogs. It looks like the biggest change is in the Area column, which has the area in square pixels of each source segment. For bright objects, the Background2D catalog sources have a larger area (by about a factor of 2). But as sources get fainter, that difference shrinks. For sources around magnitude MagAp2 = 19.8, the areas are about equal. And for fainter sources, the Background2D area continues to shrink so the areas are smaller than the Single-bkg sources.

This is important because one of the criteria for detecting sources in the segment catalog is that the number of pixels above the detection threshold must exceed a lower limit of 15 connected pixels. For the Background2D version of the catalog, that threshold is reached at a magnitude of about 20.5. Fainter sources have only a few pixels in their segments and so get dropped. That 20.5 magnitude limit is very bright in this field -- the Single-bkg catalog includes sources fainter than magnitude 25, or about 100 times fainter than magnitude 20.5.

Here is a plot showing the areas: !compare1_f606w.png|width=100%! The left panel shows the Background2D Area (y-axis) versus the Single-bkg Area (x-axis). The dashed line shows where there are equal. You can see that the Background2D points are above the line for bigger islands (= brighter sources) and then drops below the line. (Note that I added some random noise to the integer values in these plots so all the dots don't get plotted at a single position.)

The right panel shows the difference in the areas versus the MagAp2 magnitude from the Single-bkg catalog. Positive values have larger areas in the Background2D catalog. You can see that the sources start out with much larger areas for bright sources (left side, mag = 17) and then drop dramatically for fainter sources (right edge, mag > 20).

I don't have an explanation for this behavior! It seems like small changes in the sky estimate can create large differences in the apparent areas. Note that these area changes are found across the entire image, not just in the "blobby" regions or the "empty" regions where the background estimate is lower or higher. If we can understand why so few pixels are found above the threshold, maybe we can make some progress toward understanding the effect of Background2D on both catalogs.

stscijgbot-hstdp commented 1 month ago

Comment by Rick White on JIRA:

Looking at the current ops total_trl.txt file for this visit. In the initial sigma-clipped background computation, I see this (editted for compactness):

Background Computation
File: hst_13057_12_acs_wfc_total_jc3f12_drc.fits
Background discriminant - skew threshold: 0.5
Sigma-clipped computed skewness: 1.11
Sigma-clipped Statistics - Background mean: 0.3672  median: 0.3613 rms: 0.06219

Then it runs the Background2D computation and says:

Computing the background using the Background2D algorithm.
Percentile in use: 10
Percentage of negative values in the background subtracted image 0.60 vs low threshold of 15.00.
*** Use the background image determined from the Background2D. ***
Median background: 1.3704
Median RMS background: 9.4578

That seems like a tremendously large change in both the median background (from 0.36 to 1.37) and in the median RMS background (from 0.06 to 9.5!) The huge rms value is probably the reason for the very high detection threshold that gets used in this image.

I wonder whether we have any images where the Background2D results are actually good? I'm going to rummage through the ops cache to see if I can find any.

stscijgbot-hstdp commented 1 month ago

Comment by Rick White on JIRA:

Michele De La Pena

I have done more analysis on the differences between the Background2D and sigma_clip versions of the backgrounds. I have a suggested approach to work around the problem. I say "work around" because my method will avoid using the Background2D results when they are bad and are causing the blobby catalogs, while leaving the results as they are if the Background2D results appear OK. That's not quite as good as changing the parameters used when the Background2D is created. But still I think it will improve the results for the blobby catalogs and will not make unnecessary changes otherwise.

I created a grit repository with my code. Briefly, the approach is:

Read all the trailer files for the HAP products from the HST public cache. Parse out the information on the sigma_clip and Background2D, and write the results to csv files.
Read the csv files in a Jupyter notebook and analyze them. The repository includes the scripts that read the trailer files, the (compressed) csv files with the results, and the Jupyter notebook.

Here is a plot from the notebook that summarizes the results: !HAP-background2.png|width=100%! The 3 columns show the results for ACS/WFC, WFC3/UVIS, and WFC3/IR. (The notebook also includes some plots for ACS/HRC and ACS/SBC, but those are much less common.) The first row plots the median background for Background2D (y-axis) versus the median from sigma_clip (x-axis). The second row plots the rms value from Background2D versus the rms from sigma_clip. Each point is for a single visit (using the total image). The green dashed line shows where the points would be equal. Note both axes are logarithmic.

Most points have similar median sky values (top row), but there are some visits that are outliers. In the bottom row, the rms values are usually smaller for Background2D (below the green line), which is good. But there are a bunch of points where the Background2D rms is much larger than the sigma_clip rms. The orange points show the ones where the increase in rms is at least a factor of 2.

All the bad blobby catalogs in my sample list are in those bad orange points. I have looked at a bunch of images (all the way down to ratios of 1 between the rms values). The large rms values for Background2D look like they are a good way to find the blobby catalogs.

My proposed change is to reject the Background2D if it has an rms that is larger than the sigma_clip rms. A conservative approach would be to apply that only to ACS/WFC and WFC3/UVIS (WFC3/IR does not appear to have the problem, it does not show the worst case points). A more aggressive approach would be to just apply this to all instruments. I think that would also be safe, and I have not seen any cases where that would clearly degrade the results due to a truly variable background not being present.

The number of visits that will be changed by this is fairly small. Here are the counts for various instruments. The last column total is the total number of visits that used Background2D. The ratio>1 column is the number that would instead use sigma_clip. Between 10% and 25% of the catalogs that currently use Background2D would change. And already only about 20% of the catalogs use Background2D, so overall 2% to 5% of the catalogs will change. I think that is probably pretty reasonable.

|| inst || ratio<=1 || ratio>1 || total || | ACS/HRC | 163 | 16 | 179 | | ACS/WFC | 4236 | 536 | 4772 | | WFC3/IR | 3301 | 931 | 4232 | | WFC3/UVIS | 2639 | 358 | 2997 |

stscijgbot-hstdp commented 1 month ago

Comment by Michele De La Pena on JIRA:

@⁣rlw Thank you for all your hard work. I will check what you have done now that I am back to work on this ticket. I had to step away for a bit to work other issues. Just so you know, I am probably going to close this ticket and open a new one which is to implement a solution to this problem. I had already changed this ticket to be an {}investigation{}. Of course the new ticket will reference this ticket.

stscijgbot-hstdp commented 1 month ago

Comment by Michele De La Pena on JIRA:

Closing this ticket as Done as it was meant to be "discovery/investigation". The suggested solutions HLA-1284, HLA-1286, and HLA-1287 will implement solutions or at least mitigating algorithms regarding this issue.