Misleading threshold meaning when filtering is on

kbarbary commented 7 years ago

One would think that given a constant threshold, filtering would increase the number of objects detected. However, this is not the case:

>>> import sep, fitsio
>>> data = fitsio.read("data/image.fits")
>>> bkg = sep.Background(data, bw=64, bh=64, fw=3, fh=3)
>>> bkg.subfrom(data)
>>> objects1 = sep.extract(data, 2.0, err=bkg.globalrms, filter_kernel=None)

>>> objects2 = sep.extract(data, 2.0, err=bkg.globalrms)

>>> len(objects1), len(objects2)
(77, 60)

With no filter: 77 detections. With a filter, only 60.

It turns out that this is because when filtering, the signal-to-noise ratio (SNR) at a given pixel position is implemented as sum(d[i,j] * k[i, j]) / sum(k[i, j]) (where d is the data and k is the kernel and sums are over pixels in the kernel. However, the correct formula is sum(d[i,j] * k[i, j]) / sqrt(sum(k[i, j]^2). Thus, when we ask for the minimum SNR to be "2.0" when filtering, we're actually asking for a minimum SNR of 2.0 * sum(k[i, j]) / sqrt(sum(k[i, j]^2). We can "correct" this to get a real SNR threshold of 2.0 by doing:

>>> k = sep.default_kernel
>>> correction = np.sqrt(np.sum(k**2)) / np.sum(k)
>>> len(sep.extract(data, 2.0 * correction, err=bkg.globalrms))
162

Note that SExtractor seems to make the same mistake.

This is an easy fix, but it would change the meaning of user's code, so I'm very hesitant to make it. The really annoying part is that when filter_type='matched' and the noise is an array, the correct SNR formula is used, but when filter_type='matched' and noise is a scalar, the incorrect formula is used. The result is that one gets very different answers when noise is a scalar and when noise is a constant array of the same value.

cosmosLukas commented 2 years ago

Thanks for pointing this out! To make my question more clear, lets call sum(d[i,j] * k[i, j]) / sum(k[i, j]) expression 1, sum(d[i,j] * k[i, j]) / sqrt(sum(k[i, j]^2) expression 2, and 2.0 * sum(k[i, j]) / sqrt(sum(k[i, j]^2) expression 3. It seems to me that the correction you suggest applies to expression 3, which assumes that the correct formula for SNR is being used (expression 2); it is clear that applying your suggested correction to expression 3 would yield 2.0. However, you first point out that the wrong formula is being used by SEP/Source Extractor (in some cases; expression 1). So how does the correction fix the original issue, that the wrong formula (expression 1) is being used by the codes?

kbarbary commented 2 years ago

Here is perhaps another way to say it:

As implemented (incorrectly), the code checks that sum(d[i,j] * k[i, j]) / sum(k[i, j]) > thresh

If this evaluates true, we count the pixel as a detection. We'd like this to instead be the correct check: sum(d[i,j] * k[i, j]) / sqrt(sum(k[i, j]^2) > thresh

We aren't going to change the implementation in the code, but we can achieve the same thing by modifying what we put in for thresh. By multiplying both sides of the above inequality by sqrt(sum(k[i, j]**2)) / sum(k[i, j]), the "correct" inequality can be rewritten as sum(d[i,j] * k[i, j]) / sum(k[i, j]) > thresh * sqrt(sum(k[i, j]**2)) / sum(k[i, j])

Note that the left side here is the same as the implementation (first equation). So if we put thresh * sqrt(sum(k[i, j]**2)) / sum(k[i, j]) into the code where thresh would normally go, we'll get the "correct" inequality.

Hope that helps.

kbarbary / sep

Misleading threshold meaning when filtering is on #53