pysal / mgwr

Multiscale Geographically Weighted Regression (MGWR)
https://mgwr.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Thoughts about adaptive gaussian #3

Open TaylorOshan opened 6 years ago

TaylorOshan commented 6 years ago

In association with Ziqi's original issue; moved here to protect any potential intellectual merit.

TaylorOshan commented 6 years ago

Original issue text:

The adaptive Gaussian and adaptive bisquare show totally different curves for the model PctBach ~ FB + Black + Pov. The optimal bandwidth found for the adaptive Gaussian would be local, while for the adaptive bisquare it would be global. Testing against GWmodel shows exactly the same curves. Testing against the ArcGIS adaptive Gaussian shows a curve similar to PySAL's adaptive bisquare. BTW, ArcGIS claims their Gaussian kernel is w = exp(-(d/b)**2), while we are using exp(-0.5*(d/b)**2), which is from the book. Will look into it to see what it implies.
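Worth noting: those two forms differ only by a sqrt(2) rescaling of the bandwidth, since exp(-(d/b)**2) = exp(-0.5*(d/(b/sqrt(2)))**2), so for a fixed bandwidth the formula difference amounts to a horizontal rescaling of the AICc curve rather than a qualitatively different shape. A quick check with plain NumPy and arbitrary values:

import numpy as np

d = np.linspace(0, 10, 101)
b = 3.0
arcgis = np.exp(-(d/b)**2)                 #ArcGIS's claimed form
book = np.exp(-0.5*(d/(b/np.sqrt(2)))**2)  #book form with bandwidth b/sqrt(2)
np.testing.assert_allclose(arcgis, book)   #identical curves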

#Imports
import numpy as np
import matplotlib.pyplot as plt
import pysal
from pysal.contrib.gwr.gwr import GWR
from pysal.contrib.gwr.sel_bw import Sel_BW
from pysal.contrib.glm.family import Gaussian, Poisson, Binomial

#Load the Georgia (GA) example data
data = pysal.open(pysal.examples.get_path('GData_utm.csv'))
y = np.array(data.by_col('PctBach')).reshape((-1, 1))
fb = np.array(data.by_col('PctFB')).reshape((-1, 1))
pov = np.array(data.by_col('PctPov')).reshape((-1, 1))
blk = np.array(data.by_col('PctBlack')).reshape((-1, 1))
X = np.hstack([fb, pov, blk])
coords = list(zip(data.by_col('X'), data.by_col('Y')))

#Fit a GWR model at each candidate bandwidth and record the AICc
pysal_aicc_gau = []
pysal_aicc_bisq = []
for bw in range(20, 160):
    results = GWR(coords, y, X, bw, fixed=False, kernel='gaussian', family=Gaussian()).fit()
    pysal_aicc_gau.append(results.aicc)
    results = GWR(coords, y, X, bw, fixed=False, kernel='bisquare', family=Gaussian()).fit()
    pysal_aicc_bisq.append(results.aicc)

#Plot AICc against bandwidth for both kernels
bw = range(20, 160)
plt.figure(figsize=(12, 8))
#plt.plot(bw, arc_aiccs, 'r-', label="ArcGIS Adaptive 'Gaussian'") #Calculated from ArcGIS
plt.plot(bw, pysal_aicc_bisq, 'b-', label='PySAL Adaptive Bisquare')
plt.plot(bw, pysal_aicc_gau, 'y-', label='PySAL Adaptive Gaussian')
plt.xlabel('Bandwidth', fontsize=14)
plt.ylabel('AICc', fontsize=14)
plt.title('Georgia: Bach~ForeignBorn+Poverty+Black', fontsize=20)
plt.legend(fontsize=14)
plt.show()

[figure: AICc vs. bandwidth for the adaptive Gaussian and adaptive bisquare kernels, Georgia model]

TaylorOshan commented 6 years ago

@Ziqi-Li comment:

By default, the truncate parameter in adapt_gauss from kernels.py is set to False; changing it to True gives the following AICc curve. Also attached are the kernel weight curves. It seems that beyond the truncation point (i.e., beyond the local neighbors), the Gaussian still carries fairly large weights.

def adapt_gauss(coords, nn, points=None, dmat=None, sorted_dmat=None):
    w = _Kernel(coords, fixed=False, k=nn-1, function='gwr_gaussian',
                truncate=True,  #changed from the default False
                points=points, dmat=dmat, sorted_dmat=sorted_dmat)
    return w.kernel

[figures: AICc curve with truncation enabled; kernel weight curves]

TaylorOshan commented 6 years ago

@ljwolf comment:

might the PySAL gaussian and arcgis gaussian differ precisely by a factor of (2*pi)**-2? That's the only thing I can think of immediately, since what you've posted will default to the gwr_gaussian implementation in gwr.kernels, which is not the kernel that pysal.weights.Distance.Kernel uses upstream, but the sklearn gaussian.

TaylorOshan commented 6 years ago

@ljwolf: No, that's not it, as far as I can tell.

@Ziqi-Li: Sorry, the name "PySAL's gaussian" I used is ambiguous; I meant gwr_gaussian.

@ljwolf: OK. I found something I did not expect, and I am confused.

TaylorOshan commented 6 years ago

@ljwolf:

@TaylorOshan

We compute the distance matrix divided by the bandwidth on line 82 of kernels.py. This gets piped down into the kernel functions, and this is all fine and dandy.

However, the D-matrix is symmetric pairwise distances, and self.bandwidth is a column vector for this data.

Taking @Ziqi-Li's code:

from pysal.contrib.gwr.kernels import _Kernel
test_kernel = _Kernel(coords, bandwidth=None, fixed=False, k=10) #like the adaptive call happening in the GWR class. 

test_kernel.bandwidth.shape  #(159, 1)
test_kernel.dmat.shape       #(159, 159)

Immediately, I think something's wrong with then using

self._kernel_funcs(self.dmat/self.bandwidth)

because self.bandwidth, as a column vector, will get broadcast columnwise over dmat. For example:

test_square = np.arange(25).reshape(5,5)
test_divisor = np.ones((5,1))*5
print(test_square)
# [[ 0  1  2  3  4]
# [ 5  6  7  8  9]
# [10 11 12 13 14]
# [15 16 17 18 19]
# [20 21 22 23 24]]
print(test_square/test_divisor)
# [[0.  0.2 0.4 0.6 0.8]
# [1.  1.2 1.4 1.6 1.8]
# [2.  2.2 2.4 2.6 2.8]
# [3.  3.2 3.4 3.6 3.8]
# [4.  4.2 4.4 4.6 4.8]]

For our purposes, this makes the "kernel" matrix no longer symmetric, since it's dividing each column by each element of the bandwidth vector. So each column of the kernel is only matched to the ith bandwidth, and each row is standardized by all bandwidths.

np.testing.assert_allclose(test_kernel.dmat, test_kernel.dmat.T) #passes
np.testing.assert_allclose(test_kernel.kernel, test_kernel.kernel.T) #fails!

Now, to be fair, this also fails in PySAL, which is also not something I would expect:

psW = pysal.weights.Distance.Kernel(coords, k=10, fixed=False, function='gaussian').full()[0]
np.testing.assert_allclose(psW, psW.T) #fails!

Where I think this gives us a little problem is that the actual GWR estimating routine uses the row of W, not the column.

If it used the column, it would be using the weights corresponding to observation i that have been standardized by that bandwidth. If it uses the row, each element is actually d/bw[j] for j = 1, 2, ..., N. Since bw[j] is different for each j, this means all N bandwidths are involved in the computation for the ith row of the kernel matrix.

Succinctly, the way we have it, we have N bandwidths at each point, not 1 bandwidth at N points. Since the adaptive kernel isn't symmetric, these are not equivalent.

Not sure what this means, but I am confused.

TaylorOshan commented 6 years ago

@ljwolf:

I feel it should use the column because, as I understand the adaptive bandwidth, it should be the vector containing the distances for observation i divided by observation i's unique bandwidth. That would be the column, not the row.

TaylorOshan commented 6 years ago

@ljwolf:

No, sorry, wrong again. The broadcasting is right:

np.ones((5,5)) / (np.arange(5,10).reshape(5,1))
array([[0.2       , 0.2       , 0.2       , 0.2       , 0.2       ],
       [0.16666667, 0.16666667, 0.16666667, 0.16666667, 0.16666667],
       [0.14285714, 0.14285714, 0.14285714, 0.14285714, 0.14285714],
       [0.125     , 0.125     , 0.125     , 0.125     , 0.125     ],
       [0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111]])

but the fact that the kernel is asymmetric is throwing me off.

TaylorOshan commented 6 years ago

@Ziqi-Li:

I think it is the line that makes it asymmetric: sorting each row by ascending distance. Only for adaptive, though.

TaylorOshan commented 6 years ago

So I was taking a look at this, and I see how that line could make the D-matrix asymmetric. That line of code first sorts the values and then selects only the "k" nearest neighbors, and only when k is not None, which indicates it should be an adaptive kernel. But does it matter that it is not symmetric? At that point, it's not a comprehensive D-matrix; it just encapsulates the NN distances and wouldn't be expected to be symmetric. Does that make sense? I imagine if it were a big issue it would cause lots of trouble downstream.
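A minimal illustration of why an adaptive kernel can't generally be symmetric, using made-up toy coordinates: d_ij = d_ji, but row i is scaled by i's own bandwidth, so the weights differ whenever the bandwidths do.

import numpy as np
from scipy.spatial.distance import cdist

#Toy coordinates: the third point sits far from the other two
pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
d = cdist(pts, pts)                          #symmetric pairwise distances

#Adaptive bandwidth: each point's distance to its 2nd nearest neighbor
b = np.sort(d, axis=1)[:, 2].reshape(-1, 1)

w = np.exp(-0.5*(d/b)**2)                    #row i is scaled by b[i]
print(np.allclose(d, d.T))                   #True: distances are symmetric
print(np.allclose(w, w.T))                   #False: adaptive weights are not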

TaylorOshan commented 6 years ago

@ljwolf:

Yeah, it makes sense. I got ahead of myself, since I'm using them in another context where I'd expected (or assumed) them to be symmetric. Really shot myself in the foot doing so, now that I realize they're not.

In general, if each observation has its own bandwidth, then the "kernel" becomes just a list of vectors of the weights for each observation given those bandwidths. It's my fault for assuming their concatenation is a "kernel" in the basis function sense, which is symmetric.

TaylorOshan commented 6 years ago

So, I don't know what is going on with the ArcGIS kernels, but from Ziqi's plots, it seems either they are indeed using a bisquare and not a Gaussian, or they have some very sophisticated mechanism to truncate a Gaussian into something very similar to a bisquare. But I don't think it's productive to focus on that any further, because we may never know what they actually have going on under the hood. Plus, I'm not really getting the sense that what we currently have is wrong for any particular reason. Rather, by design, the Gaussian kernel gives a much different curve than the bisquare. This is also the case with the exponential kernel (shown below), which, for me, is reassuring, since I would expect very similar results between Gaussian and exponential, and it also heightens my suspicion that the ArcGIS results are not simply a Gaussian kernel.

I did some experimenting by comparing different kernel functions across each type of bandwidth (i.e., fixed vs. adaptive). First for the same Georgia data example...

[figures: AICc vs. bandwidth for adaptive kernels (top) and fixed kernels (bottom), Georgia data]

Top is adaptive kernels and bottom is fixed kernels. The solid vertical lines represent the optimal bandwidth for each kernel function, while the dashed vertical lines (only red and yellow, for Gaussian and exponential) represent the "truncated" bandwidth value. It's not a real truncation, though: it is the distance/NN at which the data are essentially weighted at zero (approx. < 0.05) for the given optimal bandwidth, obtained by plugging the optimal bandwidth and the desired zero-weight approximation into the kernel and solving for the associated distance/NN. In reality, I just did some plugging and chugging to approximate the value, but it can be solved for analytically, as below.
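For example, with the gwr_gaussian form exp(-0.5*(d/b)**2) and a 0.05 cutoff, the zero-weight distance works out to d = b*sqrt(-2*ln(0.05)) ≈ 2.45b, and for the exponential exp(-d/b) it is d = -b*ln(0.05) ≈ 3b. A small sketch:

import numpy as np

def gaussian_cutoff(b, w_min=0.05):
    """Distance at which exp(-0.5*(d/b)**2) falls to w_min."""
    return b * np.sqrt(-2.0 * np.log(w_min))

def exponential_cutoff(b, w_min=0.05):
    """Distance at which exp(-d/b) falls to w_min."""
    return -b * np.log(w_min)

print(gaussian_cutoff(1.0))     #~2.448 bandwidths
print(exponential_cutoff(1.0))  #~2.996 bandwidths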

Anyway, I don't think there is anything weird going on with fixed vs. adaptive or bisquare vs. Gaussian. The behavior of each kernel function is consistent for either fixed or adaptive bandwidths. I just think the nature of the Gaussian/exponential kernels is to weight the data much less intensely and without a real truncation, so they produce a much smaller bandwidth value than a bisquare function. Solving for an approximate zero-weight truncation point yields a distance/NN that is closer to that from the bisquare function but ultimately is not equivalent.

I also did the same experiment for data simulated with a single covariate/simulated coefficient surface, using some of the data from the comparison paper. I thought it might be useful to see whether all of the kernels produce the expected order of bandwidths for B1 (regional) and B2 (local), where the expectation is Bandwidth_B1 > Bandwidth_B2. It's hard to tell in the results below, but for all of the kernels it always holds that B1 > B2, so long as there aren't rules imposed on the minimum search value.

[figures: results for the B1 (regional) coefficient surface]

[figures: results for the B2 (local) coefficient surface]

These curve shapes are consistent with the Georgia dataset, so overall I don't think there is anything inherently wrong with any of the kernels. I imagine all of the kernels here produce similar coefficients and could potentially be useful for producing "relative" scales, in that the bandwidths from MGWR could be compared to each other. However, interpretively, the bisquare seems to be at an advantage, since its bandwidth quantifies both the intensity of decay and exactly the extent of the non-zero data included in the model, which is more intuitive and may be useful for comparing across study areas or different models. Perhaps we could still achieve something similar for the fixed Gaussian/exponential kernels by standardizing the distance matrix. I am not sure how one would standardize a nearest-neighbor metric, so it might be useful to exclude an adaptive Gaussian/exponential specification.

I can confirm that, as the code stands, the fixed bisquare (and only the fixed bisquare) will truncate the data like the adaptive bisquare. At first glance, it seems the derived fixed distance is indicative of the distance associated with the Nth nearest neighbor (more precisely, the average distance to the Nth nearest neighbor across the rows of the weights), so I think another advantage of the bisquare is that the fixed and adaptive bandwidths may be more closely related. Perhaps the same holds for Gaussian/exponential if the weights are manually truncated. Gotta think about it a bit more...
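To make that fixed-vs-Nth-neighbor comparison concrete, here is a minimal sketch (using scipy's cKDTree rather than the existing _Kernel machinery) for computing the average Nth-nearest-neighbor distance, which could be compared against the optimal fixed bisquare bandwidth:

import numpy as np
from scipy.spatial import cKDTree

def avg_nth_neighbor_distance(coords, n):
    """Average distance to the nth nearest neighbor across all points."""
    tree = cKDTree(np.asarray(coords))
    #Query n+1 neighbors because each point's nearest "neighbor" is itself
    dists, _ = tree.query(coords, k=n + 1)
    return dists[:, -1].mean()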

TaylorOshan commented 6 years ago

@ASFotheringham: What are the units in the above plots for the fixed bandwidths: are they pixel lengths, or some measure of distance that relates to the study region?

I don't think we can compare the fixed bandwidths from the Gaussian and bisquare, can we? Isn't the Gaussian bandwidth the distance at which the weights are exp(-0.5)? I am not sure how the fixed bisquare is computed: is that the distance where the weights fall to zero?

Given that the optimal bandwidth is supposed to be a trade-off between bias and variance, is there some measure other than AICc that could be used to determine an optimal bandwidth? It might be interesting, as a first step, to plot the accuracy of the local estimates (bias) against bandwidth, and also some average measure of the standard errors of the local estimates (variance) at each bandwidth. We can only do this for GWR at the moment, because we don't have the SEs of the parameter estimates for MGWR.

Intuitively I like the adaptive bisquare: it has a natural interpretation, and for local modeling it makes more sense than a Gaussian. I've never been convinced of the need for a fixed kernel either: if the optimal weighting function is the same everywhere and, presumably, the points are distributed uniformly, wouldn't the adaptive bisquare give you the same results as a fixed one? One possible simplifying way forward is therefore to examine the usefulness of kernels other than an adaptive bisquare. If there is no demonstrable need for anything other than an adaptive bisquare, why bother with the other kernels?

TaylorOshan commented 6 years ago

@ljwolf

In general, shouldn't these work for any kernel function? Like... Cleveland's original exploration in the univariate space used tricube functions, something we don't even consider. I think any arbitrary kernel should be (in theory) useful, even though we have our preferences.

I do bet @ASFotheringham is right on the scoring, and we could get different (maybe better) results using a more explicit tradeoff than the AICc, & looking into the bias & local variance would be a good way to get into this. We still get a site-specific MSE in both GWR and MGWR, so I'd use that as a starting point for that variance estimate.
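A rough sketch of that bias/variance plot for plain GWR; the attribute names used here (.bse for the parameter standard errors, .predy for the fitted values) are assumptions and may need adjusting to whatever the installed version actually exposes:

import numpy as np
from pysal.contrib.gwr.gwr import GWR
from pysal.contrib.glm.family import Gaussian

bws = range(20, 160)
mean_se, fit_mse = [], []
for bw in bws:
    res = GWR(coords, y, X, bw, fixed=False, kernel='bisquare',
              family=Gaussian()).fit()
    mean_se.append(np.mean(res.bse))             #average SE: a variance proxy
    fit_mse.append(np.mean((y - res.predy)**2))  #fit MSE: a (rough) bias proxy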

ljwolf commented 6 years ago

@Ziqi-Li more relevant results, I think: This private notebook shows the optimization surfaces for bandwidths in the Georgia example.

I find that the gaussian adaptive stuff sometimes is hyperlocal, when the fixed version is definitely not.

I also find that, sometimes, the kernel and the score can be really influential, but the fixed gaussians and fixed bisquare behave really well, and the adaptive bisquare, too.

Also, BIC is local for adaptive bisquare and global for fixed bisquare. So, not sure what to make of that.

ljwolf commented 6 years ago

Looking at @TaylorOshan's results, I think there's a substantive difference in how we're storing the function. I'm inclined to prefer how I do it, but I don't understand why they'd be different. I prefer how I'm doing it in that notebook because I'm storing the actual callable function gwr_func that is optimized (either by scipy or by our search.golden_section). It is this function that directly governs the bandwidth, not the diagnostics scraped off the GWRResults object the way you're measuring the bw. But I really don't understand why they're different.... 😦

Then, I'm plotting that function over a range of values. You can see that, in the notebook, identified solutions either clearly mark the minima of those functions, or reflect the "maximum" allowable bandwidth, bw_max.

I don't see how this is similarly shown in the other method. So, I'm not sure how the objective function that we're actually optimizing and the final model scores could be so different. I'm fitting exactly the same model as Taylor, so far as I can tell.

TaylorOshan commented 6 years ago

I need to take a closer look at your notebook, @ljwolf, but I noticed we do have different models fit. Mine was following Ziqi's rather than what is used in the tests, so it's pct_Bach ~ pctFB + pctPov + pctBlack rather than pct_Bach ~ pctRural + pctPov + pctBlack. I'd be surprised if that was causing such a large difference, though... I need to take a closer look to see if I can spot why caching the opt function should be any different than just running separate GWRs.

ljwolf commented 6 years ago

OK, I can update that and rerun the notebook in like 5 minutes.

I'm less concerned about the model and more concerned that the AIC you're pulling off using:

results = []
for bandwidth_i in range(min_bw, max_bw):
    results.append(GWR(coords, y, X, bandwidth_i).fit().aicc)

is not the same surface as what you get if you just evaluate the objective function actually being optimized in our numerical routines, i.e., the lambda bw: getDiag[criterion](GWR(coords, y, X, bw).fit()) call

ljwolf commented 6 years ago

Changing the model does change a few of the score functions being optimized, but they still don't really look like what's posted above.

Attached is a notebook with the updated model. A few changes, but... it still looks different from peeling the AICc off after manually setting the bw.

notebook

TaylorOshan commented 6 years ago

OK, I did a quick test, because it looked like we were using different supports. I originally evaluated the adaptive bisquare/Gaussian on the interval [10, 160], which produced something like

[figure: AICc curves evaluated on the interval [10, 160]]

and if I instead evaluate them on the interval you used of [20, 180], then I get this

[figure: AICc curves evaluated on the interval [20, 180]]

which looks to be right on track with yours. I think more than anything it's just that I plotted the two functions on the same figure, and the AICc values for the adaptive bisquare are relatively large on the range [10, 20] that I include, so much of the fine-grain trend is obfuscated. It could also be that in some cases different bandwidths were found: allowing golden section to search in the area < 20 NN gives a different answer than starting at 20.

ljwolf commented 6 years ago

Ahh... OK, just a scale issue. Then, good: they're the same, just appearing different. Great.

Then, yes, I think we should change the maximum value over in the other PR, since half the maximum interpoint distance seems to me to be too small of a search zone. I think it should be plausible that, even if a bandwidth hits that "full map" value, it's still distance-weighting all observations, so it'll be a "local" model around the point.
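For reference, a sketch of the current cap and the proposed one (pdist from scipy; coords as in the earlier example):

import numpy as np
from scipy.spatial.distance import pdist

#Current rule, as described above: cap the fixed-bandwidth search at half
#the maximum interpoint distance; the proposal is to allow the full distance
max_dist = pdist(np.asarray(coords)).max()
bw_max_current = max_dist / 2.0
bw_max_proposed = max_dist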

TaylorOshan commented 6 years ago

Was just thinking about this... Stewart has asked me a few times whether we should remove the option to do an adaptive Gaussian/exponential entirely, since they do not truncate to 0 at the Nth nearest neighbor like the bisquare. But I think these might be useful for Poisson and binomial regression, where it can sometimes be an issue to make sure every sub-sample is not saturated with zeros and then becomes ill-conditioned. Just a thought I wanted to get down before it escapes me!

ljwolf commented 6 years ago

We could always adopt a pysal-style truncation: do a Gaussian at the k nearest neighbors with a cliff beyond that?
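Something like this, maybe; a minimal sketch of that idea in plain NumPy/SciPy (not wired into the existing _Kernel class): Gaussian weights bandwidthed at each point's kth-nearest-neighbor distance, with zero weight beyond the kth neighbor.

import numpy as np
from scipy.spatial import cKDTree

def truncated_adaptive_gaussian(coords, k):
    """Adaptive Gaussian weights with a hard cliff beyond the kth neighbor."""
    coords = np.asarray(coords)
    n = len(coords)
    tree = cKDTree(coords)
    dists, idx = tree.query(coords, k=k + 1)  #+1: each point's nearest neighbor is itself
    bw = dists[:, -1].reshape(-1, 1)          #adaptive bandwidth per point
    w = np.zeros((n, n))                      #weights default to zero (the cliff)
    rows = np.repeat(np.arange(n), k + 1)
    w[rows, idx.ravel()] = np.exp(-0.5*(dists/bw)**2).ravel()
    return w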