pysal / mgwr

Multiscale Geographically Weighted Regression (MGWR)
https://mgwr.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
368 stars 126 forks source link

'ValueError: kth out of bounds' in MGWR fitting #88

Closed DKchenliyuan closed 3 years ago

DKchenliyuan commented 3 years ago

Hi! I try to fit a MGWR model based on the data (post-process .shp file in attachment). point_trans.zip

My code is as follow:

soil = gp.read_file('point_trans.shp')
s_y = soil['TNPC'].values.reshape((-1,1))  
s_x = soil[['SOCgkg', 'ClayPC', 'SiltPC', 'NO3Ngkg', 'NH4Ngkg']].values
s_u = soil['X']
s_v = soil['Y']
s_coords = list(zip(s_u, s_v))
s_x = (s_x - s_x.mean(axis = 0)) / s_x.std(axis = 0)
s_y = (s_y - s_y.mean(axis = 0)) / s_y.std(axis = 0)
mgwr_selector_soil = Sel_BW(s_coords, s_y, s_x, multi = True, fixed = True)
mgwr_bw_soil = mgwr_selector_soil.search()
print(mgwr_bw_soil)
mgwr_results_soil = MGWR(s_coords, s_y, s_x, mgwr_selector_soil).fit()

First, I notice that the bandwidths of 'ClayPC' and 'NH4Ngkg' (both 7484.08m) exceed the maximum distance (3742.06m) between observation points. These two bandwidths were both 3741.7m in the existing study (see Analyst A in Comber et al. 2020). I don't know if this bias is caused by my mistake. It seems that it happens when the bandwidth tends to be global.

Besides, I get the following error message when the model fitting:

ValueError                                Traceback (most recent call last)
<ipython-input-44-7199e7274476> in <module>
----> 1 mgwr_results_soil = MGWR(s_coords, s_y, s_x, mgwr_selector_soil).fit()

C:\Anaconda\envs\basemap\lib\site-packages\mgwr\gwr.py in fit(self, n_chunks, pool)
   1593                        tqdm(range(self.n_chunks), desc='Inference'))
   1594 
-> 1595         rslt_list = list(zip(*rslt))
   1596         ENP_j = np.sum(np.array(rslt_list[0]), axis=0)
   1597         CCT = np.sum(np.array(rslt_list[1]), axis=0)

C:\Anaconda\envs\basemap\lib\site-packages\mgwr\gwr.py in _chunk_compute_R(self, chunk_id)
   1520 
   1521         for i in range(n):
-> 1522             wi = self._build_wi(i, self.bw_init).reshape(-1, 1)
   1523             xT = (self.X * wi).T
   1524             P = np.linalg.solve(xT.dot(self.X), xT).dot(init_pR).T

C:\Anaconda\envs\basemap\lib\site-packages\mgwr\gwr.py in _build_wi(self, i, bw)
    236             wi = Kernel(i, self.coords, bw, fixed=self.fixed,
    237                         function=self.kernel, points=self.points,
--> 238                         spherical=self.spherical).kernel
    239         except BaseException:
    240             raise  # TypeError('Unsupported kernel function  ', kernel)

C:\Anaconda\envs\basemap\lib\site-packages\mgwr\kernels.py in __init__(self, i, data, bw, fixed, function, eps, ids, points, spherical)
     56             self.bandwidth = np.partition(
     57                 self.dvec,
---> 58                 int(bw) - 1)[int(bw) - 1] * eps  #partial sort in O(n) Time
     59 
     60         self.kernel = self._kernel_funcs(self.dvec / self.bandwidth)

<__array_function__ internals> in partition(*args, **kwargs)

C:\Anaconda\envs\basemap\lib\site-packages\numpy\core\fromnumeric.py in partition(a, kth, axis, kind, order)
    744     else:
    745         a = asanyarray(a).copy(order="K")
--> 746     a.partition(kth, axis=axis, kind=kind, order=order)
    747     return a
    748 

ValueError: kth(=972) out of bounds (689)

There are 689 observation points in the data, but what does 'kth(=972)' mean? How can I deal with this problem?

Any comments on this greatly appreciated.

Ziqi-Li commented 3 years ago

@DKchenliyuan

Thanks for reporting the issue. What I see here is that you are using a fixed bi-square kernel. We had some internal discussions previously and found that the combinations of fixed-bisquare and adaptive-gaussian sometimes have some fitting issues. So we would suggest to use either adaptive-bisquare or fixed-gaussian kernel. I think the reason is that a fixed-bisquare kernel may contain too few points (plus the weights drop to zero at bandwidth) at some locations when the spatial distribution of the points are very irregular. If you changed to a gaussian kernel, it would run ok. Please see the attached screenshot.

Regarding the bandwidth exceeds the largest pair-wise distance, this is possible and expected when the relationship approaches global. The upper bound of the bandwidth search range in mgwr and GWmodel may be different. Here we use the twice of the maximum pairwise distance as the upper limit. See this line of code here. That may explain why the bandwidth is mgwr is almost twice of the one reported in the Comber paper (I guess Lex was using GWmodel)?

image
DKchenliyuan commented 3 years ago

@Ziqi-Li Thanks for your comment. It helps me a lot. An additional question: I find mgwr seems to have no functions to directly or conveniently fit the Mixed GWR model (such as gwr.mixed or gwr.multiscale(...bws0=c(Inf, 100, 100, Inf)...) in GWmodel). Is that right?

Ziqi-Li commented 3 years ago

@DKchenliyuan Yes, the bandwidth cannot be forced to be inf yet. Please see this issue for some discussions #4. It is not something hard to do, so we could probably add this in somewhere in the summer.

I'm closing the issue, if anything else pops up, feel free to reopen it.