knaaptime commented 3 years ago

this is a first draft of a kriging module based on pykrige. Initial explorations were pretty positive, though the quality of the interpolation obviously depends a great deal on the variogram fit.
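
For anyone who wants to eyeball the variogram before trusting a fit, here is a minimal pure-Python sketch of the empirical semivariogram, gamma(h) = sum of squared differences / (2 * number of pairs) for point pairs binned by distance. The function name, the binning scheme and the toy data are assumptions for illustration; this is not what pykrige does internally and not code from this PR.

```python
import math
from collections import defaultdict


def empirical_semivariogram(coords, values, lag_width, n_lags):
    """Return {lag_index: semivariance} for point pairs binned by distance.

    Hypothetical helper for illustration; not part of tobler or pykrige.
    Lag k collects pairs whose distance falls in [k * lag_width, (k + 1) * lag_width).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            dist = math.dist(coords[i], coords[j])
            lag = int(dist // lag_width)
            if lag < n_lags:
                sums[lag] += (values[i] - values[j]) ** 2
                counts[lag] += 1
    # gamma(h) = sum of squared differences / (2 * number of pairs)
    return {lag: sums[lag] / (2 * counts[lag]) for lag in sorted(counts)}


# Toy data: values rise steadily along a line, so semivariance grows with distance
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
values = [1.0, 2.0, 3.0, 4.0]
gamma = empirical_semivariogram(coords, values, lag_width=1.0, n_lags=4)
```

If the binned semivariances don't look anything like the shape of the chosen variogram model, the interpolated surface probably shouldn't be trusted either.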

codecov-commenter commented 3 years ago

Codecov Report

Merging #140 (e3a07b6) into master (32c8525) will decrease coverage by 3.45%. The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master     #140      +/-   ##
==========================================
- Coverage   81.25%   77.79%   -3.46%     
==========================================
  Files          17       19       +2     
  Lines         832      869      +37     
==========================================
  Hits          676      676              
- Misses        156      193      +37

| Impacted Files | Coverage Δ |
|---|---|
| `tobler/kriging/__init__.py` | `0.00% <0.00%> (ø)` |
| `tobler/kriging/kriging.py` | `0.00% <0.00%> (ø)` |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 32c8525...e3a07b6.

knaaptime commented 3 years ago

currently this is just to get started exploring the mechanics of the external libraries. The first draft takes a really naive approach, assigning the predicted value for the target_df centroid to the whole polygon. Instead, we should probably generate a geocube raster of the prediction surface, then allow both (a) averaging of pixel values inside the polygon and (b) proper block kriging

knaaptime commented 3 years ago

agreed on both.

i've also played around a bit further and there are a few different ways we could go about this (and maybe provide options for more than one). The question is how we want to shoehorn the very discrete process of human geography into a continuous spatial model (though as you said, it should work reasonably for percentages).

1. The approach in the current PR is the simplest (probably overly so). It estimates the model using polygon centroids from source_df as observations, then uses that model to predict values at the centroids of target_df. The issue here is that, especially for extensive variables like counts, we end up way overestimating the volume of the total surface (so that implementation also includes a rescale). We don't have "control" observations in places with 0 population, so the estimated surface doesn't have the variation we need it to have
2. Estimate the model using source_df centroids, then predict a continuous raster, then take the average of pixel values that fall inside target_df polygons. I think this is closer to the spirit of block kriging, though I'm still looking for the best reference
3. Rasterize input_df and estimate the model using that raster, then predict a continuous raster and take the average within target_df polys. This might help capture some of the "harder" edges between polygons that get overly smoothed in approach (1), but it also kind of inflates the data (estimating raster resolution x polygon area "observations" instead of one per polygon), so we might end up with some oddities for places with lots of heterogeneously-sized polygons. This is also really computationally intensive because the training data becomes so large, so a hybrid option of sorts might be to use something like pointpats to drop random points inside each polygon and use those as observations
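
To make the "average of pixel values inside target_df polygons" step and the rescale concrete, here is a rough pure-Python sketch. Everything in it is an assumption for illustration: the hard-coded `surface` stands in for a kriged prediction raster, polygons are plain vertex lists rather than geometries from a GeoDataFrame, and `rescale_to_total` is just one plausible reading of the rescale mentioned above (scaling the estimates so they sum to a known source total), not necessarily what the PR does.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray cast to the right of (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def zonal_mean(grid, cell_size, polygon):
    """Average the cells of grid whose centers fall inside polygon.

    grid[row][col] is the predicted value for the cell whose lower-left
    corner sits at (col * cell_size, row * cell_size).
    """
    inside = [
        value
        for row, row_values in enumerate(grid)
        for col, value in enumerate(row_values)
        if point_in_polygon((col + 0.5) * cell_size, (row + 0.5) * cell_size, polygon)
    ]
    return sum(inside) / len(inside) if inside else float("nan")


def rescale_to_total(estimates, source_total):
    """Scale estimates so they sum to the known source total (assumed rescale)."""
    factor = source_total / sum(estimates)
    return [e * factor for e in estimates]


# Stand-in for a kriged prediction surface (4 x 4 cells of size 1)
surface = [
    [1.0, 1.0, 3.0, 3.0],
    [1.0, 1.0, 3.0, 3.0],
    [5.0, 5.0, 7.0, 7.0],
    [5.0, 5.0, 7.0, 7.0],
]
left_half = [(0, 0), (2, 0), (2, 4), (0, 4)]
right_half = [(2, 0), (4, 0), (4, 4), (2, 4)]
means = [zonal_mean(surface, 1.0, left_half), zonal_mean(surface, 1.0, right_half)]
counts = rescale_to_total(means, source_total=100.0)
```

Using cell centers to decide membership is the crudest possible rule; a real implementation would more likely lean on a zonal statistics tool and weight cells that straddle a polygon boundary.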

knaaptime commented 3 years ago

actually, a 4th option, riffing on 3, would be to include auxiliary data to mask out uninhabited regions of source_df, then randomly drop points in the inhabited areas and assign them values from source_df, then drop random points in the uninhabited areas and assign them all 0, and estimate on that "surface"
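
A rough sketch of what generating those training points could look like, in pure Python. The assumptions are loud ones: axis-aligned rectangles stand in for both the source_df polygons and the auxiliary "inhabited" mask, the function and variable names are made up, and points are sampled uniformly over each source unit and then labelled by the mask, which is a simplification of dropping points separately in inhabited and uninhabited areas.

```python
import random


def in_rect(x, y, rect):
    """True if (x, y) lies inside rect = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect
    return xmin <= x <= xmax and ymin <= y <= ymax


def sample_training_points(source_units, inhabited, n_per_unit, seed=0):
    """Drop random points in each source unit and label them for model fitting.

    source_units: list of (rect, value) pairs (hypothetical stand-in for source_df)
    inhabited: list of rects (hypothetical stand-in for the auxiliary mask)
    Returns (x, y, value) tuples, with value 0.0 outside inhabited areas.
    """
    rng = random.Random(seed)
    points = []
    for (xmin, ymin, xmax, ymax), value in source_units:
        for _ in range(n_per_unit):
            x = rng.uniform(xmin, xmax)
            y = rng.uniform(ymin, ymax)
            populated = any(in_rect(x, y, r) for r in inhabited)
            points.append((x, y, value if populated else 0.0))
    return points


source_units = [((0.0, 0.0, 10.0, 10.0), 40.0), ((10.0, 0.0, 20.0, 10.0), 15.0)]
inhabited = [(0.0, 0.0, 5.0, 10.0), (12.0, 2.0, 18.0, 8.0)]
training = sample_training_points(source_units, inhabited, n_per_unit=50)
```

The resulting (x, y, value) tuples would then be the observations passed to the kriging model. For counts, giving every point the full source value probably isn't right, so something like a density would need to be assigned instead.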

knaaptime commented 3 years ago

https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1538-4632.2004.tb01135.x

pysal / tobler

[WIP] start kriging module #140