mmaelicke / scikit-gstat

Geostatistical variogram estimation expansion in the scipy style
https://mmaelicke.github.io/scikit-gstat/
MIT License

Easier support for raster data: custom bins, user-defined maxlag, return count, etc... #111

Closed rhugonnet closed 3 years ago

rhugonnet commented 3 years ago

Hi there!

First, thanks for all the work on scikit-gstat :+1:. It's great and I'm very happy to see geostatistics emerging in Python! I used skgstat quite a bit recently (moving away from R) and cited it in our last paper! :wink:

I work a lot with satellite data, and quantifying spatial correlations is crucial there too, especially for uncertainty estimation. However, due to the large number of samples, we can't calculate pairwise differences for ~ten million points in an image. We have to subsample intelligently, and sometimes combine variograms from different grids.

For this, there are still a few shackles in skgstat that make things a bit difficult. I intended to submit PRs with some proposed changes, but I think it's better to first present these issues so we can discuss if and which changes would be best!

  1. Choice of maxlag

If I want to subsample different grids, or subsample a grid with too many samples several times, it is currently not convenient to aggregate results over a common bin distribution, because the maxlag (and therefore all the bins) is forced by skgstat to the maximum pairwise distance of each subsample (note: we need to pass a maxlag larger than the grid size to ensure we use all samples). Example below:

import numpy as np
import skgstat

# Big grid of random data
vals = np.random.normal(0,1,size=(1000, 1000))

# Coordinates
x = np.arange(0,1000)
y = np.arange(0,1000)
xx, yy = np.meshgrid(x,y)

# Flatten everything because we don't care about the 2D at this point
coords = np.dstack((xx.flatten(), yy.flatten())).squeeze()
vals = vals.flatten()

# Run a variogram

# With a million samples, we can wait until next year...
# V = skgstat.Variogram(coordinates=coords, values=vals, normalize=False)

# Let's subsample instead, but we might need to sample from different subgrids, or we might want to run the sampling
# several times to get more robust results
nrun = 10

list_bins = []
for i in range(nrun):
    # Let's keep it to two thousand samples for speed
    subset = np.random.choice(len(vals), 2000, replace=False)
    coords_sub = coords[subset, :]
    vals_sub = vals[subset]

    # Let's use a maxlag of 2000 which is larger than the maximum grid distance
    V = skgstat.Variogram(coordinates=coords_sub, values=vals_sub, normalize=False, maxlag=2000)

    list_bins.append(V.bins)

maxlags = [list_bins[i][-1] for i in range(nrun)]
print(maxlags)

[1386.8327224290606, 1392.293431716174, 1376.7370845589944, 1383.8731878318908, 1388.1156291894417, 1385.964646013743, 1371.1458711603227, 1372.5906163164602, 1393.026202194345, 1385.9552662333658]

This is an easy fix really: changing this in binning.py:

    if maxlag is None or maxlag > np.nanmax(distances):
        maxlag = np.nanmax(distances)

into:

    if maxlag is None: #or maxlag > np.nanmax(distances):
        maxlag = np.nanmax(distances) 
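
With that change, passing maxlag=2000 gives identical bins across runs, so the empirical variograms of the different subsamples can be aggregated, for instance with a count-weighted mean. A minimal sketch of what that could look like (reusing vals, coords and nrun from the snippet above; V.experimental holds the per-bin semivariance):

# Sketch of aggregating empirical variograms from several subsampling runs,
# assuming all runs share the same bins once maxlag is respected
sum_gamma = None
sum_count = None
for i in range(nrun):
    subset = np.random.choice(len(vals), 2000, replace=False)
    V = skgstat.Variogram(coordinates=coords[subset, :], values=vals[subset],
                          normalize=False, maxlag=2000)
    count = np.fromiter((g.size for g in V.lag_classes()), dtype=int)
    gamma = np.asarray(V.experimental)
    if sum_gamma is None:
        sum_gamma, sum_count = np.zeros_like(gamma), np.zeros_like(count)
    sum_gamma += np.nan_to_num(gamma) * count
    sum_count += count

# Count-weighted mean per bin (empty bins stay at 0)
mean_gamma = sum_gamma / np.maximum(sum_count, 1)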

But there are other issues which are closely related, so I think it's worth looking at the bigger picture first.

  2. Custom binning

With satellite data, we tend to have instrument noise that plagues the structure of our data. Because of this, we sometimes see variograms with correlations at several scales. For example, a 5 m gridded satellite image might show: a first (large) sill at a range of 15 meters linked to the native resolution of the data, a second sill at 500 meters due to short-range instrument noise, and a third at 5 km due to long-range instrument noise.

To better work with those correlations at different ranges, it would be good to be able to pass user-defined, custom bins. Maybe through bin_func? Or through an additional argument? (bin_func is currently a string that defines the bin distribution from the distances, but it could be made into a Union[str, Iterable] so that an np.ndarray could be passed as custom bin edges!)
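
To make the idea concrete, here is a rough sketch of what custom bins would give, computed by hand outside skgstat (reusing coords_sub and vals_sub from the snippet above; the bin edges are purely illustrative):

import numpy as np
from scipy.spatial.distance import pdist

# Hand-rolled custom bin edges: fine bins at short lags, coarse bins at long lags
custom_edges = np.concatenate([
    np.linspace(1, 10, 10),      # fine bins for very short lags
    np.linspace(20, 100, 9),     # medium bins
    np.linspace(200, 1400, 7),   # coarse bins up to the grid extent
])

dists = pdist(coords_sub)                       # pairwise distances
sq_diffs = pdist(vals_sub.reshape(-1, 1)) ** 2  # squared pairwise differences
bin_idx = np.digitize(dists, custom_edges)      # assign each pair to a bin

# Matheron estimator evaluated on the custom bins
gamma = np.array([
    0.5 * np.mean(sq_diffs[bin_idx == i]) if np.any(bin_idx == i) else np.nan
    for i in range(1, len(custom_edges))
])

If bin_func accepted such an array directly, this bookkeeping would live inside the Variogram class and stay consistent with maxlag.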

  3. Fitting a sum of models

Based on what was said above, for our applications we often need to fit a sum of variogram models to get a good representation of the spatial correlation at different scales (e.g., https://www.frontiersin.org/files/Articles/566802/feart-08-566802-HTML-r1/image_m/feart-08-566802-g009.jpg, https://www.nature.com/articles/s41586-021-03436-z/figures/9).

This is something we are currently working on implementing in our package xdem: https://github.com/GlacioHack/xdem/blob/main/xdem/spstats.py#L531 (early stages).

But, if this kind of feature feels useful for skgstat, I think it would make sense to have it available directly here as well!
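
For reference, a minimal standalone sketch of fitting a sum of two spherical models with scipy.optimize.curve_fit (not skgstat API; bins and gamma below stand in for V.bins and V.experimental, and all parameter values are made up):

import numpy as np
from scipy.optimize import curve_fit

def spherical(h, r, c):
    """Spherical model with range r and partial sill c (no nugget)."""
    h = np.asarray(h, dtype=float)
    return np.where(h < r, c * (1.5 * h / r - 0.5 * (h / r) ** 3), c)

def double_spherical(h, r1, c1, r2, c2):
    # Sum of a short-range and a long-range spherical structure
    return spherical(h, r1, c1) + spherical(h, r2, c2)

# Synthetic empirical variogram standing in for V.bins / V.experimental
rng = np.random.default_rng(0)
bins = np.linspace(10, 1000, 25)
gamma = double_spherical(bins, 50, 0.6, 600, 0.4) + rng.normal(0, 0.02, bins.size)

# Fit with loose bounds; p0 orders the two ranges to help convergence
popt, _ = curve_fit(double_spherical, bins, gamma,
                    p0=[30, 0.5, 500, 0.5],
                    bounds=([1, 0, 1, 0], [1000, 5, 1000, 5]))
print(popt)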

Other minor issues (based on experience from the past year; I hope these are still relevant):

  4. Direct access to the sample "count"

If I'm not mistaken, right now it can only be re-derived using:

count = np.fromiter((g.size for g in V.lag_classes()), dtype=int)

It would be practical to have a count attribute storing this directly.
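
For now, that line can be wrapped in a small helper (pair_count is just an illustrative name); exposed as a property on Variogram, it would stay in sync whenever the lag classes change:

import numpy as np
import skgstat

# Current workaround, wrapped as a helper; a count attribute/property on
# Variogram (not in skgstat yet) would expose the same information directly.
def pair_count(V: skgstat.Variogram) -> np.ndarray:
    """Number of point pairs falling into each lag class of a Variogram."""
    return np.fromiter((g.size for g in V.lag_classes()), dtype=int)

# pair_count(V) has the same length as V.bins and V.experimental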

  5. An argument to not run the model fit by default, for example when one only wants to sample the empirical variogram (or several of them) and save the related processing time.

To wrap things up: if those issues are of interest, I could push proposed changes as PRs for 1/2/4 quite rapidly! I'll let you tell me what's best. And thanks again for the great work!

mmaelicke commented 3 years ago

Thanks a lot for the feedback!

  1. Yeah, the maxlag still needs some refactoring, especially since the distance matrix is now managed internally by another class (MetricSpace). There are also still some open issues regarding MetricSpace and maxlag (see #101). I can't really spot the difference in your code sample though ... ;) So, whatever you propose, I would check that issue so that any fix there does not break your contribution (or the other way around).

  2. That's something that has literally been on my todo list for years. I think the docs even claim it is possible, but it's not. In fact, bin_func is a property and its setter calls the set_bin_func method: https://github.com/mmaelicke/scikit-gstat/blob/270f4725d76825895593fe054088e725810f1de4/skgstat/Variogram.py#L526 Thus, this function can easily be extended to take iterables. It would be awesome to see this finally implemented. It actually takes callables, so you could maybe pass a lambda, but that would remain an ugly hack, and extending the method is definitely the better approach. It would need another elif in this if-block: https://github.com/mmaelicke/scikit-gstat/blob/270f4725d76825895593fe054088e725810f1de4/skgstat/Variogram.py#L679 (a rough sketch of such a branch follows after this list).

  3. Give me a little bit of time to read the resource and wrap my head around it ;)

  4. Great idea! I would recommend using a property, to stay updated whenever the user changes something on the Variogram class that can alter the counts.

  5. Yeah, that's a downside of my implementation. Up to now, I never ran into a situation where the fitting was the bottleneck, but it is not hard to imagine one. I would prefer an argument that actively suppresses the fitting, so the default behavior would stay automatic fitting.
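
Regarding point 2, a simplified standalone sketch of what the proposed extra branch could do (parse_custom_bins is only an illustrative name; the real change would live inside Variogram.set_bin_func):

import numpy as np

# If bin_func is neither a string nor a callable, interpret it as explicit,
# user-defined bin edges (proposed behaviour, not yet in skgstat).
def parse_custom_bins(bin_func):
    if isinstance(bin_func, (list, tuple, np.ndarray)):
        edges = np.asarray(bin_func, dtype=float)
        if edges.ndim != 1 or np.any(np.diff(edges) <= 0):
            raise ValueError('custom bin edges must be 1D and strictly increasing')
        return edges
    raise TypeError('bin_func must be a string, a callable, or an iterable of bin edges')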

I am looking forward to your PRs, thanks a lot!

BTW: We also recently added a ProbabilisticMetricPair class, which only samples a subset of large input data in a probabilistic way. Maybe that is of interest for your use case as well. And by we, I mean @redhog. He's maybe a little bit more into your use case.

Looking forward to your contributions! Best

Mirko

redhog commented 3 years ago

Ok, so I might not have understood 100% of what you are trying to achieve, but here are my 5 cents:

You should be able to just use ProbabilisticMetricPair for the random subsampling:

coordinates = np.column_stack([a.flatten() for a in np.meshgrid(np.arange(image.shape[0]), np.arange(image.shape[1]))])
coordinates = ProbabalisticMetricSpace(coordinates, max_dist=500, samples=0.1)
variogram = Variogram(coordinates, image.flatten())

However, after some more consideration, I think it would be advantageous to write a new class, because your input set of coordinates is dense - generating/storing the meshgrid output is quite wasteful. Such a class would be very similar to ProbabilisticMetricPair, but would only store the image width and height internally.

redhog commented 3 years ago

Btw, your use case is a bit similar to ours - our input is a dense triangulated surface of measurement points (e.g. surface elevation), typically over an area orders of magnitude larger than the lag. Don't hesitate to ping me if you have problems building your optimizations or want feedback!

rhugonnet commented 3 years ago

Thanks @redhog for the help!

I went through the ProbabilisticMetricPair class. To check I got everything right: basically, it uses random.choice to subsample the coordinates completely at random in N-D space?

If this is the case, then it is indeed what we do most of the time before running the Variogram class. Good to know this exists, so we can use it in skgstat directly instead of doing it on our side :smile:!

However, I recently realized this is not always optimal: sampling a grid completely at random always leads to a certain density of points/clustering (like in the first figure here: https://scipython.com/blog/poisson-disc-sampling-in-python/), which can make it hard to constrain the empirical variogram for some categories of lags. This is typically the case for very small lags (only a few directly adjacent grid points get sampled) and for large lags (few sampled points have pairwise distances on the order of the grid size). Increasing the number of samples is not a solution on large grids (e.g., >1 million points), because above 10,000 samples the pairwise differences of the empirical variogram already take several minutes to compute.

My idea to remedy this was to not subsample completely at random, but rather according to a certain spatial distribution. For instance, in 2D, subsample iteratively in rings of increasing radius (e.g., start by sampling many small disks to resolve the variance at small lags, then many medium-size rings for medium lags, and finally big rings for large lags). The center of the ring would also be sampled randomly several times across the grid. This way one only needs a couple thousand points to get enough samples at similar pairwise distances, which is sufficient to significantly improve the empirical variogram at the spatial lags that are problematic with fully random sampling (or at any spatial lag, really). I hope this is not too badly explained!

It's a bit of a "hacky" solution, but for big grid data it can lead to better sampling at all lags, and in a much shorter time (maybe a hundredth, or even a thousandth, of the computing time of random selection; it depends a lot on the grid size). Maybe you know of a less hacky, existing approach to tackle this?!
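
To make the idea concrete, a purely illustrative numpy sketch of the ring sampling (no skgstat involved; ring radii and sample sizes are arbitrary):

import numpy as np

# Pick a random center on the grid for each ring, then keep grid points whose
# distance to that center falls inside rings of increasing radius, so that each
# range of lags is sampled by a comparable number of points.
rng = np.random.default_rng(42)
nx = ny = 1000
xx, yy = np.meshgrid(np.arange(nx), np.arange(ny))
coords = np.column_stack((xx.ravel(), yy.ravel()))

rings = [(0, 10), (10, 100), (100, 1000)]   # (inner, outer) radii, illustrative
n_per_ring = 500

samples = []
for inner, outer in rings:
    center = coords[rng.integers(len(coords))]
    d = np.linalg.norm(coords - center, axis=1)
    candidates = np.flatnonzero((d >= inner) & (d < outer))
    take = min(n_per_ring, candidates.size)
    samples.append(rng.choice(candidates, take, replace=False))

subset = np.unique(np.concatenate(samples))   # indices into the flattened grid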

redhog commented 3 years ago

Yes, it subsamples coordinates randomly. This means that it does not subsample coordinate pairs randomly, however. What it does is sample all coordinates twice, into two sets L and R, and then produce all distance pairs between these two sets.

When working with a raster, you could probably do better than this.

The only property that your class has to implement is

@property
def dists(self):

This is supposed to return a scipy.sparse.spmatrix (optimally in CSR or CSC format), where the column and row indexes both correspond to indexes into your values array for the variogram.

Since your "sampling" is just generating coordinate pairs where x and y are ints in the range (0, image.shape[0]) and (0, image.shape[1]) respectively, what you describe should be pretty straight forward to implement in numpy & scipy.

You probably have to tell your class what the original image dimensions were, since values is not available in a MetricSpace, and would need to be flattened anyway...

redhog commented 3 years ago

I would suggest making a RasterSamplingMetricSpace class that can be parameterized with different ways to sample the space as you describe.
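
For what it's worth, a bare-bones sketch of such a class, implementing only the dists property described above (standalone for illustration; a real implementation would presumably subclass skgstat's MetricSpace, and the purely random pair sampling here is just a placeholder for a smarter, e.g. ring-based, strategy):

import numpy as np
from scipy import sparse

class RasterSamplingMetricSpace:
    def __init__(self, shape, samples=1000, seed=None):
        self.shape = shape            # only store the raster dimensions
        self.samples = samples
        self._rng = np.random.default_rng(seed)
        self._dists = None

    @property
    def dists(self):
        """Sparse pairwise distances; row/col indices point into the flattened raster."""
        if self._dists is None:
            n = self.shape[0] * self.shape[1]
            # sample two sets of flat indices and compute the distance of each pair
            left = self._rng.integers(n, size=self.samples)
            right = self._rng.integers(n, size=self.samples)
            lx, ly = np.unravel_index(left, self.shape)
            rx, ry = np.unravel_index(right, self.shape)
            d = np.hypot(lx - rx, ly - ry)
            self._dists = sparse.coo_matrix((d, (left, right)), shape=(n, n)).tocsr()
        return self._dists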

redhog commented 3 years ago

I did some experimentation, and realized something: you can't get an even sample of distances, of locations, and of distances at each location, all at the same time:

The really long distances can only have their endpoints closer to the sides of the grid.

So you can have an even sample of points, but then short distances are more common. Or you can have an even sample of distances, with all distances well sampled for any one location, but then your center is undersampled. Or you can have an even sample of distances, with the center over-sampled for short distances and under-sampled for long distances.

Example of an even sample of distances with an under-sampled center: [images omitted]