mmaelicke / scikit-gstat

Geostatistical variogram estimation expansion in the scipy style
https://mmaelicke.github.io/scikit-gstat/
MIT License
223 stars 53 forks source link

Haversine Distance for Geographic Coordinates #108

Open soupoder opened 3 years ago

soupoder commented 3 years ago

I am wondering if its possible to generate a variogram using haversine/great-circle as the distance metric? This doesn't seem to be supported by scipy.spatial.distance.pdist but would be helpful for working directly with geographic coordinates. Is it possible to supply a user defined function to calculate haversine distance?

mmaelicke commented 3 years ago

Hey there, Thanks for your question. Well, while haversine is not implemented into scikit-gstat, The Variogram.dist_func argument does take a callable. The function signature has to be like:

def haversine(p1, p2) -> float:
  pass

Where p1, p2 are iterated along the Variogram.coordintes numpy array. Thus, if the points are 2D in this case, p1.shape == (1,2)

or with Rosetta code's solution:

from math import radians, sin, cos, sqrt, asin

def haversine(p1, p2):
    # my adaption to fit the function signature  
    lat1, lon1 = p1
    lat2, lon2 = p2   

    # original
    R = 6372.8  # Earth radius in kilometers

    dLat = radians(lat2 - lat1)
    dLon = radians(lon2 - lon1)
    lat1 = radians(lat1)
    lat2 = radians(lat2)

    a = sin(dLat / 2)**2 + cos(lat1) * cos(lat2) * sin(dLon / 2)**2
    c = 2 * asin(sqrt(a))

    return R * c

For performance reasons, you might want to implement a numpy solution.

Maybe as a bit of background and for others that might come across this issue: The haversine formula is pretty straightforward, but tends to become quite imprecise, especially on small scales, which I am personally usually working on. An alternative to haversine is Vincenty's solution, which is more accurate, as far as I know. Finally, with great-circle distance, one has to be aware that you should not rely on the distances calculated, what the variogram actually does. Thus, if used, one should note take the computed effective range as a precise estimation of correlation lengths in a strict sense and use the variogram more as a rough estimation to see patterns.

Finally, I am happy to take a pull request if you want to contribute your solution as a default option to scikit-gstat. Best,

Mirko

soupoder commented 3 years ago

Much thanks for your reply Mirko,

I scrapped the following function to use numpy to run the great circle distance calculation to pass as a callable to dist_func:

`def haversine(p1, p2):

lat1, lon1 = p1
lat2, lon2 = p2    

lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

dlon = lon2 - lon1
dlat = lat2 - lat1

a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

c = 2 * np.arcsin(np.sqrt(a))
km = 6367 * c
return km`

However it appears to be running quite slowly compared to the standard Euclidean distance (running on ~30k coords). Looking into it, I am wondering if this is because euclidean distance is able to take advantage of a cKDTree, whereas my custom function is doing a brute search? If so I am wondering if there would be any way to use something like sklearn.neighbors.BallTree (which supports haversine distance) to speed up the process?

mmaelicke commented 3 years ago

Hey there,

Yes, the Variogram class is using an instance of skgstat.MetricSpace to handle coordinates. In case you need to calculate the same set of coordinates for different Variogram over and over again, it is also possible to first create the MetricSpace and then pass it as the coordinates attribute. This will prevent re-calculations. This first creation will, however, still calculate the full distance matrix and take its time for 30k coords.

I think, generally, it should be possible to use a BallTree, however, just passing it is not possible. We would need to tweak the MetricSpace. The cDKTree is created here:

https://github.com/mmaelicke/scikit-gstat/blob/2819ab4a2d0ee6def3665a08e00f7d805ad3f6f8/skgstat/MetricSpace.py#L124

while the class is explicitly checking for euclidean distance here:

https://github.com/mmaelicke/scikit-gstat/blob/2819ab4a2d0ee6def3665a08e00f7d805ad3f6f8/skgstat/MetricSpace.py#L116

to raise an Error if a MetricSpace.tree is accessed and here:

https://github.com/mmaelicke/scikit-gstat/blob/2819ab4a2d0ee6def3665a08e00f7d805ad3f6f8/skgstat/MetricSpace.py#L138

to check whether a maximum distance is set. Otherwise, pdist would be used.

My first thoughts here:

The last check (line 138) however, would need some adaptions to the logic. Here, the existing check could be kept and a new elif is needed to check for haversine or anything else BallTree supports. The else could then be kept as is. This would imply that haversine is always calculated with BallTree, thus a quick benchmark of the numpy solution vs BallTree solution would be needed, otherwise, the elif has to include the check for self.max_distas well (line 138)

If you would like to contribute this, I am happy to take any pull request, otherwise, I can try to implement this soon and then a review and test by you would be highly appreciated. @redhog do you have any thoughts, ideas or opinions on this?

Best Mirko

MuellerSeb commented 3 years ago

My two cents on this topic: Be careful with directly using the haversine distance in the variogram model. The surface of the sphere is a compact manifold (mathematically speaking) and the resulting model when using the haversine distance is not valid anymore (conditional negative semidefinite - CNSD as phrased by Webster 2007).

To overcome this problem, one could use the associated Yadrenko-model for a given spatial 3D variogram model. This uses the chordal distance derived from the geographical coordinates.

We had several discussions about that in PyKrige and GSTools:

When using haversine in standard models and using them later on for building up a kriging matrix, this could result in undefined behavior since there the "conditional negative semidefinite" part is crucial.

Cheers, Sebastian

MuellerSeb commented 3 years ago

@mmaelicke you could think about implementing a switch to use the assoziated yadrenko model, that is able to use the haversine distance:

Starting with a valid isotropic model in 3D, where the covariance between two points is given as a function of their distance:

cov(x1, x2) = func(dist(x1, x2))

We can construct the related yadrenko model on the sphere, that takes the great-circle distance zeta(x1, x2) with the following modification:

cov_sph(x1, x2) = func(2 * sin(zeta(x1, x2) / 2))

So instead of passing the distance d to the variogram, you should pass 2 * np.sin(d/2) when working with the haversine distance. This could also increase the interoperability between GSTools and scikit-gstat :wink:

PS: See the GSTools tutorials for more informations: https://geostat-framework.readthedocs.io/projects/gstools/en/stable/examples/08_geo_coordinates/index.html

mmaelicke commented 3 years ago

Hey, @MuellerSeb thanks for all the insights! I wasn't aware of these implications, that is very good to know. I think I need to look into the paper in some more detail. Just for clarification: When using haversine, do I need to replace the theoretical variogram model with the Yadrenko model, or does it adapt modeled covariances (so kind of wrapping the used model)? In any case, I would have to work with covariances, right? Or can I apply it to semi-variances, as well? Cheers

MuellerSeb commented 3 years ago

It applies to semi-variances as well. Yadrenko derived a way to generate a huge family of valid models on the sphere by making use of valid models in 3D. Since all models in scikit-gstat seem to be valid in 3D, you can derive the associated yadrenko models for all of them by adopting the passed distance in the variogram function.

You can also think this way: When you have lat-lon coordinates, you can derive the spatial point in 3D on a unit-sphere and then calculate the "tunnel distance":

This is ultimately the same as deriving the chordal distance for the great-circle distance.

redhog commented 3 years ago

I might be a tiny bit out of my depth here, but if you want to make a variogram of, and later krige, positions in a non-euclidean 2d space that does have a euclidean embedding in 3d, why would you not convert the coordinates to that 3d embedding first, then do all calculations in euclidean 3d, then convert the results back at the end?

The earth isn't a sphere, so the simple geometrical implementations of great circles above aren't gonna give the correct results, and the user might want to be able to chose char datum dependent on their application/locality...

MuellerSeb commented 3 years ago

That is exactly the reasoning behind the yadrenko models and is also the exact way that GSTools treats lat-lon input. We provide two different dimension attributes: dim for the model dimension (3 here) and field_dim for the actual dimension (2 here)