Performance optimisation of point-to-point distance

martinfleis commented 1 month ago

I've been wondering, if there is a space to optimise the distance measurement when all geometries are simple points. When using shapely and extracting coordinates to compute distance using numpy, I can get about 50x speedup.

See the example of measuring distance between 4k of points as posted in https://github.com/shapely/shapely/issues/2038.

import shapely
import numpy as np

points = shapely.points(np.random.rand(4000, 2))

%%timeit
geos = shapely.distance(points, points.reshape(-1, 1))
# 4 s ± 21.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
# linalg.norm is convenient and commonly used for this
assert (shapely.get_type_id(points) == 0).all()
coords = shapely.get_coordinates(points)
linalg = np.linalg.norm(coords[:, None] - coords, axis=2)
# 328 ms ± 4.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
# but we can also be explicit, and faster
(shapely.get_type_id(points) == 0).all()
coords = shapely.get_coordinates(points)
sqrt = np.sqrt((coords[:, None][:, :, 0] - coords[:,0])**2 + (coords[:, None][:, :,1] - coords[:,1])**2)
# 71.2 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

While we could do this in shapely, maybe there's a space to do it in GEOS?

kadyb commented 1 month ago

In R-spatial, we also noted that some alternatives can be faster than GEOS.

library("geos") # 3.11.1
library("Rfast")

n = 4000
df = data.frame(x = rnorm(n), y = rnorm(n))
pts = as_geos_geometry(sf::st_as_sf(df, coords = c("x", "y")))

geos_dist = function(x) {
  mat = matrix(0.0, nrow = length(x), ncol = length(x))
  for (i in seq_along(x)) mat[i, ] = geos::geos_distance(x[i], x)
  return(mat)
}

bench::mark(
  check = FALSE,
  geos = geos_dist(pts),
  R = as.matrix(dist(df, method = "euclidean")),
  Rfast = Rfast::Dist(df, method = "euclidean")
)
#>   expression      min   median `itr/sec` mem_alloc
#> 1 geos          6.87s    6.87s     0.146    1.01GB
#> 2 R          465.05ms 477.14ms     2.10   671.42MB
#> 3 Rfast       56.47ms  60.69ms    15.2     122.2MB

dr-jts commented 1 month ago

Is this requirement for the specific case of computing a matrix of all distances between points in a set? If so, that's not a use case for which GEOS is designed, and it would require a different kind of API and algorithm.

If I understand Shapely correctly, the loops to compute the distances between all points in the set is done on the Shapely side, not in GEOS.

dr-jts commented 1 month ago

Do you know why the numpy algorithm is so much faster? Is it using vectorized arithmetic? Or is it just avoiding overhead of dealing with geometries rather than raw numbers?

martinfleis commented 1 month ago

Is this requirement for the specific case of computing a matrix of all distances between points in a set?

No, the same applies to many-to-one measurement (distance from each point in array to a single point), with slightly different ratios:

import shapely
import numpy as np

points = shapely.points(np.random.rand(4000, 2))

%%timeit
geos = shapely.distance(points, points[0])
# 992 µs ± 4.92 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
# linalg.norm is convenient and commonly used for this
assert (shapely.get_type_id(points) == 0).all()
coords = shapely.get_coordinates(points)
linalg = np.linalg.norm(coords - coords[0], axis=1)
# 260 µs ± 3.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
# but we can also be explicit, and faster
assert (shapely.get_type_id(points) == 0).all()
coords = shapely.get_coordinates(points)
sqrt = np.sqrt((coords[:, 0] - coords[0][0])**2 + (coords[:, 1] - coords[0][1])**2)
# 221 µs ± 791 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Do you know why the numpy algorithm is so much faster? Is it using vectorized arithmetic? Or is it just avoiding overhead of dealing with geometries rather than raw numbers?

AFAIK Numpy is using vectorized operations here.

It may be the case that this sort of improvement does not belong to GEOS and shall be implemented in upstream where numpy (or some other stuff) can be be efficiently used when dealing with arrays.

dr-jts commented 1 month ago

No, the same applies to many-to-one measurement (distance from each point in array to a single point), with slightly different ratios:

Er, actually my comment was slightly too specific. The key distinction is that these tests are computing many distances, whereas the current GEOS API targets the case of computing a single distance. There's quite a bit of overhead involved with getting from Python to GEOS (I believe), so perhaps that's the source of the performance difference.

AFAIK Numpy is using vectorized operations here.

It may be the case that this sort of improvement does not belong to GEOS and shall be implemented in upstream where numpy (or some other stuff) can be be efficiently used when dealing with arrays.

Yes, that sounds like the best approach.

martinfleis commented 1 month ago

Cool thanks! Will close it here and follow up on this in shapely.

dbaston commented 1 month ago

AFAIU Shapely already has the points stored as GEOS geometries, so I wouldn't expect much overhead in doing an operation on them. I think #1067 should make a noticeable difference here.

libgeos / geos

Performance optimisation of point-to-point distance #1066