kadyb / vector-benchmark

Vector processing benchmarks for Python and R packages
https://kadyb.github.io/vector-benchmark/report.html
MIT License

Clarification of benchmarks #4

Open martinfleis opened 2 years ago

martinfleis commented 2 years ago

Hi,

I'll make a PR changing some of the geopandas benchmarks to more performant versions, but before that I'd like to ask for some clarifications. I understand that the benchmarks are artificial, but before I start coding I want to make sure I understand what the main goal is.

  1. distance
    • you are trying to get an NxN matrix with pairwise distances between all points (both ways?), right?
  2. sample
    • I truly don't understand what this is trying to do :D. Are you trying to get n random points that are within the polygon? Sort-of Monte Carlo simulation?

I think I understand the rest.

martinfleis commented 2 years ago

I think I got it. See #5

kadyb commented 2 years ago

Personally, I wanted to focus on comparing the functions available in packages from a user's perspective, rather than writing the most efficient alternatives. I also think we should compare similar functions in terms of features ({sf} as a reference?). I know it's possible to write efficient code using e.g. {Rcpp}, {GEOS} and {data.table}, but I think that's beyond the reach of the vast majority of users.

> distance: you are trying to get an NxN matrix with pairwise distances between all points (both ways?), right?

Exactly!
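
For readers following along, the "both ways" NxN result can be sketched in plain Python. This is only an illustration of the expected output shape, not the benchmark code itself; the function name `pairwise_distances` and the sample coordinates are made up for the example, and planar (Euclidean) coordinates are assumed.

```python
import math


def pairwise_distances(points):
    """Return an N x N nested list of Euclidean distances between all points.

    Hypothetical helper for illustration; assumes planar (x, y) coordinates.
    """
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            # Distance is symmetric, so fill both (i, j) and (j, i).
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


# Made-up sample coordinates.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
dist = pairwise_distances(points)
```

The diagonal holds zeros and the matrix is symmetric, so `dist[i][j]` and `dist[j][i]` carry the same value.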

> sample: I truly don't understand what this is trying to do :D. Are you trying to get n random points that are within the polygon? Sort-of Monte Carlo simulation?

Not quite a Monte Carlo simulation. I think sampling points in polygons is a standard practice in GIS :P Later, the coordinates can be retrieved from these geometries, or they can be used to extract values from a raster. Please check out sf::st_sample() as a reference. Ideally, you would implement this as a function in {geopandas}.
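
To make the idea concrete, one common way to sample random points inside a polygon is rejection sampling: draw points uniformly in the bounding box and keep those that fall inside. The sketch below is plain Python under that assumption; it is not a claim about how sf::st_sample() is implemented, and `point_in_polygon` and `sample_points` are hypothetical names used only for this example.

```python
import random


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside a simple polygon ring."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def sample_points(polygon, n, seed=None):
    """Return n random (x, y) points inside the polygon via rejection sampling.

    Assumes a simple polygon with non-zero area (otherwise the loop never ends).
    """
    rng = random.Random(seed)
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    samples = []
    while len(samples) < n:
        x = rng.uniform(min_x, max_x)
        y = rng.uniform(min_y, max_y)
        if point_in_polygon(x, y, polygon):
            samples.append((x, y))
    return samples


# Made-up triangle with vertices (0, 0), (4, 0), (0, 4).
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
samples = sample_points(triangle, 100, seed=42)
```

The resulting coordinates could then be turned into point geometries or used to look up raster values, as described above.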

martinfleis commented 2 years ago

> Personally, I wanted to focus on comparing the functions available in packages from a user's perspective, rather than writing the most efficient alternatives.

Yup, I've used only functions that are available. As you can see from the discussion on intersects, there could be even faster options.

> compare similar functions in terms of features ({sf} as a reference?)

As far as I know, the intersects in sf uses a spatial index under the hood, which is why I opted to use one as well. But I understand if you ignore that solution :).
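
To illustrate why a spatial index matters for an intersects benchmark, here is a minimal plain-Python sketch. It uses a uniform grid over bounding boxes as a stand-in for the tree-based indexes that real libraries use, so it is not the sf or geopandas implementation; `GridIndex`, `boxes_intersect` and the sample boxes are hypothetical and exist only for this example.

```python
from collections import defaultdict


def boxes_intersect(a, b):
    """True if two (min_x, min_y, max_x, max_y) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class GridIndex:
    """Toy spatial index: buckets bounding boxes into square grid cells."""

    def __init__(self, boxes, cell_size):
        self.boxes = boxes
        self.cell_size = cell_size
        self.cells = defaultdict(list)
        for idx, box in enumerate(boxes):
            for cell in self._cells_for(box):
                self.cells[cell].append(idx)

    def _cells_for(self, box):
        """Yield every grid cell that the box touches."""
        s = self.cell_size
        x0, y0 = int(box[0] // s), int(box[1] // s)
        x1, y1 = int(box[2] // s), int(box[3] // s)
        for cx in range(x0, x1 + 1):
            for cy in range(y0, y1 + 1):
                yield (cx, cy)

    def query(self, box):
        """Return sorted indices of stored boxes that intersect the query box."""
        candidates = set()
        for cell in self._cells_for(box):
            candidates.update(self.cells.get(cell, ()))
        # Exact check only on the candidates, not on every stored box.
        return sorted(i for i in candidates if boxes_intersect(self.boxes[i], box))


# Made-up bounding boxes.
boxes = [(0, 0, 1, 1), (5, 5, 6, 6), (0.5, 0.5, 2, 2)]
index = GridIndex(boxes, cell_size=2.0)
hits = index.query((0.8, 0.8, 1.2, 1.2))
```

The index narrows the exact test down to boxes sharing a grid cell with the query, instead of comparing against all N boxes. That pruning is the reason indexed and non-indexed intersects are not a like-for-like comparison.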

> Ideally, you would implement this as a function in {geopandas}.

We don't have anything like this right now, but the code I used in #5, replacing your custom loop, is likely quite close to what it would look like if we had it (I'll open an issue to add it in the future).

kadyb commented 2 years ago

> As far as I know, the intersects in sf uses a spatial index under the hood, which is why I opted to use one as well. But I understand if you ignore that solution :).

My mistake, in that case {geopandas} should also use spatial indexes. Not sure if {terra} works the same way, but I believe it does. Edit: {terra} doesn't use spatial indexes.

By "compare similar functions in terms of features", I meant that the functions in {terra} and {sf} have more options (arguments), so I suspect there will be some overhead (but probably negligible) from the extra condition checks and transformations.