Open martinfleis opened 2 years ago
This is the result coming from my environment that includes pygeos, with no changes to the code (some of which will also be significant). report.html.zip
Thanks for the comment and your results! Generally, I don't expect super performance from Python and R - this is the domain of low-level languages. My idea was a simple comparison of packages for vector data processing without code optimization, i.e. I used simple functions available in the packages.
I used _Pop!OS 20.04 LTS system (based on Ubuntu 20.04 Focal Fossa) and the software available in the repository by default. I downloaded Python packages from PIP and R packages from CRAN.
I didn't use {pygeos}
. However, correct me if I'm wrong, doesn't {pygeos}
use multithreading by default, hence the speedup? All tested R packages are single-threaded, so such a comparison would be unfair. There are separate packages, eg sfurrr, that allow parallel computation, or you can write code yourself, but I did not include these cases in this benchmark. Here is a FR for {terra}
, but still not implemented.
I'm surprised how much the distance calculation performance has improved in particular, nice.
Here more information about the environment used. Let me know if anything more is needed.
> apt list --installed | grep libgeos
libgeos-3.8.0/focal,now 3.8.0-1build1 amd64 [installed,automatic]
libgeos-c1v5/focal,now 3.8.0-1build1 amd64 [installed,automatic]
libgeos-dev/focal,now 3.8.0-1build1 amd64 [installed,automatic]
> python3 -VV
Python 3.8.10 (default, Nov 26 2021, 20:14:08)
[GCC 9.3.0]
terra::gdal(lib = "all")
#> gdal proj geos
#> "3.0.4" "6.3.1" "3.8.0"
sf::sf_extSoftVersion()
#> GEOS GDAL proj.4 GDAL_with_GEOS USE_PROJ_H PROJ
#> "3.8.0" "3.0.4" "6.3.1" "true" "true" "6.3.1"
geos::geos_version()
#> [1] ‘3.10.0’
However, correct me if I'm wrong, doesn't {pygeos} use multithreading by default, hence the speedup?
No, it doesn't. Dask-geopandas would but pygeos is single-threaded, but vectorized. It is going to be shapely 2.0 and once released as such, a default geometry engine in geopandas. At the moment it is treated as experimental (though stable).
Thanks for the clarification! Honestly, I've never used {pygeos}
, I've always used {geopandas}
alone. So it will be added as a default dependency in the near future?
Yes and no :D. GeoPandas' default geometry engine is shapely. And pygeos has been integrated to shapely. So while we will never require pygeos to be installed explicitly, it will be factually installed when you install shapely 2.0 (to be released soon-ish, 95% of work is done). It is a long process aimed at consolidation of the ecosystem. Users of geopandas will get the speedup you see on my results for free essentially, without a need to change anything in their code. As you get now, if pygeos is installed.
Great, so the best solution is if I install {pygeos}
now and rerun the benchmark.
Great, so the best solution is if I install {pygeos} now and rerun the benchmark.
Ideally with the changes proposed in #5 as some of the code is not following the ideal pattern now.
@martinfleis, could you check if the results are reproducible for {sf}
and {geopandas}
(in particular, I mean with the new version of {shapely}
)? Do you also recommend removing {pygeos}
now?
The only problem I haven't noticed before is:
sys:1: FutureWarning: The 'cascaded_union' attribute is deprecated, use 'unary_union' instead /home/krzdyb/.local/lib/python3.8/site-packages/geopandas/_vectorized.py:653: UserWarning: Only Polygon objects have interior rings. For other geometry types, None is returned.
when I want to plot the points from sample.py. I see this is related to GEOS 3.3 (https://github.com/shapely/shapely/issues/1001) but I have GEOS 3.8.
apt list --installed | grep libgeos
#> libgeos-3.8.0/focal,now 3.8.0-1build1 amd64 [installed,automatic]
#> libgeos-c1v5/focal,now 3.8.0-1build1 amd64 [installed,automatic]
#> libgeos-dev/focal,now 3.8.0-1build1 amd64 [installed,automatic]
Do you also recommend removing
{pygeos}
now?
Yes, if you ensure to have shapely >= 2.0, then it's best to remove pygeos (otherwise geopandas will still use pygeos for now, giving some overhead in converting between pygeos and shapely)
The only problem I haven't noticed before is: ... when I want to plot the points from sample.py.
What code are you using to plot?
What code are you using to plot?
n = 10
smp = sample(gdf, n)
smp.plot()
But that result is supposed to only contain points, right? Not sure how that can trigger that warning ..
(you get that warning if you try to get the interiors
from a GeoSeries that contains both polygons and non-polygons, and we do call that in the plotting code, but in the latest versions of geopandas, we also first split the input based on the geometry type before plotting the geometries of each type with a custom function for that geometry type. So we should never try to get the interior of points)
Yes, points only. Anyway, the figure looks correct.
Hi, this is a great initiative.
As geopandas is currently in the state of performance migration sort of, the out of the box performance is not necessarily the best one (I'll leave another issue on that). I wanted to check the environment to see if you do have
pygeos
engine installed and what are the versions of GEOS and the libraries but it doesn't seem to be listed.How do you create an environment for these tests?