kadyb / vector-benchmark

Vector processing benchmarks for Python and R packages
https://kadyb.github.io/vector-benchmark/report.html
MIT License

Environment #3

Open martinfleis opened 2 years ago

martinfleis commented 2 years ago

Hi, this is a great initiative.

As geopandas is currently in the middle of a performance-related migration, its out-of-the-box performance is not necessarily the best it can be (I'll open another issue on that). I wanted to check the environment to see whether you have the pygeos engine installed and which versions of GEOS and the other libraries you use, but this doesn't seem to be listed.

How do you create an environment for these tests?
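
For the Python side, a minimal sketch of how such a version listing could be collected with only the standard library. The helper name `installed_versions` and the package list are assumptions chosen for illustration, not something the benchmark ships. If geopandas is installed, `geopandas.show_versions()` prints a fuller report that also covers GEOS.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed package list: the libraries discussed in this thread.
    wanted = ["geopandas", "shapely", "pygeos", "pyproj", "Fiona"]
    for name, version in installed_versions(wanted).items():
        print(f"{name:<10} {version or 'not installed'}")
```

A `None` entry for pygeos would mean that geopandas falls back to its default shapely engine.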

martinfleis commented 2 years ago

This is the result from my environment, which includes pygeos, with no changes to the code (some of which would also be significant). report.html.zip

kadyb commented 2 years ago

Thanks for the comment and your results! Generally, I don't expect super performance from Python and R - this is the domain of low-level languages. My idea was a simple comparison of packages for vector data processing without code optimization, i.e. I used simple functions available in the packages.

I used Pop!_OS 20.04 LTS (based on Ubuntu 20.04 Focal Fossa) and the software available in its repository by default. I installed the Python packages with pip and the R packages from CRAN.

I didn't use {pygeos}. However, correct me if I'm wrong, doesn't {pygeos} use multithreading by default, hence the speedup? All tested R packages are single-threaded, so such a comparison would be unfair. There are separate packages, e.g. sfurrr, that allow parallel computation, or you can write the code yourself, but I did not include these cases in this benchmark. Here is a feature request for {terra}, but it is still not implemented.

I'm surprised how much the distance calculation performance has improved in particular, nice.

kadyb commented 2 years ago

Here is more information about the environment used. Let me know if anything more is needed.

> apt list --installed | grep libgeos
libgeos-3.8.0/focal,now 3.8.0-1build1 amd64 [installed,automatic]
libgeos-c1v5/focal,now 3.8.0-1build1 amd64 [installed,automatic]
libgeos-dev/focal,now 3.8.0-1build1 amd64 [installed,automatic]

> python3 -VV
Python 3.8.10 (default, Nov 26 2021, 20:14:08) 
[GCC 9.3.0]
terra::gdal(lib = "all")
#>    gdal    proj    geos
#> "3.0.4" "6.3.1" "3.8.0"
sf::sf_extSoftVersion()
#>    GEOS     GDAL   proj.4  GDAL_with_GEOS  USE_PROJ_H     PROJ
#> "3.8.0"  "3.0.4"  "6.3.1"          "true"      "true"  "6.3.1"
geos::geos_version()
#> [1] ‘3.10.0’
Python packages

```
> pip list
Package                 Version
----------------------- ----------------------------------
affine                  2.3.0
appdirs                 1.4.4
attrs                   19.3.0
beautifulsoup4          4.8.2
blinker                 1.4
Brlapi                  0.7.0
cachetools              4.2.2
certifi                 2019.11.28
cftime                  1.4.1
chardet                 3.0.4
chrome-gnome-shell      0.0.0
click                   8.0.3
click-plugins           1.1.1
cligj                   0.7.2
cloudpickle             1.6.0
colorama                0.4.3
command-not-found       0.3
cryptography            2.8
cupshelpers             1.0
cycler                  0.10.0
dask                    2021.4.1
datacube                1.8.3
dbus-python             1.2.16
decorator               4.4.2
defer                   1.0.6
distributed             2021.4.1
distro                  1.4.0
entrypoints             0.3
Fiona                   1.8.21
fsspec                  2021.4.0
future                  0.18.2
GDAL                    3.0.4
geocube                 0.0.16
geopandas               0.9.0
gpg                     1.13.1-unknown
greenlet                1.0.0
HeapDict                1.0.1
hidpidaemon             18.4.6
html5lib                1.0.1
httplib2                0.14.0
idna                    2.8
importlib-metadata      1.5.0
ipython-genutils        0.2.0
Jinja2                  2.10.1
jsonschema              3.2.0
jupyter-core            4.6.3
keyring                 18.0.1
kiwisolver              1.0.1
language-selector       0.1
lark-parser             0.11.2
launchpadlib            1.10.13
lazr.restfulclient      0.14.2
lazr.uri                1.0.3
locket                  0.2.1
louis                   3.12.0
lxml                    4.5.0
macaroonbakery          1.3.1
MarkupSafe              1.1.0
matplotlib              3.1.2
more-itertools          4.2.0
msgpack                 1.0.2
munch                   2.5.0
nbformat                5.0.4
netCDF4                 1.5.6
netifaces               0.10.4
numpy                   1.17.4
oauthlib                3.1.0
olefile                 0.46
OWSLib                  0.19.1
packaging               21.2
pandas                  1.2.2
partd                   1.2.0
Pillow                  7.0.0
pip                     20.0.2
plotly                  4.4.1
pop-transition          1.1.2
protobuf                3.6.1
psutil                  5.8.0
psycopg2                2.8.4
pycairo                 1.16.2
pycups                  1.9.73
pydbus                  0.6.0
Pygments                2.3.1
PyGObject               3.36.0
PyJWT                   1.7.1
pymacaroons             0.13.0
PyNaCl                  1.3.0
PyOpenGL                3.1.0
pyparsing               2.4.6
pyproj                  2.5.0
PyQt5                   5.14.1
pyRFC3339               1.1
pyrsistent              0.15.5
python-apt              2.1.2pop0-1587756471-20.04-cd2988e
python-dateutil         2.7.3
python-debian           0.1.36ubuntu1
python-xlib             0.23
pytz                    2019.3
pyxdg                   0.26
PyYAML                  5.3.1
rasterio                1.2.10
rasterstats             0.16.0
repoman                 1.2.2
requests                2.22.0
requests-unixsocket     2.0.2
retrying                1.3.3
rioxarray               0.10.0
scipy                   1.6.3
screen-resolution-extra 0.0.0
SecretStorage           2.3.1
sessioninstaller        0.0.0
setuptools              45.2.0
Shapely                 1.7.1
simplejson              3.16.0
sip                     4.19.21
six                     1.14.0
snuggs                  1.4.7
sortedcontainers        2.3.0
soupsieve               1.9.5
SQLAlchemy              1.4.12
ssh-import-id           5.10
systemd-python          234
tblib                   1.7.0
toolz                   0.11.1
tornado                 6.1
traitlets               4.3.3
ubuntu-advantage-tools  27.6
ubuntu-drivers-common   0.0.0
ufw                     0.36
urllib3                 1.25.8
wadllib                 1.3.3
webencodings            0.5.1
wheel                   0.34.2
wxPython                4.0.7
xarray                  0.17.0
xkit                    0.0.0
zict                    2.0.0
zipp                    1.0.0
```

martinfleis commented 2 years ago

However, correct me if I'm wrong, doesn't {pygeos} use multithreading by default, hence the speedup?

No, it doesn't. Dask-geopandas would, but pygeos is single-threaded; it is just vectorized. It is going to become shapely 2.0 and, once released as such, the default geometry engine in geopandas. At the moment it is treated as experimental (though stable).

kadyb commented 2 years ago

Thanks for the clarification! Honestly, I've never used {pygeos}, I've always used {geopandas} alone. So it will be added as a default dependency in the near future?

martinfleis commented 2 years ago

Yes and no :D. GeoPandas' default geometry engine is shapely, and pygeos has been integrated into shapely. So while we will never require pygeos to be installed explicitly, it will effectively be installed when you install shapely 2.0 (to be released soon-ish, 95% of the work is done). It is a long process aimed at consolidating the ecosystem. Users of geopandas will essentially get the speedup you see in my results for free, without needing to change anything in their code, just as you get it now if pygeos is installed.

kadyb commented 2 years ago

Great, so the best solution is if I install {pygeos} now and rerun the benchmark.

martinfleis commented 2 years ago

Great, so the best solution is if I install {pygeos} now and rerun the benchmark.

Ideally with the changes proposed in #5, as some of the code does not follow the ideal pattern at the moment.

kadyb commented 1 year ago

@martinfleis, could you check if the results are reproducible for {sf} and {geopandas} (in particular, I mean with the new version of {shapely})? Do you also recommend removing {pygeos} now?

The only problem I haven't noticed before is:

sys:1: FutureWarning: The 'cascaded_union' attribute is deprecated, use 'unary_union' instead
/home/krzdyb/.local/lib/python3.8/site-packages/geopandas/_vectorized.py:653: UserWarning: Only Polygon objects have interior rings. For other geometry types, None is returned.

when I want to plot the points from sample.py. I see this is related to GEOS 3.3 (https://github.com/shapely/shapely/issues/1001) but I have GEOS 3.8.

apt list --installed | grep libgeos
#> libgeos-3.8.0/focal,now 3.8.0-1build1 amd64 [installed,automatic]
#> libgeos-c1v5/focal,now 3.8.0-1build1 amd64 [installed,automatic]
#> libgeos-dev/focal,now 3.8.0-1build1 amd64 [installed,automatic]
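
One way to narrow down where such warnings come from is to record them together with the file and line that raised them. Below is a minimal sketch using only the standard `warnings` module; `collect_warnings` is a hypothetical helper and `fake_plot` is a stand-in for `smp.plot()` that merely emits similar warnings, so neither is part of geopandas or the benchmark.

```python
import warnings


def collect_warnings(func, *args, **kwargs):
    """Call func and return (result, [(category, message, filename, lineno), ...])."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, including repeats
        result = func(*args, **kwargs)
    details = [
        (w.category.__name__, str(w.message), w.filename, w.lineno) for w in caught
    ]
    return result, details


def fake_plot():
    # Hypothetical stand-in for smp.plot(): it only emits similar warnings.
    warnings.warn("The 'cascaded_union' attribute is deprecated", FutureWarning)
    warnings.warn("Only Polygon objects have interior rings.", UserWarning)
    return "figure"


if __name__ == "__main__":
    result, details = collect_warnings(fake_plot)
    for category, message, filename, lineno in details:
        print(f"{category} at {filename}:{lineno}: {message}")
```

Running the real script with `python -W error` is an alternative: it turns the first warning into an exception, so the full traceback shows the calling code.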
jorisvandenbossche commented 1 year ago

Do you also recommend removing {pygeos} now?

Yes, if you make sure you have shapely >= 2.0, then it's best to remove pygeos (otherwise geopandas will still use pygeos for now, which adds some overhead from converting between pygeos and shapely objects).
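
As a rough summary of the advice in this thread, the decision can be written as a small function. This is only a sketch: `geometry_engine_advice` is a hypothetical helper, the returned strings are paraphrases of the comments above, and the version check only looks at the major version number.

```python
def geometry_engine_advice(shapely_version, pygeos_installed):
    """Summarise the advice from this thread for a given environment.

    shapely_version: version string such as "1.7.1" or "2.0.1"
    pygeos_installed: True if the pygeos package is installed
    """
    major = int(shapely_version.split(".")[0])
    if major >= 2:
        if pygeos_installed:
            # geopandas may still pick pygeos and pay for conversions.
            return "remove pygeos"
        return "nothing to do"
    if pygeos_installed:
        # With shapely < 2.0, pygeos is what provides the speedup.
        return "keep pygeos"
    return "install pygeos or upgrade shapely to >= 2.0"


if __name__ == "__main__":
    # The environment listed earlier: Shapely 1.7.1, no pygeos.
    print(geometry_engine_advice("1.7.1", False))
```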

jorisvandenbossche commented 1 year ago

The only problem I haven't noticed before is: ... when I want to plot the points from sample.py.

What code are you using to plot?

kadyb commented 1 year ago

What code are you using to plot?

n = 10
smp = sample(gdf, n)
smp.plot()
jorisvandenbossche commented 1 year ago

But that result is supposed to contain only points, right? Not sure how that can trigger that warning... You get that warning if you try to get the interiors from a GeoSeries that contains both polygons and non-polygons. We do call that in the plotting code, but in the latest versions of geopandas we first split the input by geometry type and then plot the geometries of each type with a custom function for that type. So we should never try to get the interiors of points.
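
The split-by-type idea can be sketched in plain Python. This is not the geopandas plotting code: geometries are modelled here as GeoJSON-like dicts (an assumption for illustration), where a Polygon's `coordinates` hold the exterior ring first and any interior rings after it, and both helper names are hypothetical.

```python
from collections import defaultdict


def split_by_type(geometries):
    """Group GeoJSON-like geometry dicts by their "type" field."""
    groups = defaultdict(list)
    for geom in geometries:
        groups[geom["type"]].append(geom)
    return dict(groups)


def polygon_interiors(polygon):
    """Return the interior rings (holes) of a GeoJSON-like Polygon."""
    return polygon["coordinates"][1:]  # ring 0 is the exterior


if __name__ == "__main__":
    mixed = [
        {"type": "Point", "coordinates": (1.0, 2.0)},
        {
            "type": "Polygon",
            "coordinates": [
                [(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)],  # exterior
                [(1, 1), (2, 1), (2, 2), (1, 2), (1, 1)],  # one hole
            ],
        },
    ]
    groups = split_by_type(mixed)
    # Interiors are only requested for the polygon group, never for points.
    for polygon in groups.get("Polygon", []):
        print(len(polygon_interiors(polygon)), "interior ring(s)")
```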

kadyb commented 1 year ago

Yes, points only. Anyway, the figure looks correct.