isciences / exactextract

Fast and accurate raster zonal statistics
Apache License 2.0
246 stars 32 forks source link

Python: Perf thoughts #72

Open kylebarron opened 7 months ago

kylebarron commented 7 months ago

👋 Feel free to ignore me but was just reading through the code

You're currently using the OGR bindings or Fiona to load a GeoJSON FeatureCollection to Python. Have you looked at pyogrio at all? For loading entire files it can be easily 5-10x faster than fiona because it's vectorized instead of a python loop.

In my experience, I've found that serializing GeoJSON between Python objects and native code can be really slow. What would you think about an API that (optionally) returned indices into the passed-in objects instead of the features themselves so that users can skip the overhead of serializing the data back to Python?

It would be great to release the GIL while in pure C++ code, especially if you're able to pass a whole chunk of rasters/vectors at once to C++, so you aren't acquiring and releasing the GIL on every iteration of the loop. Looking at the pybind docs it says

pybind11 will never implicitly release the GIL

so I currently your code always holds the GIL?

If you had interest, I'd love to discuss how exactextract could use GeoArrow. It's an exact candidate for why I think GeoArrow has so much potential: existing binary data in GEOS objects in GeoPandas or Shapely has to serialize to GeoJSON and the GEOS -> GeoJSON -> exactextract -> GeoJSON -> GEOS conversion is really slow. There's ongoing work to handle GeoArrow <--> GEOS interop in C/C++, which would simplify getting geometries back into GEOS for use in core exactextract.

dbaston commented 7 months ago

You're currently using the OGR bindings or Fiona to load a GeoJSON FeatureCollection to Python

If the OGR bindings are available, I'm exporting the geometry as WKB

https://github.com/isciences/exactextract/blob/53309e46b45437447c9e3bb85e03e5a5f97d6380/python/src/exactextract/feature.py#L15-L16

which is then read by GEOS:

https://github.com/isciences/exactextract/blob/53309e46b45437447c9e3bb85e03e5a5f97d6380/python/src/pybindings/feature_bindings.cpp#L152-L156

Have you looked at pyogrio at all? For loading entire files it can be easily 5-10x faster than fiona because it's vectorized instead of a python loop.

I hadn't looked at it, but it appears that you can use it like this:

>>> from pyogrio import read_dataframe
>>> from exactextract import exact_extract
>>> polys = read_dataframe('ne_10m_admin_0_countries.shp')
>>> country_pop = exact_extract('gpw_v4_population_count_rev11_2005_2pt5_min.tif', polys, 'sum')

It would be great to release the GIL while in pure C++ code, especially if you're able to pass a whole chunk of rasters/vectors at once to C++, so you aren't acquiring and releasing the GIL on every iteration of the loop

This isn't something I know much about, but I can l look into it. Having a benchmark would be helpful.

If you had interest, I'd love to discuss how exactextract could use GeoArrow. It's an exact candidate for why I think GeoArrow has so much potential: existing binary data in GEOS objects in GeoPandas or Shapely has to serialize to GeoJSON and the GEOS -> GeoJSON -> exactextract -> GeoJSON -> GEOS conversion is really slow.

I'd be happy to discuss it. To be clear, the conversion happening in this package would only be OGR Geometry -> WKB -> GEOS, unless the user has decided not to use OGR. I didn't spend much time looking at conversion options for fiona and GeoPandas because it hasn't been a significant part of runtime in my testing. I'm guessing it's somehow unsafe to just grab the GEOS geometry pointers directly from a GeoPandas dataframe?

paleolimbot commented 7 months ago

I'm guessing it's somehow unsafe to just grab the GEOS geometry pointers directly from a GeoPandas dataframe?

I think this works on conda but maybe not pypi...I think PyGEOS used to do it with Shapely before the two merged. Shapely has a C API for this but I don't know the details of how/when to use it from another package.

kylebarron commented 7 months ago

I'm guessing it's somehow unsafe to just grab the GEOS geometry pointers directly from a GeoPandas dataframe?

I believe this is only valid when you can ensure that the underlying GEOS libraries are exactly the same, but in general you can't know what version the other library is using.

PyGEOS and Shapely used to have a warning when they checked that the versions of GEOS were not the same that conversions between the two would be slower. Presumably that meant serializing to WKB and deserializing.