isciences / exactextract

Fast and accurate raster zonal statistics
Apache License 2.0
246 stars 32 forks source link

Add Python Bindings #34

Closed jdalrym2 closed 9 months ago

jdalrym2 commented 2 years ago

Hi there! This is a wonderful project and wanted to take the time to create a Python wrapper (#7) to use in my project. I additionally included a set of unit tests and documentation for the wrapper. Let me know if you'd like to get this merged. Thanks!

Jon

dbaston commented 2 years ago

Thank you for this! It'll take me a bit of time to go through this, but overall I'm much in favor. There has been quite a bit of demand for Python bindings.

cc @rsignell-usgs

jdalrym2 commented 2 years ago

Very helpful feedback @hellkite500 ! I'll do some revisions and push the changes soon.

dbaston commented 2 years ago

Some of the questions on my mind are:

hellkite500 commented 2 years ago

Are you suggesting expoing a single function interface which takes some "standard" python representation of inputs and outputs a single dataframe (or possibly a geodataframe?), similar to this one in exactextractr

  • Do classes such as Box, Grid, and Operation have sufficient value to Python users that it is worth exposing them, thereby imposing a stability requirement on their interface?

I think some of these definitely do, especially the Grid.. The Operation class, could be argued that exposing that to Python may not be required, as most Operations could be proxyed to use the user behind some convenience functions/scripts.

That being said, these are part of the public library API, so are you suggesting that libexactextract isn't intended to be stable? Also, we can pin the python version to the library version, and it would just follow the semantic versioning of the lib. If the public API of the C++ classes change, that changes the libexactextract semantic version, forcing a change in the python version (and possible the binding code supporting it). As far as ensuring the python bindings stay up-to-date as the public API of the lib evolves, it requires testing the bindings and updating them as the API they bind changes. Is that level of maintenance a concern here in this repository? We could do that via a separate repository and pin an exactextract git submodule for the stable version of the lib we are binding.

  • Should we deal with types such as gdal.Dataset and gdal.Bandrather than exporting classes such as GDALRasterWrapper to Python? Similarly, would producing a pandas data frame be preferable to writing a CSV?

I would be interested in looking at fiona, geopandas, and python gdal practices and see if working with something a little more "native" would make sense and be practical.

dbaston commented 2 years ago

Are you suggesting expoing a single function interface which takes some "standard" python representation of inputs and outputs a single dataframe (or possibly a geodataframe?), similar to this one in exactextractr

My default implementation design would be a Python function that largely mimics the exact_extract function in R. I think this has been fairly successful in R, so I wouldn't change it without a reason to do so. The implementation of that Python function may use what I consider to be low-level classes (Grid, Operation, etc.) but I would not expect a Python end user to work with these directly. It might make sense to move some of the logic in the R package into this repository; for example, code supporting the re-use of calculated coverage fractions for multiple raster inputs.

are you suggesting that libexactextract isn't intended to be stable?

It has no versioned releases, so I don't think there's a particular stability contract.

hellkite500 commented 2 years ago

It has no versioned releases, so I don't think there's a particular stability contract.

Fair enough, I never really thought about it that way, haha.

Thanks for indulging this discussion, as it helps clear up the various use cases and the best way to serve them. I'm definitely in a weird place thinking through this having spent some time really trying to understand the backend (exactextract) and the the various ways to use it, especially WRT coverage fraction and its extraction/reuse. Would be happy to pick up that discussion on a separate thread.

I'm still kind of curious if there are use cases to take advantage of the Grid abstractions outside the context of just raster stats. @jdalrym2 as the OP of this PR, did you have some intended use case outside what you walked through in the documentation sections of this PR? I think you did essentially what @dbaston was proposing in exposing a python interface to mimic the CLI interface, but I think the proposition is to move that implementation more into a standard C++ function/class and use that to serve to the bindings to both R and Python.

jdalrym2 commented 2 years ago

Sorry I'm late to the conversation.

Is this the right level of detail to expose, or should we prefer a higher-level interface such as the one exposed in exactextractr?

Admittedly my process of binding this was just writing out the bindings for classes until I figured out how to make a workflow that worked. I didn't know enough of the library to make high level decisions like this. I actually didn't know exactextractr existed until I was almost done. I will look into your interface there and form an opinion. It makes a lot of sense to mirror this from a maintainability perspective. My only holdup would be if there are Python antipatterns in your interface there. I'll return to this.

Do classes such as Box, Grid, and Operation have sufficient value to Python users that it is worth exposing them, thereby imposing a stability requirement on their interface?

Operation is needed for the way I'm binding things now (but this can change), but I'd actually agree Box and Grid might be too low-level for this interface (@hellkite500 I'm curious if you have a proposed Python use-case for Grid). I bound those classes first and realized toward the end that they weren't needed for the final interface / workflow.

Should we deal with types such as gdal.Dataset and gdal.Band rather than exporting classes such as GDALRasterWrapper to Python? Similarly, would producing a pandas data frame be preferable to writing a CSV?

I deal with those types with my adapter classes (in the pure Python portion). Handling them directly without an adapter and without exposing GDALRasterWrapper, etc, requires passing off memory addresses through the pybind interface (at least that's the only way I've been able to make it work), and it ended up being too hacky for comfort. Exposing GDALRasterWrapper and writing Python code to help the user make these seemed like the best bet.

Further down the discussion:

It has no versioned releases, so I don't think there's a particular stability contract.

I'm curious if you'd like to start now. Having specific versioned releases could make this a good candidate for hosting on PyPI. Another option would be to move all of this to a separate repo and keeping the main repo as a submodule. The separate repo would have versioned releases tied to specific pinned commit hashes in this repo.

@jdalrym2 as the OP of this PR, did you have some intended use case outside what you walked through in the documentation sections of this PR?

No I didn't. I just started binding things until I got a solid workflow going. Admittedly little forethought. I actually considered removing the bindings that aren't directly in the workflow before submitting this PR, but decided against it pending feedback as to not erase work.

... but I think the proposition is to move that implementation more into a standard C++ function/class and use that to serve to the bindings to both R and Python.

I think this is the ideal move, but I'm definitely hesitant based on how much work this might involve. Perhaps you or @dbaston are rockstar C++ developers that could knock this out in an hour, but for me that'd be much, much longer. ;)

jdalrym2 commented 2 years ago

I'll return to this.

That was actually less to review than I thought. I tend not to be a fan of "all in one go" function calls, in-case there is setup overhead and the user would like to reuse existing classes or state. Idk if that really applies here, but I tend to lean toward a "setup first, then go second" sort of design pattern as opposed to massive one-liners. I don't really think there's a right answer here, and honestly I like the idea of doing both: a high level interface that mirrors exactextractr closely that calls a somewhat lower-level, but still exposed interface in case the user desires more granularity.

jduckerOWP commented 1 year ago

@jdalrym2 I've recently tested the python bindings with a multi-band .tif file, and the processor module does not appear to be correctly processing multiple raster bands within the .tif file together properly when specified. Essentially, when a user initializes multiple rasters and operations for two separate bands within a .tif file, they are not properly processed within the "FeatureSequentialProcessor" step as you call both operations (e.g. FeatureSequentialProcessor(dsw, writer, [op1,op2]) for two separate raster bands. The processor output essentially only yields the first raster band output for both operations. However, when each raster operation is processed separately (op1, op2), they both produce the correct output values (when compared to results with original C+ ExactExtract executable). My only guess to this initial problem is that the band index isn't being properly processed with the "processor.py" module when multiple raster operations are ingested, but I am trying to narrow down potential coding problem now. I just wanted to loop you into this problem and see if you had any suggestions as to what may going on here. I would mention that the functionality for multiple rasters within a netcdf file does work properly for the ExactExtract python bindings, so this issue is isolated only for .tif files currently.

jdalrym2 commented 1 year ago

@jduckerOWP Thanks for the info and for looking into this! FWIW, I might not get to take a look for a couple days, but will follow up if I find anything.

Overall: I agree it's not behaving as expected and I don't believe I have tested my bindings on anything other than a single band so far.

jdalrym2 commented 1 year ago

UPDATE (@jduckerOWP ) I was able to write a unit test to replicate your issue and fix it. Please see commit e278076 for more details.

Interestingly, the bug seemed to be in the exactextract code itself (@dbaston pls look), where Operation objects were keyed in the StatsRegistry by the name of their parent RasterSource (i.e. GDALRasterWrapper). However, in the case of TIF files, the band number did not exist in the name, and so keys clashed for multiple operations that dealt with the same TIF. The solution was simply to append "|[bandnum]" to the name of the parent GDALRasterWrapper object on init.

It is a mystery to me why exactextract CLI doesn't run into this issue. I did not dig into that, however.

jduckerOWP commented 1 year ago

@jdalrym2 Thank you for looking into this issue further! I've been stumped as to narrowing down the exact problem due to the fact that the exactextract CLI did not reproduce the same issue (which led me to believe that it had to initially had to be a bug in the python bindings, although nothing stood out as an issue through a debugger). The details you've described here completely make sense to me now with the underlying issue within the exactextract code. I'm glad it wasn't a significant bug in the exactextract code, but rather the way the operation objects were created with the proper register keys for each band of the TIF file. I have made the adjustments to the gdal_raster_wrapper.cpp file and have confirmed from my end as well that this code does indeed resolve the issues with specifying band numbers (tested TIF file with 8 different bands, all came out correctly) for a TIF file. I appreciate the time and support resolving this issue quickly with the python bindings functionality for TIF files. @hellkite500 is working on revising the ExactExtract branch that includes coverage fraction weights file output to be included as part of the master branch functionality for ExactExtract. At that point, we can reproduce the python bindings as well for this functionality in the future. Thanks again your collaboration efforts with this project!

chapmanjacobd commented 1 year ago
$ git clone -b add-pybindings-jd --single-branch https://github.com/jdalrym2/exactextract.git
...
$ cmake -DBUILD_CLI:=OFF -DBUILD_DOC:=OFF -DBUILD_PYTHON:=ON -DCMAKE_BUILD_TYPE=Release ..
$ make
...
[100%] Linking CXX shared module _exactextract.cpython-36m-x86_64-linux-gnu.so
lto-wrapper: warning: using serial compilation of 9 LTRANS jobs
lto-wrapper: note: see the β€˜-flto’ option documentation for more information
[100%] Built target _exactextract
61 81.3s exactextract:/cmake-build-release (add-pybindings-jd|βœ”) $ make install
make: *** No rule to make target 'install'.  Stop.

It looks like I end up with two files:

Where do I copy them to install it? I'm using a multi-stage docker setup with osgeo/gdal:ubuntu-small as builder and final base image

I'm guessing:

jdalrym2 commented 1 year ago

@chapmanjacobd Please try the instructions in the README.md in the python/ directory (using python3 bdist). That should resolve the placement of files automatically within your environment.

If that doesn't work feel free to follow up.

chapmanjacobd commented 1 year ago

Ohh... the python bindings don't support Virtual File Systems?

def get_ds_path(ds: Union[gdal.Dataset, ogr.DataSource]):
    """
    Get the file path of an OSGeo dataset

    Args:
        ds (Union[gdal.Dataset, ogr.DataSource]): Input OSGeo dataset

    Returns:
        str: File path to dataset
    """
    if isinstance(ds, gdal.Dataset):
        return ds.GetFileList()[0]     # type: ignore
    else:
        return ds.GetName()

https://gdal.org/doxygen/classGDALDataset.html#aefe21eb2d1c87304a14f81ee2e6d9717

I am extremely grateful that these python bindings exist but also I'm a little surprised by this. If I pass in a dataset and GDAL is doing fine, does it not make sense to pass that data through rather than try to re-open it?

jdalrym2 commented 1 year ago

@chapmanjacobd Direct passthrough of the GDALDataset object isn't possible due to how the different Python bindings work (I tried, it got hacky really quick). So a path must be extracted and then passed through with a new GDALDataset object created on the C++ side. I'd expect a virtual path would work as well... just haven't tested or implemented it. These bindings are pretty new, and were narrowly scoped for what I needed to do.

Feel free to implement it if you so wish and PR to my branch and/or write an issue and I can go take a look when I get the time. Thanks!

chapmanjacobd commented 1 year ago

Just to provide an update here: VFS works for gdal.Open-able paths with this patch. Thanks again for putting this together ! :)

mdsumner commented 1 year ago

fwiw, on the "right level of detail to expose" - I wish exactextractr didn't use the raster/terra format for input of extent,dimension,crs or for container of output. I wish there was an intermediate layer that simply did the work and returned cell index, weights, values etc. (I've done this for fasterize, for example, and ideally this core would live in the same package). I'll be exploring how that could work (not a huge amount of work to do it, but big changes if adopted!).

I hope the python side stays as modular as possible so we can re-use the excellent framework here in various ways.

hellkite500 commented 1 year ago

@mdsumner I have some work for getting at weights/index values that might help with your concept. I've been meaning to get that rebased and into a PR. I'll see if I can't put in a little time on that soon.

dbaston commented 1 year ago

I wish there was an intermediate layer that simply did the work and returned cell index, weights, values etc.

Unless I'm missing something, this can be done in R with exact_extract(rast, poly, include_cell = TRUE). I would expose the same in python.

mdsumner commented 1 year ago

I don't want to input the rast object, rather input just six numbers: dimension and extent, it would give weights and cell index - for reuse against values independently. Same could be true for sf, lazily stream in geos pointers (for example). But, I don't wish to hijack this topic, I'll actually do a small example and pursue discussion elsewhere (I'm very keen to discuss generally for R and Python development, this is an extremely powerful tool πŸ’ͺ)

jdalrym2 commented 1 year ago

I'll hijack the thread. ;) There's been lots of discussion, but I'm definitely curious how I could support a tangible path forward. It seems like some combination of two options could be viable:

  1. libexactextract has some kind of generalized API that both the R and Python bindings adhere to.
  2. The Python bindings break out into their own repo, with their own interface, and it gets maintained separately. In this case, it'd probably use this repo as a submodule, and updates would have to manually be propagated to the bindings library.

Any thoughts here? @hellkite500 , it sounds like you were working something related to this as well.

dbaston commented 1 year ago

My ideal path is to move some functionality from exactextractr back into the main library, then expose Python bindings that operate on "standard" python data types (geopandas/geojson polygons, gdal/xarray gridded data.) I've put in a proposal for funding to work on it, so...we'll see?

mdsumner commented 1 year ago

ow, I see that all the GDAL read/write is actually in this lib, that's why my "unpack" wish would seem weird - in terms of coverage_fraction I only want this interface, 6 numbers (extent, dimension) and a GEOS pointer, I can use this no problem:

// exactextract/src/raster_cell_intersection.h
// ...
 RasterCellIntersection(const Grid<bounded_extent> &raster_grid, GEOSContextHandle_t context, const GEOSGeometry *g);

so glad the lib is separated like this - will be following movements here, thanks for the feedback!

hellkite500 commented 1 year ago

I finally got my previous work organized and rebased, and I have put into #40. It may not be directly relevant to what @mdsumner is interested in, but it should provide a mechanism to facilitate obtaining the coverage fraction via the current Operation semantics, which I think would allow one to get to it via the python bindings here. Don't quote me on any of that just yet, I haven't fully gotten back up to speed on this thread (I worked on the coverage fraction code almost a year ago, and just now finally got it pushed up 😒 )

BwL1289 commented 1 year ago

Commenting to express excitement for this!

BwL1289 commented 1 year ago

sorry to be a pain! any update on this?

jduckerOWP commented 1 year ago

There has been a great deal of work done on the ExactExtract python bindings and thank you @jdalrym2 for leading this effort. There seems to be a significant interest (myself included) in integrating the latest ExactExtract code including coverage fraction estimates https://github.com/isciences/exactextract/pull/40) provided by @hellkite500 as the latest python bindings branch. In general, ExactExtract code base here has many potentials for the scientific community to approach rasterization techniques over a variety of domain configurations. It may be very useful to schedule a meeting to go over a brief code review of the latest branches we can integrate to the master branch and the python bindings. @dbaston @jdalrym2 @mdsumner @hellkite500 would you be interested in scheduling a meeting to go over potential initiatives moving forward for the ExactExtract repository?

jdalrym2 commented 1 year ago

I'd be happy to, sure!

dbaston commented 1 year ago

Sure, feel free to contact me at dbaston@isciences.com

hellkite500 commented 1 year ago

I would be happy to! Just let me know!

mdsumner commented 1 year ago

I'm keen, an observer atm but I am able to contribute πŸ‘Œ

jduckerOWP commented 1 year ago

@jdalrym2 @dbaston @hellkite500 @mdsumner Thank you all for your prompt responses to this thread! I've just sent a doodle poll to your email address attached to the GitHub account just now to see if we can find a time to fit in everyone's schedule here for next week. If you can respond to the doodle poll by this weekend, then I'll be sure to quickly schedule a meeting time for everyone's calendar next week if we can conquer on a time slot. Have a nice weekend.

dbaston commented 1 year ago

@jduckerOWP I didn't receive this (checked spam)

jduckerOWP commented 1 year ago

@dbaston @mdsumner I apologize if the Doodle poll invites didn't go through to your emails. Thank you for letting me know about this I have attached here the doodle link for you to fill out: https://doodle.com/meeting/participate/id/ermk0EWd

mdsumner commented 1 year ago

ah thanks, I filled the doodle - it's not great windows of opportunity for me so go ahead without me if I can't actually do it at the time

jduckerOWP commented 1 year ago

@jdalrym2 @dbaston @hellkite500 I just sent out an google meets invite to your emails for tomorrow at 10am-11-am EST (Dan I used your gmail and isciences email for an invite). Sorry for for the quick turnaround with this meeting but we were constrained to only essentially this time slot. Thanks for your responses @mdsumner and we’ll update you on the discussions after our meeting. Let me know if there are any issues with this time slot. I look forward to talking to you more tomorrow with initiatives for ExactExtract moving forward.

mdsumner commented 1 year ago

please share the invite you never know I might make it, username at gmail ty πŸ™

jduckerOWP commented 1 year ago

@mdsumner thank you for asking about this I meant to paste the meeting link in the previous message. ExactExtract Discussion and Initiatives Tuesday, Mar 28 Β· 10–11 AM Google Meet joining info Video call link: https://meet.google.com/sfn-xtwn-weg Or dial: +1 413-453-4733 PIN: 364518930 More phone numbers: https://tel.meet/sfn-xtwn-weg?pin=3747720308786

Hopefully you can make it @mdsumner !

mdsumner commented 1 year ago

thanks for a great discussion!

@dbaston I was wondering could we set up Discussions here for ongoing chats about related topics? happy to host elsewhere - I understand that a popular library like this it might look like a support forum ... I just think this is a good home for us to share ideas for now

dbaston commented 1 year ago

@mdsumner I've just enabled Discussions, or at least I think I have

jdalrym2 commented 1 year ago

@dbaston Thx! Probably a better place to discuss than a random PR. ;)

cverluiseQB commented 11 months ago

Hello there! Exciting work happening here. Just wondering, is there any update on this side? Keep on the great work! Cheers

dbaston commented 11 months ago

I've recently received a funding grant to produce Python bindings. I had originally planned to include something based on pybind11 in this project, but I'm also considering moving some of this library's functionality into GDAL so it would be available through those bindings. I don't have a concrete proposal yet.

cverluiseQB commented 11 months ago

Thanks! congrats on the funding. Looking forward!

mdsumner commented 11 months ago

great news @dbaston btw I think we could combine the fasterize and exactextract approaches in one C++ facility, hoping to expand on that and contribute in some way

jdalrym2 commented 11 months ago

Yep lmk if you need any pybind11 help. I know GDAL uses SWIG which may, for better or worse, negate a lot of this work.

dbaston commented 9 months ago

Continued in https://github.com/isciences/exactextract/pull/53