Closed jdalrym2 closed 9 months ago
Thank you for this! It'll take me a bit of time to go through this, but overall I'm much in favor. There has been quite a bit of demand for Python bindings.
cc @rsignell-usgs
Very helpful feedback @hellkite500 ! I'll do some revisions and push the changes soon.
Some of the questions on my mind are:
exactextractr
? Box
, Grid
, and Operation
have sufficient value to Python users that it is worth exposing them, thereby imposing a stability requirement on their interface? gdal.Dataset
and gdal.Band
rather than exporting classes such as GDALRasterWrapper
to Python? Similarly, would producing a pandas data frame be preferable to writing a CSV?
- Is this the right level of detail to expose, or should we prefer a higher-level interface such as the one exposed in
exactextractr
?
Are you suggesting expoing a single function interface which takes some "standard" python representation of inputs and outputs a single dataframe (or possibly a geodataframe?), similar to this one in exactextractr
- Do classes such as
Box
,Grid
, andOperation
have sufficient value to Python users that it is worth exposing them, thereby imposing a stability requirement on their interface?
I think some of these definitely do, especially the Grid.
. The Operation
class, could be argued that exposing that to Python may not be required, as most Operation
s could be proxyed to use the user behind some convenience functions/scripts.
That being said, these are part of the public
library API, so are you suggesting that libexactextract
isn't intended to be stable? Also, we can pin the python version to the library version, and it would just follow the semantic versioning of the lib. If the public API of the C++ classes change, that changes the libexactextract
semantic version, forcing a change in the python version (and possible the binding code supporting it). As far as ensuring the python bindings stay up-to-date as the public API of the lib evolves, it requires testing the bindings and updating them as the API they bind changes. Is that level of maintenance a concern here in this repository? We could do that via a separate repository and pin an exactextract
git submodule for the stable version of the lib we are binding.
- Should we deal with types such as
gdal.Dataset
andgdal.Band
rather than exporting classes such asGDALRasterWrapper
to Python? Similarly, would producing a pandas data frame be preferable to writing a CSV?
I would be interested in looking at fiona
, geopandas
, and python gdal
practices and see if working with something a little more "native" would make sense and be practical.
Are you suggesting expoing a single function interface which takes some "standard" python representation of inputs and outputs a single dataframe (or possibly a geodataframe?), similar to this one in exactextractr
My default implementation design would be a Python function that largely mimics the exact_extract
function in R. I think this has been fairly successful in R, so I wouldn't change it without a reason to do so. The implementation of that Python function may use what I consider to be low-level classes (Grid
, Operation
, etc.) but I would not expect a Python end user to work with these directly. It might make sense to move some of the logic in the R package into this repository; for example, code supporting the re-use of calculated coverage fractions for multiple raster inputs.
are you suggesting that libexactextract isn't intended to be stable?
It has no versioned releases, so I don't think there's a particular stability contract.
It has no versioned releases, so I don't think there's a particular stability contract.
Fair enough, I never really thought about it that way, haha.
Thanks for indulging this discussion, as it helps clear up the various use cases and the best way to serve them. I'm definitely in a weird place thinking through this having spent some time really trying to understand the backend (exactextract) and the the various ways to use it, especially WRT coverage fraction and its extraction/reuse. Would be happy to pick up that discussion on a separate thread.
I'm still kind of curious if there are use cases to take advantage of the Grid
abstractions outside the context of just raster stats. @jdalrym2 as the OP of this PR, did you have some intended use case outside what you walked through in the documentation sections of this PR? I think you did essentially what @dbaston was proposing in exposing a python interface to mimic the CLI
interface, but I think the proposition is to move that implementation more into a standard C++ function/class and use that to serve to the bindings to both R and Python.
Sorry I'm late to the conversation.
Is this the right level of detail to expose, or should we prefer a higher-level interface such as the one exposed in exactextractr?
Admittedly my process of binding this was just writing out the bindings for classes until I figured out how to make a workflow that worked. I didn't know enough of the library to make high level decisions like this. I actually didn't know exactextractr
existed until I was almost done. I will look into your interface there and form an opinion. It makes a lot of sense to mirror this from a maintainability perspective. My only holdup would be if there are Python antipatterns in your interface there. I'll return to this.
Do classes such as Box, Grid, and Operation have sufficient value to Python users that it is worth exposing them, thereby imposing a stability requirement on their interface?
Operation is needed for the way I'm binding things now (but this can change), but I'd actually agree Box
and Grid
might be too low-level for this interface (@hellkite500 I'm curious if you have a proposed Python use-case for Grid
). I bound those classes first and realized toward the end that they weren't needed for the final interface / workflow.
Should we deal with types such as gdal.Dataset and gdal.Band rather than exporting classes such as GDALRasterWrapper to Python? Similarly, would producing a pandas data frame be preferable to writing a CSV?
I deal with those types with my adapter classes (in the pure Python portion). Handling them directly without an adapter and without exposing GDALRasterWrapper
, etc, requires passing off memory addresses through the pybind interface (at least that's the only way I've been able to make it work), and it ended up being too hacky for comfort. Exposing GDALRasterWrapper
and writing Python code to help the user make these seemed like the best bet.
Further down the discussion:
It has no versioned releases, so I don't think there's a particular stability contract.
I'm curious if you'd like to start now. Having specific versioned releases could make this a good candidate for hosting on PyPI. Another option would be to move all of this to a separate repo and keeping the main repo as a submodule. The separate repo would have versioned releases tied to specific pinned commit hashes in this repo.
@jdalrym2 as the OP of this PR, did you have some intended use case outside what you walked through in the documentation sections of this PR?
No I didn't. I just started binding things until I got a solid workflow going. Admittedly little forethought. I actually considered removing the bindings that aren't directly in the workflow before submitting this PR, but decided against it pending feedback as to not erase work.
... but I think the proposition is to move that implementation more into a standard C++ function/class and use that to serve to the bindings to both R and Python.
I think this is the ideal move, but I'm definitely hesitant based on how much work this might involve. Perhaps you or @dbaston are rockstar C++ developers that could knock this out in an hour, but for me that'd be much, much longer. ;)
I'll return to this.
That was actually less to review than I thought. I tend not to be a fan of "all in one go" function calls, in-case there is setup overhead and the user would like to reuse existing classes or state. Idk if that really applies here, but I tend to lean toward a "setup first, then go second" sort of design pattern as opposed to massive one-liners. I don't really think there's a right answer here, and honestly I like the idea of doing both: a high level interface that mirrors exactextractr
closely that calls a somewhat lower-level, but still exposed interface in case the user desires more granularity.
@jdalrym2 I've recently tested the python bindings with a multi-band .tif file, and the processor module does not appear to be correctly processing multiple raster bands within the .tif file together properly when specified. Essentially, when a user initializes multiple rasters and operations for two separate bands within a .tif file, they are not properly processed within the "FeatureSequentialProcessor" step as you call both operations (e.g. FeatureSequentialProcessor(dsw, writer, [op1,op2]) for two separate raster bands. The processor output essentially only yields the first raster band output for both operations. However, when each raster operation is processed separately (op1, op2), they both produce the correct output values (when compared to results with original C+ ExactExtract executable). My only guess to this initial problem is that the band index isn't being properly processed with the "processor.py" module when multiple raster operations are ingested, but I am trying to narrow down potential coding problem now. I just wanted to loop you into this problem and see if you had any suggestions as to what may going on here. I would mention that the functionality for multiple rasters within a netcdf file does work properly for the ExactExtract python bindings, so this issue is isolated only for .tif files currently.
@jduckerOWP Thanks for the info and for looking into this! FWIW, I might not get to take a look for a couple days, but will follow up if I find anything.
Overall: I agree it's not behaving as expected and I don't believe I have tested my bindings on anything other than a single band so far.
UPDATE (@jduckerOWP ) I was able to write a unit test to replicate your issue and fix it. Please see commit e278076 for more details.
Interestingly, the bug seemed to be in the exactextract code itself (@dbaston pls look), where Operation
objects were keyed in the StatsRegistry
by the name
of their parent RasterSource
(i.e. GDALRasterWrapper
). However, in the case of TIF files, the band number did not exist in the name
, and so keys clashed for multiple operations that dealt with the same TIF. The solution was simply to append "|[bandnum]"
to the name of the parent GDALRasterWrapper
object on init.
It is a mystery to me why exactextract CLI doesn't run into this issue. I did not dig into that, however.
@jdalrym2 Thank you for looking into this issue further! I've been stumped as to narrowing down the exact problem due to the fact that the exactextract CLI did not reproduce the same issue (which led me to believe that it had to initially had to be a bug in the python bindings, although nothing stood out as an issue through a debugger). The details you've described here completely make sense to me now with the underlying issue within the exactextract code. I'm glad it wasn't a significant bug in the exactextract code, but rather the way the operation objects were created with the proper register keys for each band of the TIF file. I have made the adjustments to the gdal_raster_wrapper.cpp file and have confirmed from my end as well that this code does indeed resolve the issues with specifying band numbers (tested TIF file with 8 different bands, all came out correctly) for a TIF file. I appreciate the time and support resolving this issue quickly with the python bindings functionality for TIF files. @hellkite500 is working on revising the ExactExtract branch that includes coverage fraction weights file output to be included as part of the master branch functionality for ExactExtract. At that point, we can reproduce the python bindings as well for this functionality in the future. Thanks again your collaboration efforts with this project!
$ git clone -b add-pybindings-jd --single-branch https://github.com/jdalrym2/exactextract.git
...
$ cmake -DBUILD_CLI:=OFF -DBUILD_DOC:=OFF -DBUILD_PYTHON:=ON -DCMAKE_BUILD_TYPE=Release ..
$ make
...
[100%] Linking CXX shared module _exactextract.cpython-36m-x86_64-linux-gnu.so
lto-wrapper: warning: using serial compilation of 9 LTRANS jobs
lto-wrapper: note: see the β-fltoβ option documentation for more information
[100%] Built target _exactextract
61 81.3s exactextract:/cmake-build-release (add-pybindings-jd|β) $ make install
make: *** No rule to make target 'install'. Stop.
It looks like I end up with two files:
Where do I copy them to install it? I'm using a multi-stage docker setup with osgeo/gdal:ubuntu-small as builder and final base image
I'm guessing:
@chapmanjacobd Please try the instructions in the README.md in the python/ directory (using python3 bdist). That should resolve the placement of files automatically within your environment.
If that doesn't work feel free to follow up.
Ohh... the python bindings don't support Virtual File Systems?
def get_ds_path(ds: Union[gdal.Dataset, ogr.DataSource]):
"""
Get the file path of an OSGeo dataset
Args:
ds (Union[gdal.Dataset, ogr.DataSource]): Input OSGeo dataset
Returns:
str: File path to dataset
"""
if isinstance(ds, gdal.Dataset):
return ds.GetFileList()[0] # type: ignore
else:
return ds.GetName()
https://gdal.org/doxygen/classGDALDataset.html#aefe21eb2d1c87304a14f81ee2e6d9717
I am extremely grateful that these python bindings exist but also I'm a little surprised by this. If I pass in a dataset and GDAL is doing fine, does it not make sense to pass that data through rather than try to re-open it?
@chapmanjacobd Direct passthrough of the GDALDataset object isn't possible due to how the different Python bindings work (I tried, it got hacky really quick). So a path must be extracted and then passed through with a new GDALDataset object created on the C++ side. I'd expect a virtual path would work as well... just haven't tested or implemented it. These bindings are pretty new, and were narrowly scoped for what I needed to do.
Feel free to implement it if you so wish and PR to my branch and/or write an issue and I can go take a look when I get the time. Thanks!
Just to provide an update here: VFS works for gdal.Open
-able paths with this patch. Thanks again for putting this together ! :)
fwiw, on the "right level of detail to expose" - I wish exactextractr didn't use the raster/terra format for input of extent,dimension,crs or for container of output. I wish there was an intermediate layer that simply did the work and returned cell index, weights, values etc. (I've done this for fasterize, for example, and ideally this core would live in the same package). I'll be exploring how that could work (not a huge amount of work to do it, but big changes if adopted!).
I hope the python side stays as modular as possible so we can re-use the excellent framework here in various ways.
@mdsumner I have some work for getting at weights/index values that might help with your concept. I've been meaning to get that rebased and into a PR. I'll see if I can't put in a little time on that soon.
I wish there was an intermediate layer that simply did the work and returned cell index, weights, values etc.
Unless I'm missing something, this can be done in R with exact_extract(rast, poly, include_cell = TRUE)
. I would expose the same in python.
I don't want to input the rast object, rather input just six numbers: dimension and extent, it would give weights and cell index - for reuse against values independently. Same could be true for sf, lazily stream in geos pointers (for example). But, I don't wish to hijack this topic, I'll actually do a small example and pursue discussion elsewhere (I'm very keen to discuss generally for R and Python development, this is an extremely powerful tool πͺ)
I'll hijack the thread. ;) There's been lots of discussion, but I'm definitely curious how I could support a tangible path forward. It seems like some combination of two options could be viable:
libexactextract
has some kind of generalized API that both the R and Python bindings adhere to.Any thoughts here? @hellkite500 , it sounds like you were working something related to this as well.
My ideal path is to move some functionality from exactextractr
back into the main library, then expose Python bindings that operate on "standard" python data types (geopandas/geojson polygons, gdal/xarray gridded data.) I've put in a proposal for funding to work on it, so...we'll see?
ow, I see that all the GDAL read/write is actually in this lib, that's why my "unpack" wish would seem weird - in terms of coverage_fraction I only want this interface, 6 numbers (extent, dimension) and a GEOS pointer, I can use this no problem:
// exactextract/src/raster_cell_intersection.h
// ...
RasterCellIntersection(const Grid<bounded_extent> &raster_grid, GEOSContextHandle_t context, const GEOSGeometry *g);
so glad the lib is separated like this - will be following movements here, thanks for the feedback!
I finally got my previous work organized and rebased, and I have put into #40. It may not be directly relevant to what @mdsumner is interested in, but it should provide a mechanism to facilitate obtaining the coverage fraction via the current Operation semantics, which I think would allow one to get to it via the python bindings here. Don't quote me on any of that just yet, I haven't fully gotten back up to speed on this thread (I worked on the coverage fraction code almost a year ago, and just now finally got it pushed up π’ )
Commenting to express excitement for this!
sorry to be a pain! any update on this?
There has been a great deal of work done on the ExactExtract python bindings and thank you @jdalrym2 for leading this effort. There seems to be a significant interest (myself included) in integrating the latest ExactExtract code including coverage fraction estimates https://github.com/isciences/exactextract/pull/40) provided by @hellkite500 as the latest python bindings branch. In general, ExactExtract code base here has many potentials for the scientific community to approach rasterization techniques over a variety of domain configurations. It may be very useful to schedule a meeting to go over a brief code review of the latest branches we can integrate to the master branch and the python bindings. @dbaston @jdalrym2 @mdsumner @hellkite500 would you be interested in scheduling a meeting to go over potential initiatives moving forward for the ExactExtract repository?
I'd be happy to, sure!
Sure, feel free to contact me at dbaston@isciences.com
I would be happy to! Just let me know!
I'm keen, an observer atm but I am able to contribute π
@jdalrym2 @dbaston @hellkite500 @mdsumner Thank you all for your prompt responses to this thread! I've just sent a doodle poll to your email address attached to the GitHub account just now to see if we can find a time to fit in everyone's schedule here for next week. If you can respond to the doodle poll by this weekend, then I'll be sure to quickly schedule a meeting time for everyone's calendar next week if we can conquer on a time slot. Have a nice weekend.
@jduckerOWP I didn't receive this (checked spam)
@dbaston @mdsumner I apologize if the Doodle poll invites didn't go through to your emails. Thank you for letting me know about this I have attached here the doodle link for you to fill out: https://doodle.com/meeting/participate/id/ermk0EWd
ah thanks, I filled the doodle - it's not great windows of opportunity for me so go ahead without me if I can't actually do it at the time
@jdalrym2 @dbaston @hellkite500 I just sent out an google meets invite to your emails for tomorrow at 10am-11-am EST (Dan I used your gmail and isciences email for an invite). Sorry for for the quick turnaround with this meeting but we were constrained to only essentially this time slot. Thanks for your responses @mdsumner and weβll update you on the discussions after our meeting. Let me know if there are any issues with this time slot. I look forward to talking to you more tomorrow with initiatives for ExactExtract moving forward.
please share the invite you never know I might make it, username at gmail ty π
@mdsumner thank you for asking about this I meant to paste the meeting link in the previous message. ExactExtract Discussion and Initiatives Tuesday, Mar 28 Β· 10β11 AM Google Meet joining info Video call link: https://meet.google.com/sfn-xtwn-weg Or dial: +1 413-453-4733 PIN: 364518930 More phone numbers: https://tel.meet/sfn-xtwn-weg?pin=3747720308786
Hopefully you can make it @mdsumner !
thanks for a great discussion!
@dbaston I was wondering could we set up Discussions here for ongoing chats about related topics? happy to host elsewhere - I understand that a popular library like this it might look like a support forum ... I just think this is a good home for us to share ideas for now
@mdsumner I've just enabled Discussions, or at least I think I have
@dbaston Thx! Probably a better place to discuss than a random PR. ;)
Hello there! Exciting work happening here. Just wondering, is there any update on this side? Keep on the great work! Cheers
I've recently received a funding grant to produce Python bindings. I had originally planned to include something based on pybind11 in this project, but I'm also considering moving some of this library's functionality into GDAL so it would be available through those bindings. I don't have a concrete proposal yet.
Thanks! congrats on the funding. Looking forward!
great news @dbaston btw I think we could combine the fasterize and exactextract approaches in one C++ facility, hoping to expand on that and contribute in some way
Yep lmk if you need any pybind11 help. I know GDAL uses SWIG which may, for better or worse, negate a lot of this work.
Continued in https://github.com/isciences/exactextract/pull/53
Hi there! This is a wonderful project and wanted to take the time to create a Python wrapper (#7) to use in my project. I additionally included a set of unit tests and documentation for the wrapper. Let me know if you'd like to get this merged. Thanks!
Jon