Open emlys opened 3 weeks ago
+1 from me to create a helper function for this! It reminds me a bit of `pygeoprocessing.geoprocessing.shapely_geometry_to_vector` (at least the attribute portion of it), but with a focus on setting attributes.
But also, the `apply` terminology makes me think of pandas/geopandas. At this point, especially since we are using conda for our package management, should we consider moving vector operations like this to geopandas?
My only concern with geopandas is memory usage: it doesn't seem to be designed for efficiency on arbitrarily large vectors. It's possible to read in a subset of a vector with the `rows` option to `geopandas.read_file`, but I'm not aware of any way to work with a whole geodataframe without reading it all into memory.
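For reference, a minimal sketch of the `rows` approach mentioned above. It reads a vector in fixed-size chunks, so only one chunk is held in memory at a time. It does not give you a whole geodataframe, and it assumes the feature count is already known. The names `chunk_slices` and `iter_vector_chunks` are hypothetical; `geopandas` is imported lazily so the slicing helper can be used without it.

```python
def chunk_slices(n_features, chunk_size):
    """Yield slices that cover range(n_features) in chunk_size steps."""
    for start in range(0, n_features, chunk_size):
        yield slice(start, min(start + chunk_size, n_features))


def iter_vector_chunks(vector_path, n_features, chunk_size=10_000):
    """Yield one GeoDataFrame per chunk of features in the vector.

    Assumes ``n_features`` is known in advance. Each chunk is a separate
    ``geopandas.read_file`` call, so the file is reopened per chunk.
    """
    import geopandas  # lazy import: only needed when actually reading

    for rows in chunk_slices(n_features, chunk_size):
        yield geopandas.read_file(vector_path, rows=rows)
```

Each chunk could then be processed and discarded before the next one is read, at the cost of reopening the file every time.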
Thanks for this, Emily. It's a +1 from me generally. I would also be concerned about the performance and efficiency of Pandas, as the backend of GeoPandas. Here's a page where they talk about scaling. But maybe it would be interesting to do a side-by-side comparison with small, medium, and large vector use cases.
This does make me reflect on some conversations we've had about standardizing / building out a vector API in pygeoprocessing. We haven't given vectors the treatment that we've given rasters. I'm not saying we need to tackle that now, but am curious about whether something like this makes sense to live in PGP from the start?
My only feature recommendation would be to provide an optional "copy" argument. I could see not wanting to edit the vector directly but instead make a copy. This could be a separate step beforehand, but might be a nice convenience feature too.
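To illustrate, here is a rough sketch of how an optional copy could work. It assumes a single-file vector format (GeoPackage or GeoJSON, for instance); multi-file formats such as shapefiles would need a driver-level copy instead. The function name `edit_vector` and the `target_path` parameter are hypothetical.

```python
import shutil


def edit_vector(vector_path, edit, target_path=None):
    """Run ``edit(path)`` on the vector, optionally on a copy.

    If ``target_path`` is given, the source file is copied there first and
    the copy is edited, leaving the source untouched. Otherwise the source
    is edited in place. Returns the path that was edited.
    """
    path_to_edit = vector_path
    if target_path is not None:
        # Plain file copy: only valid for single-file vector formats.
        shutil.copyfile(vector_path, target_path)
        path_to_edit = target_path
    edit(path_to_edit)
    return path_to_edit
```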
It's a very common pattern that we iterate over each feature in a vector, do something with the feature's attributes and/or geometry, and write new attributes to the feature. A lot of the details of this process could be abstracted away with a wrapper function, something like
Which could be used like this (simple example from AWY):
Additional features could be
- `enumerated`, which if True, enables enumeration of the features. If `enumerated`, `op` would be called with `(index, feature)` rather than just `(feature)`. I saw a couple of cases where this would be useful.
- handling for when `op` raises an error

I count several instances in invest where this pattern could simplify existing code, for example:
- `compute_water_yield_volume`
- `compute_watershed_valuation`
- `compute_rsupply_volume`
- `calculate_uhi_result_vector`
- `calculate_energy_savings`
- `_aggregate_carbon_map`
Benefits would be to reduce redundant code and to make sure we're consistently using the best patterns for working with GDAL vectors (opening and closing correctly, saving to disk, using exceptions: #638), while preserving memory efficiency, since features are processed one at a time.
This would be sort-of parallel to pygeoprocessing's raster utilities, which are much more developed than our vector utilities.
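As a rough illustration of the pattern described in this issue (a sketch only, not the exact proposal or an existing pygeoprocessing API), the wrapper could be split into a layer-level loop and a path-level function that handles opening and closing. The names `apply_to_layer` and `apply_to_vector` are hypothetical; `op` and `enumerated` follow the description above. Rolling back when `op` raises is one possible choice, and it only has an effect for drivers that support transactions.

```python
def apply_to_layer(layer, op, enumerated=False):
    """Call ``op`` on each feature of ``layer`` and write the feature back.

    ``layer`` is expected to behave like an ``ogr.Layer`` opened for update.
    Features are handled one at a time, so memory use stays flat.
    """
    layer.StartTransaction()
    try:
        for index, feature in enumerate(layer):
            if enumerated:
                op(index, feature)
            else:
                op(feature)
            layer.SetFeature(feature)  # persist the modified feature
    except Exception:
        # Assumed behavior: discard pending edits, then re-raise.
        layer.RollbackTransaction()
        raise
    layer.CommitTransaction()


def apply_to_vector(vector_path, op, enumerated=False):
    """Open ``vector_path`` for update and apply ``op`` to its first layer."""
    from osgeo import gdal  # lazy import so apply_to_layer works without GDAL

    gdal.UseExceptions()
    vector = gdal.OpenEx(vector_path, gdal.OF_VECTOR | gdal.OF_UPDATE)
    layer = vector.GetLayer()
    try:
        apply_to_layer(layer, op, enumerated=enumerated)
    finally:
        # Dropping the references closes the dataset and flushes to disk.
        layer = None
        vector = None


# Hypothetical usage, assuming fields "area" and "depth" already exist
# and a "volume" field has been created on the layer:
#
# def set_volume(feature):
#     feature.SetField(
#         "volume", feature.GetField("area") * feature.GetField("depth"))
#
# apply_to_vector("watersheds.gpkg", set_volume)
```

Keeping the loop separate from the file handling means the per-feature logic can be exercised with a stand-in layer object, without a real vector on disk.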