Proposal: utility to apply a function to each feature in a vector

emlys commented 3 weeks ago

It's a very common pattern that we iterate over each feature in a vector, do something with the feature's attributes and/or geometry, and write new attributes to the feature. A lot of the details of this process could be abstracted away with a wrapper function, something like

def vector_apply(vector_path, op, new_fields=[]):
    vector = gdal.OpenEx(vector_path, gdal.OF_VECTOR | gdal.GA_Update)
    layer = vector.GetLayer()

    for field_name, field_type in new_fields:
        field_defn = ogr.FieldDefn(field_name, field_type)
        field_defn.SetWidth(24)
        field_defn.SetPrecision(11)
        layer.CreateField(field_defn)

    layer.ResetReading()
    layer.StartTransaction()

    # add error handling - finally write to file and save
    # add timed logging
    for feature in layer:
        op(feature)
        layer.SetFeature(feature)

    layer.CommitTransaction()
    layer.SyncToDisk()

Which could be used like this (simple example from AWY):

def compute_water_yield_volume(watershed_results_vector_path):
    vol_name = 'wyield_vol'
    def wyield_vol_op(feat):
        wyield_mn = feat.GetField('wyield_mn')
        # there won't be a wyield_mn value if the polygon feature only
        # covers nodata raster values, so check before doing math.
        if wyield_mn is not None:
            geom = feat.GetGeometryRef()
            # Calculate water yield volume,
            # 1000 is for converting the mm of wyield to meters
            vol = wyield_mn * geom.Area() / 1000
            # Get the volume field index and add value
            feat.SetField(vol_name, vol)

    utils.vector_apply(
        watershed_results_vector_path, wyield_vol_op,
        new_fields=[(vol_name, ogr.OFTReal)])

Additional features could be

a boolean arg enumerated, which if True, enables enumeration of the features. If enumerated, op would be called with (index, feature) rather than just (feature). I saw a couple of cases where this would be useful.
Error handling to make sure that progress is saved and written to disk if op raises an error
Timed progress logging
Handle setting shapefile field width and precision for different dtypes

I count several instances in invest where this pattern could simplify existing code, for example:

AWY compute_water_yield_volume
AWY compute_watershed_valuation
AWY compute_rsupply_volume
UCM calculate_uhi_result_vector
UCM calculate_energy_savings
Forest carbon _aggregate_carbon_map

Benefits would be to reduce redundant code and to make sure we're consistently using the best patterns for working with GDAL vectors (opening and closing correctly, saving to disk, using exceptions: #638). While preserving memory efficiency, since features are processed one at a time.

This would be sort-of parallel to pygeoprocessing's raster utilities, which are much more developed than our vector utilities.

phargogh commented 3 weeks ago

+1 from me to create a helper function for this! It reminds me a bit of pygeoprocessing.geoprocessing.shapely_geometry_to_vector (at least the attribute portion of it), but with a focus on setting of attributes.

But also, the apply terminology makes me think of pandas/geopandas. At this point, especially since we are using conda for our package management, should we consider moving vector operations like this to geopandas?

emlys commented 3 weeks ago

My only concern with geopandas is memory usage, it doesn't seem to be designed for efficiency on arbitrarily large vectors. It's possible to read in a subset of a vector with the rows option to geopandas.read_file, but I'm not aware of any way to work with a whole geodataframe without reading it all into memory

dcdenu4 commented 1 week ago

Thanks for this Emily, it's a +1 from me generally. I would also be concerned with Pandas performance and efficiencies too, as the backend of GeoPandas. Here's a page where they talk about scaling. But maybe it would be interesting to do a side by side comparison with a small, medium, and large vector use case.

This does make me reflect on some conversations we've had about standardizing / building out a vector API in pygeoprocessing. We haven't given vectors the treatment that we've given rasters. I'm not saying we need to tackle that now, but am curious about whether something like this makes sense to live in PGP from the start?

My only feature recommendation would be to provide an optional "copy" argument. I could see not wanting to edit the vector directly but instead make a copy. This could be a separate step beforehand, but might be a nice convenience feature too.

natcap / invest

Proposal: utility to apply a function to each feature in a vector #1595