opengeos / open-buildings

Tools for working with open building datasets
https://opengeos.github.io/open-buildings

Implement `split_multipolygons` for OGR process #6

Open cholmes opened 1 year ago

cholmes commented 1 year ago

Description

Add the splitting of multipolygons to the ogr process. I'm not sure if it's possible to do this operation with a pure CLI call, so it may need to make use of Fiona, but that may lose the speed of the column-oriented API. So if it ends up being about the same speed as pandas (with Fiona under the hood), then perhaps we just don't implement it.
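For this CSV-with-WKT case the splitting itself doesn't strictly need a geometry library at all. A minimal standard-library sketch (not a committed design; the function name is made up) that turns one MULTIPOLYGON WKT string into one POLYGON WKT per part:

```python
# Hypothetical sketch: split a MULTIPOLYGON WKT into individual POLYGON
# WKTs using only string scanning. Polygons with holes work too, since
# a hole's parentheses sit inside a depth-1 group.

def split_multipolygon_wkt(wkt: str) -> list[str]:
    """Return one POLYGON WKT per part; pass other geometries through."""
    wkt = wkt.strip()
    if not wkt.upper().startswith("MULTIPOLYGON"):
        return [wkt]
    # Strip the outermost "MULTIPOLYGON ( ... )" wrapper.
    body = wkt[wkt.index("(") + 1 : wkt.rindex(")")]
    polygons, depth, start = [], 0, None
    for i, ch in enumerate(body):
        if ch == "(":
            if depth == 0:
                start = i  # a new top-level polygon begins here
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:  # top-level polygon closed
                polygons.append(f"POLYGON {body[start : i + 1]}")
    return polygons
```

Whether this beats a Fiona- or pandas-based explode on speed would of course need benchmarking.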

theroggy commented 10 months ago

You can pass the `-explodecollections` parameter to `ogr2ogr` to convert multi-part geometries to single part in the output. However, this won't update the area obviously.
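A hypothetical invocation of the call theroggy describes, wrapped in Python to keep it scriptable (filenames are made up; this assumes GDAL's `ogr2ogr` is on PATH and built with GeoParquet support, i.e. GDAL >= 3.5):

```python
# Sketch of an ogr2ogr call that converts a CSV-with-WKT file while
# exploding multi-part geometries into single parts. All filenames are
# hypothetical.
import os
import shutil
import subprocess

cmd = [
    "ogr2ogr",
    "-f", "Parquet",                        # GeoParquet output driver
    "buildings.parquet",                    # hypothetical output file
    "buildings.csv",                        # hypothetical CSV-with-WKT input
    "-oo", "GEOM_POSSIBLE_NAMES=geometry",  # where the CSV driver finds WKT
    "-explodecollections",                  # multi-part -> single-part
]
# Guarded so the sketch is a no-op when GDAL or the input is absent.
if shutil.which("ogr2ogr") and os.path.exists("buildings.csv"):
    subprocess.run(cmd, check=True)
```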

I wonder though, could you explain why you want to explode the geometries?

cholmes commented 10 months ago

> You can pass the `-explodecollections` parameter to ogr2ogr to convert multi-part geometries to single part in the output. However, this won't update the area obviously.

Ah, good to know. But yeah, this feels like it needs a bit more customization than what you can do with GDAL out of the box.

> I wonder though, could you explain why you want to explode the geometries?

It's really just for this particular Google buildings dataset. It's distributed as CSV with WKT, and some small percentage of the geometries are multipolygons (certainly less than 1%, perhaps even less than 0.1%?). The dataset was clearly made by computer vision people who don't understand geospatial, and in my experience a number of tools work better if you have all of one geometry type. Yes, shapefile munges them together, so most tools 'can deal', but it feels far cleaner to have exactly one geometry type - especially with these buildings, it makes sense to me that each building would be one row.

But as I mentioned in https://github.com/opengeos/open-buildings/pull/53#issuecomment-1794973859 it'd be much nicer to just have a clean library that compares read and write performance from any major format to another. I'd not even include 'csv' in that, and it wouldn't need to do any exploding of geometries.

theroggy commented 10 months ago

> You can pass the `-explodecollections` parameter to ogr2ogr to convert multi-part geometries to single part in the output. However, this won't update the area obviously.

> Ah, good to know. But yeah, this feels like it needs a bit more customization than what you can do with GDAL out of the box.

It depends... you don't need any customization, but without it the area will have to be recalculated for all rows, which is a bit less efficient given how low the percentage of exploded rows is.

> I wonder though, could you explain why you want to explode the geometries?

> It's really just for this particular Google buildings dataset. It's distributed as CSV with WKT, and some small percentage of the geometries are multipolygons (certainly less than 1%, perhaps even less than 0.1%?). The dataset was clearly made by computer vision people who don't understand geospatial, and in my experience a number of tools work better if you have all of one geometry type. Yes, shapefile munges them together, so most tools 'can deal', but it feels far cleaner to have exactly one geometry type - especially with these buildings, it makes sense to me that each building would be one row.

OK. I always do it the other way around: if there is a mixture, I convert everything to MultiPolygon so it can be stored in one table/file. FYI: pyogrio automatically converts all geometries to MultiPolygons if you save a GeoDataFrame with both Polygons and MultiPolygons.
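The opposite direction theroggy describes is even simpler to sketch at the WKT level, again standard library only and with a made-up function name, so a mixed column ends up with one uniform geometry type:

```python
# Hypothetical sketch: promote a bare POLYGON WKT to a single-part
# MULTIPOLYGON, passing existing MULTIPOLYGONs through unchanged.

def promote_to_multipolygon(wkt: str) -> str:
    """Wrap POLYGON WKT as MULTIPOLYGON; leave MULTIPOLYGONs as-is."""
    wkt = wkt.strip()
    if wkt.upper().startswith("MULTIPOLYGON"):
        return wkt
    if wkt.upper().startswith("POLYGON"):
        body = wkt[wkt.index("("):]  # "((ring), (hole), ...)"
        return f"MULTIPOLYGON ({body})"
    raise ValueError(f"unexpected geometry type: {wkt[:20]}")
```

This mirrors what pyogrio does automatically for mixed GeoDataFrames, just shown as plain string handling.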

> But as I mentioned in #53 (comment) it'd be much nicer to just have a clean library that compares read and write performance from any major format to another. I'd not even include 'csv' in that, and it wouldn't need to do any exploding of geometries.

I'm not sure I'll get to it, at least not in the short term, but if you're interested, you can find some other benchmarks involving geo operations I did in the past here: https://github.com/geofileops/geobenchmark

cholmes commented 10 months ago

> It depends... you don't need any customization, but without it the area will have to be recalculated for all rows, which is a bit less efficient given how low the percentage of exploded rows is.

Yeah, I just meant you can't do an easy one-liner from ogr2ogr that does it all in one go. And agreed, a second run just to recalculate area won't make the comparison great. I think it's fine for it not to explode rows; the other two options just enabled this all in one pass.
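For what it's worth, the "recalculate only the exploded rows" idea is cheap to sketch in plain Python with the shoelace formula (all names are hypothetical, and a real pipeline would need projected coordinates for the area to be meaningful):

```python
# Hypothetical sketch: after exploding, recompute area only for rows
# that came from multipolygons, leaving the ~99% of untouched rows with
# their original area values.

def shoelace_area(ring: list[tuple[float, float]]) -> float:
    """Planar area of a closed ring of (x, y) vertices."""
    s = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

rows = [
    # Row that was already a single polygon: keep its stored area.
    {"area": 12.5, "was_multi": False, "ring": None},
    # Row produced by exploding a multipolygon: area must be recomputed.
    {"area": None, "was_multi": True,
     "ring": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]},
]
for row in rows:
    if row["was_multi"]:  # only exploded rows need recomputation
        row["area"] = shoelace_area(row["ring"])
```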

> OK. I always do it the other way around: if there is a mixture, I convert everything to MultiPolygon so it can be stored in one table/file.

Yeah, that's the practical way to do things, given the state of geospatial data formats (shapefile still being widely used) and the state of the tools. With this I was working towards distributing data in a 'better' way, and it just strikes me it's better to be able to differentiate between multipolygons and polygons. If this was 'facilities' that could include multiple buildings in each, then a multipolygon makes sense. If it's supposed to be every building, but some are squeezed into a single geometry, then that makes less sense.

> FYI: pyogrio automatically converts all geometries to MultiPolygons if you save a GeoDataFrame with both Polygons and MultiPolygons.

Cool - good to know. I forget what tool I was working with, but there was one that was barfing if you just threw this dataset at it.

> I'm not sure I'll get to it, at least not in the short term, but if you're interested, you can find some other benchmarks involving geo operations I did in the past here: https://github.com/geofileops/geobenchmark

Oh nice! I'll check it out.