opengeos / open-buildings

Tools for working with open building datasets
https://opengeos.github.io/open-buildings

Some performance quick wins for the geopandas implementation #53

Closed. theroggy closed this 11 months ago

theroggy commented 11 months ago

I came across a link to your blog post with some performance comparisons between file formats. Because the performance differences there were not quite what I expected, I got curious and had a look at the code.

This PR should give a boost to the performance of .fgb, .gpkg and .shp. I disabled creation of the spatial index on .gpkg because none of the other formats have a spatial index either, and creating it takes quite some time. If you want to do serious spatial analysis using SQL on the .gpkg file, the spatial index can obviously be a huge advantage, but I don't think that is the case here.

cholmes commented 11 months ago

Awesome! Yeah, after I published the post Kyle Barron pointed out that pyogrio would make things faster. It was on the long list of things to check out, so I really appreciate this PR.

And removing the spatial index does make sense too - I also thought about that after the post. Ideally there'd be an option to create it or not. Since then I've also realized that adding a quadkey column to GeoParquet (like in this code) serves as a decently effective spatial index. So we could use that for a more apples-to-apples comparison when the spatial index is 'on'.
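To make the quadkey idea concrete, here is a minimal sketch of how a quadkey can be derived from a longitude/latitude pair. It assumes the Bing Maps tile system convention (Web Mercator tiles, one base-4 digit per zoom level) and is not the implementation linked above. The function names and the choice of zoom level are assumptions for illustration.

```python
import math

MAX_LAT = 85.05112878  # Web Mercator latitude limit used by the tile scheme


def tile_to_quadkey(x: int, y: int, zoom: int) -> str:
    """Interleave the bits of tile x/y into a base-4 quadkey string."""
    digits = []
    for i in range(zoom, 0, -1):
        mask = 1 << (i - 1)
        digit = 0
        if x & mask:
            digit += 1
        if y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)


def lonlat_to_quadkey(lon: float, lat: float, zoom: int) -> str:
    """Return the quadkey of the tile containing (lon, lat) at the given zoom."""
    lat = max(min(lat, MAX_LAT), -MAX_LAT)
    n = 1 << zoom  # number of tiles per axis at this zoom level
    sin_lat = math.sin(math.radians(lat))
    x = int((lon + 180.0) / 360.0 * n)
    y = int((0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)) * n)
    # Clamp so points on the far edge still fall in a valid tile.
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return tile_to_quadkey(x, y, zoom)
```

Because the quadkey of a tile is a prefix of the quadkeys of all its child tiles, sorting or partitioning rows by a quadkey column computed from, say, each building's centroid keeps nearby features close together in the file, which is what lets it act as a rough spatial index.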

cholmes commented 11 months ago

Oh, and feel free to write a blog post with new numbers. I'd definitely promote it and link to it from my original post. I'd write it myself, but my side project queue is vast these days, so I doubt I'll get to it any time soon.

The other thing I really want is a new project that compares both read and write performance for any format and isn't limited to this Google building processing. Just simple conversions, but with results that are easy to report out. This was only a side effort while I was working with a couple of datasets, but I think it'd be awesome as its own project. I'd be happy to pitch in if you start on that, and to help figure out a good home (perhaps in https://github.com/geopython).

If you're interested, feel free to contact me on Slack. I'm on the Cloud Native Geo Slack; you can use this invite link