theroggy closed 11 months ago
Awesome! Yeah, after I published the post Kyle Barron pointed out that pyogrio would make things faster. It was on the long list of things to check out, so I really appreciate this PR.
And removing the spatial index does make sense too - I also thought about that after the post. Ideally there'd be an option to create it or not. Since then I've also realized that adding a quadkey column to GeoParquet (like in this code) serves as a decently effective spatial index. So we could use that for more of an apples-to-apples comparison when the spatial index is 'on'.
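For anyone unfamiliar with quadkeys, here is a rough sketch of how such a column could be computed per feature (for example from a centroid). It follows the standard Web Mercator tile scheme; the function names and the choice of zoom level are my own assumptions and are not taken from the code linked above.

```python
import math

MAX_LAT = 85.05112878  # latitude limit of the Web Mercator tile scheme


def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) tile containing a lon/lat point at the given zoom."""
    lat = max(min(lat, MAX_LAT), -MAX_LAT)
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so points on the far edge (lon = 180, lat = -MAX_LAT) stay in range.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_to_quadkey(x, y, zoom):
    """Interleave the bits of x and y into a base-4 string, one digit per zoom level."""
    digits = []
    for level in range(zoom, 0, -1):
        mask = 1 << (level - 1)
        digit = 0
        if x & mask:
            digit += 1
        if y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)


def lonlat_to_quadkey(lon, lat, zoom=12):
    """Quadkey for a point; zoom=12 is an arbitrary default for illustration."""
    x, y = lonlat_to_tile(lon, lat, zoom)
    return tile_to_quadkey(x, y, zoom)
```

Because a quadkey at a coarser zoom is a prefix of the quadkey at a finer zoom for the same point, sorting or filtering rows on this column groups nearby features together, which is why it can act as a cheap spatial index.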
Oh, and feel free to write a blog post with new numbers. I'd definitely promote it and link to it from my original post. I'd write it myself, but my side project queue is vast these days, so I doubt I'll get to it any time soon.
The other thing I really want is a new project that compares both read and write performance for any format and isn't limited to this Google building processing. Just simple conversions, but with results that are easy to report out. This was a side effort while I was working with a couple of datasets, but I think it'd be awesome as its own project. I'd be happy to pitch in if you start on that, and to help figure out a good home (perhaps in https://github.com/geopython).
If you're interested, feel free to contact me on Slack. I'm on the Cloud Native Geo Slack; you can use this invite link
I encountered a link to your blog post with some performance comparisons between file formats. Because the performance differences there were not quite what I expected, I got curious and had a look at the code.
This PR should give a boost to the performance of .fgb, .gpkg and .shp. I disabled creation of the spatial index on .gpkg because none of the other formats have a spatial index either, and creating it takes quite some time. If you want to do serious spatial analyses using SQL on the .gpkg file, the spatial index can obviously be a huge advantage, but I don't think that is the case here.
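As a rough illustration of the idea (not the exact diff in this PR), the sketch below writes through geopandas with the pyogrio engine and passes GDAL's `SPATIAL_INDEX=NO` layer creation option for GeoPackage only. The helper names `write_options` and `write_fast` are hypothetical, and it assumes a geopandas version that accepts `engine="pyogrio"` and forwards extra keyword arguments to the writer.

```python
from pathlib import Path

# GDAL driver names for the three formats touched by this change.
DRIVERS = {".gpkg": "GPKG", ".fgb": "FlatGeobuf", ".shp": "ESRI Shapefile"}


def write_options(path):
    """Build keyword arguments for GeoDataFrame.to_file based on the file suffix."""
    suffix = Path(path).suffix.lower()
    options = {"driver": DRIVERS[suffix], "engine": "pyogrio"}
    if suffix == ".gpkg":
        # GDAL layer creation option: skip building the GeoPackage spatial index.
        options["SPATIAL_INDEX"] = "NO"
    return options


def write_fast(gdf, path):
    """Write a GeoDataFrame with the options above (gdf is any geopandas.GeoDataFrame)."""
    gdf.to_file(path, **write_options(path))
```

Usage would look like `write_fast(gdf, "buildings.gpkg")`. Leaving the `SPATIAL_INDEX` option out restores GDAL's default behaviour for GeoPackage, which is to create the index.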