pelias / wof-admin-lookup

Who's on First Admin Lookup for the Pelias Geocoder
https://pelias.io
MIT License
9 stars 24 forks source link

Replacing raw wof with proper database #240

Closed NickStallman closed 3 years ago

NickStallman commented 5 years ago

I'm looking at getting the WOF dataset in to a proper database anyway for another project, possibly PostGIS, and I'm curious if there has been any past consideration to doing this for Pelias' import process as an additional option.

Specifically I've noticed a few issues with the current arrangement:

So while I'm importing it in to PostGIS for my own purposes I can also knock up a quick PostGIS query server that's compatible with pip-service to compare the two - since it can be queried via HTTP it makes it very easy to do. :) Another benefit is I can use the sqlite database to get the data in to PostGIS instead of the flat json files. PostGIS is probably a reasonable optional dependency to add if Mapnik is being used as well.

Before I do this, has anyone else looked at this possibility before?

orangejulius commented 5 years ago

We have looked at this. We're big fans of SQLite both for simplicity and it's incredible read-only performance.

There is a need with Pelias to have an extremely fast, in memory point-in-polygon service, and we'll be improving the PIP service as time goes on. It will likely never have super fast loading time, or low memory usage, but our import process is mostly bottlenecked by how fast the PIP service can go, so it's worth those tradeoffs.

However, we do plan to switch the PIP service to reading from a line-delimited JSON format for administrative data (which would initially, but doesn't have to be, a pure transformation of WOF data). We could also perform some of the cleanup steps we currently perform every time the PIP service starts (like removing undesirable records, simplifying polygons, etc) once, when building this dataset, rather than every time the PIP service starts. That will solve the tons of inodes problem.

We've considered also having a disk-backed PIP service that is a drop-in replacement. This would make sense for a lot of production clusters that aren't doing imports, since they might want the PIP data handy, but not want to pay the cost of many GB of RAM to get that extra bit of speed. SQLite, like I said, is plenty fast.

So yeah, if exploring that second service is of interest to you, we'd gladly chat about it further.

orangejulius commented 5 years ago

It's also worth noting that @missinglink has already played with SQLite a bit for the PIP service, with pretty encouraging results.

missinglink commented 5 years ago

I have had a lot of success in the past using spatialite for PIP, I believe I was getting several thousand results per second across all layers on a single-core.

If you go down that route I'd strongly recommend you enable mmap and you split your polygons up, I use a function which split them in half recursively until the polygon contains no more than N vertices, this way you can ensure linear performance regardless of the number of records or vertices in a polygon.

check out:

missinglink commented 5 years ago

Also the nice thing about spatialite is that there is one file, so it's super portable, the load time is milliseconds and the RAM usage is very low.

The cons are that it's slightly difficult to install and you really want to compile a modern version to ensure that this bug I reported back in 2016 is not present in your version.

missinglink commented 5 years ago

This site is running off spatialite: https://pip.synergy.io

NickStallman commented 5 years ago

Haha d'oh sorry guys, here I was assuming sqlite didn't have spatial extensions available. Yes it makes much more sense than PostGIS.

Using spatialite directly resolves the majority of the points I mentioned.

orangejulius commented 3 years ago

This issue definitely describes the Spatial Service. Coming soon(ish) to a Pelias near you.