protomaps / OSMExpress

Fast database file format for OpenStreetMap
BSD 2-Clause "Simplified" License
229 stars 19 forks source link

Add polygon extract queries to the Python API #22

Closed CloudNiner closed 8 months ago

CloudNiner commented 4 years ago

We're exploring an application of OSMExpress that is interested in retrieving objects filtered by specific tags from a polygon bounding area.

In its simplest form, I'm imagining a python method that is something like:

# Polygon could be any number of things, such as a Shapely polygon, 
# a python dict representation of GeoJson, a list of coordinate tuples or a bbox
def extract(region: Polygon): [CellIds]

that reproduces the functionality of https://github.com/protomaps/OSMExpress/blob/master/src/extract.cpp#L130-L134

bdon commented 4 years ago

+1, should be doable by:

  1. bring in the s2 bindings
  2. use the s2 region coverer on your input geometry
  3. traverse the cell index based on prefix of the coverings, putting node IDs into a set
  4. do a pass over the set of nodes, finding the set of referenced ways, relations... etc
CloudNiner commented 3 years ago

I attempted a first pass at this, but got a bit stuck. From your comment, are you suggesting that this can be done entirely in Python using the existing osmx bindings, perhaps pending addressing #23?

I'm using the s2sphere library as it appears to provide an s2geometry installation in an easily pip installable package so that should address getting cell ids.

In attempting to unpack

traverse the cell index based on prefix of the coverings

Can I pass the cell id I get back from RegionCoverer directly to the osmx bindings Index class? Is the name argument to pass to Index constructor the same as the lmdb table names, e.g. cell_node?

bdon commented 3 years ago

Can I pass the cell id I get back from RegionCoverer directly to the osmx bindings Index class?

Yes, if your coverer has both minimum and maximum cell levels set to 16, you can then query cell_node directly with this cell ID. This can be implemented as a first pass. It's not optimal though, because a huge region might need thousands or even millions of level 16 cells. A more efficient way is to let cells be anywhere from level 0 to 16 for a sparse representation of the covering; because every S2 parent cell shares a prefix with all its children, you can use LMDB cursor ranges to iterate over child cells efficiently like is done in the C++ code https://github.com/protomaps/OSMExpress/blob/master/src/storage.cpp#L171

The s2sphere package looks neat, I haven't tried it. I believe the S2 source tree itself has "official" python bindings but unsure what the workflow is to get that packaged up neatly. If the s2sphere library is easier to install and has everything we need, we should use that.

bdon commented 3 years ago

Is the name argument to pass to Index constructor the same as the lmdb table names, e.g. cell_node?

I think cell_node should work the same way as the other subclasses of Index, they're all just tables from int64 -> [int64]. Something like

class CellNode(Index):
    def __init__(self,txn):
        super().__init__(txn,cell_node')
CloudNiner commented 3 years ago

if your coverer has both minimum and maximum cell levels set to 16

Looks like I wasn't setting minimum as well as maximum. Will play around with this a bit more. The s2sphere region coverer returns a list of CellId which does have the child_begin and child_end methods on it so that more efficient implementation seems feasible at first glance.

I believe the S2 source tree itself has "official" python bindings

It does here but it's not obvious to me either how to include them in a python/pip project easily.

other subclasses of Index

🤦 sometimes reading source on a Monday am is hard?

Thanks for the hints, I'll update with progress, no ETA though. If you find another use case for this or begin working on it let me know.

CloudNiner commented 3 years ago

I managed to spend some time on this today, creating a Python CLI that generates a GeoJson file of nodes and ways filtered by the requested tags and bounding box.

Lots of possible future work left here, including but not limited to:

https://gist.github.com/CloudNiner/8b020f434d0509419032c6b2fe0577c5

bdon commented 3 years ago

Looks good so far! some random thoughts, some directly related and some more speculative:

Correctly generate LineString or Polygon for ways (currently always a LineString) Relation support

This should be done outside of osmx, but I think for ways you need to 1) ensure the orientation is correct and 2) ensure it is not self-intersecting, which OSM ways can be, but is not allowed per the OGC Simple Features spec If you intend to also support MultiPolygons, I think it's a pretty tough nut to crack, and a good implementation of multi polygon builders exists already within Osmium. Something I want to investigate is if a osmium handler can be "fed" directly from an osmx database instead of an .osm.pbf file

Support Polygon as input in addition to BBox Refactor the core operation of returning a list of OSM objects within a polygon to a method that can live in OSMExpress

the osmx API should only have one way to specify the query region, taking a covering - a list of cell IDs - as the argument. We should have some helper methods to convert from common geometry specs to S2 regions and coverings. In the C++ code I called this "region", so we should create another python module probably called "region" to handle this

Add option to allow user to specify AND or OR when providing multiple tag filters

unrelated to osmx but I implemented something similar in another project which you may want to look at: https://github.com/hotosm/osm-export-tool-python/blob/master/osm_export_tool/sql.py

Node = namedtuple("Node", ["id", "location", "tags", "metadata"]) Way = namedtuple("Way", ["id", "nodes", "tags", "metadata"])

I would avoid having to mirror the layout of OSM objects in your client code because you need to malloc for every object. Ideally the osmx bindings make it easy to take advantage of zero-copy views of OSM objects - this may mean the API provides functions that take sets of object IDs and uses yield and generator expressions to iterate over them

node_ids.extend(cell_node_index.get(cell_id.id()))

node_ids should generally be a python set because there will be lots of duplicates. And ideally iterating over that set should guarantee ascending order

CloudNiner commented 3 years ago

I updated my prior example to take an s2.S2Region instead of a bbox and wrote the Python code to convert a geojson.Polygon to s2.S2Region. There are other improvements, like the using generators to yield nodes and ways and the Set() fix described above. Here's the updated complete example: https://gist.github.com/CloudNiner/e5552aaae6c235193b8f196558b92e2c

Unfortunately, the parts of this example that might be useful in the OSMExpress python bindings (polygon -> s2region and node_ids_for_region) now utilize the s2 SWIG bindings because S2Polygon isn't implemented in s2sphere. This prevented a PR because the bindings are not directly pip installable. s2-py is a potential option but installs an out of date (from 2018) version of the library. It's also troublesome (no guidance from s2 docs and a few open issues relating to the SWIG bindings) to install the SWIG bindings to anywhere other than the global default python installation. There is an open PR that adds support for building a pip wheel but its been dormant for awhile now.

Regardless, I hope the example is instructive and provides a useful start to those looking to perform similar tasks or jump start a future effort to add some of these operations to the OSMX Python bindings when there's a clear path forward for inclusion of the s2 bindings as a dependency.

bdon commented 3 years ago

I agree that the bindings not being pip installable is an issue, thanks for figuring out the workaround.

Were you able to accomplish the goal of your script in an efficient manner using Python? It's useful to know if investing more in Python packaging is worth it for use cases similar to yours - If you encountered performance issues, it may be better to direct developers towards just using the C++ API

CloudNiner commented 3 years ago

Were you able to accomplish the goal of your script in an efficient manner using Python?

I'd say yes in this case. There's a few different performance bottlenecks a user might run into, but once I had the efficient implementation for the function s2region -> node id set, building a prototype with the python bindings was really quick. The s2sphere version (is pip installable!) of that function wasn't significantly slower than the s2 swig bindings so perhaps an implementation in osmx bindings for LatLngRect and Cap might be warranted and would serve any query use case well.

The search_by_ways code begins to slow as you increase query area or the number of results returned but as far as I can tell, the issue is just volume of operations being performed. I was still able to write 100s of mb of geojson results for areas up to 1000 sq kms returning thousands of ways, it just took 1-2mins. Whether that's satisfactory or not depends on individual use case, for us it was fine since our use case is focused on smaller areas with few results at a time.

bdon commented 3 years ago

https://gist.github.com/CloudNiner/e5552aaae6c235193b8f196558b92e2c#file-osmx_utils-py-L23

This can be made even faster :) as is, on line 23, if you are traversing a cell at a low level, I think you are potentially issuing millions of queries to Index.get - consider the case where there is only 1 record in the database for that low-level cell, but the cell is made of 1 million level16 nodes.

Instead, use the child_begin cell to set the lmdb cursor state and then move the cursor forward. You won't be able to use Index.get, you will need to write another method that manipulates the cursor; the logic and LMDB functions used should be the same as in the C++ code. This will limit the number of LMDB operations to the number of records in the actual cell, 1 record in the example above.

Curious to see if this improves on the slowness of big query areas you mentioned.

bdon commented 8 months ago

I'm closing this issue because of lack of activity.

The main use case for OSMExpress remains getting PBFs out of a minutely-updated planet .osmx. Analytical use cases in Python may be better served by https://docs.geodesk.com/python which looks like an excellent new project.