openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License
3.99k stars, 414 forks

Intersecting Street Support #657

Closed by AidanWelch 2 months ago

AidanWelch commented 3 months ago

Hi, thanks to everyone for their work on this project. I was wondering whether support for intersecting streets would be possible, as in "corner of foo street and bar lane". I had hoped the "near" label would cover that, but it doesn't seem to.

Thanks!

albarrentine commented 3 months ago

Simple intersections are generated randomly from the OSM road network as part of the training data. They are labeled as road (the example given will parse as such), not as a separate tag or tags, mostly because some of the input sources like OSM/OpenAddresses contain intersections as well and simply label them as part of addr:street, etc. Making it a separate tag would involve recoding all the natural language examples which may include intersections in the street name, and it's not always clear where in the world that is the right thing to do, not to mention that recoding natural language inputs is tantamount to the parsing problem itself. We try to stick to the lowest common denominator that retains fidelity to the source data (which is usually what's being matched by the geocoder, etc. anyway), and only append/generate things like apartment numbers/sub-building info when they're clearly absent.

For English it's a relatively simple post-processing task to regex/split out the most common highly-structured variants of this, and there are YAML dictionaries included in the repo that aid in doing so. Some post-parse logic would be required in any case to determine e.g. which street is primary, which is secondary, and whether the address is simply the intersection of two streets or a Manhattan-style address, e.g. "123 45th St corner of 6th Ave". Libpostal will tend to include things like "123 45th St near 6th ave" and "123 45th St a block from 6th Ave" in the street as well, but it's easy to see that trying to tag those quickly starts to devolve into parsing natural language directions, etc., which is beyond the intended scope of this library.
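
A minimal sketch of the English post-processing step described above, in Python. It is an illustration under stated assumptions, not part of libpostal: the connector phrases are hand-picked rather than read from the repo's dictionaries, the names split_road and classify are hypothetical, and the input is assumed to be a list of (value, label) pairs of the kind a binding such as pypostal returns from its parser. It also assumes the first street named is the primary one, and that a Manhattan-style address arrives with the number already labeled house_number.

```python
import re

# Assumption: hand-picked English connector phrases for illustration only.
# They are not taken from libpostal's dictionaries.
LEADING = re.compile(
    r"^(?:at\s+)?(?:the\s+)?(?:corner|intersection)\s+of\s+",
    re.IGNORECASE,
)
JOINER = re.compile(
    r"\s+(?:(?:at\s+)?(?:the\s+)?(?:corner|intersection)\s+of|and|&|at|@)\s+",
    re.IGNORECASE,
)


def split_road(road):
    """Return (primary, secondary) street names, or None if `road`
    does not look like one of the structured intersection patterns."""
    # Drop a leading "corner of" / "at the intersection of" phrase, if any.
    stripped = LEADING.sub("", road.strip())
    # Split once, on the leftmost connector between two street names.
    parts = JOINER.split(stripped, maxsplit=1)
    if len(parts) != 2 or not all(part.strip() for part in parts):
        return None
    # Assumption: the street mentioned first is treated as primary.
    return parts[0].strip(), parts[1].strip()


def classify(parsed):
    """Classify parser output given as a list of (value, label) pairs.

    Returns None when there is no road component or the road does not
    split into two streets; free-form phrases such as "near" or
    "a block from" are deliberately left alone.
    """
    components = {label: value for value, label in parsed}
    road = components.get("road")
    if road is None:
        return None
    streets = split_road(road)
    if streets is None:
        return None
    house_number = components.get("house_number")
    # A house number suggests a street address that mentions a cross
    # street; without one, treat the input as a bare intersection.
    kind = "address_with_cross_street" if house_number else "intersection"
    return {
        "kind": kind,
        "primary": streets[0],
        "secondary": streets[1],
        "house_number": house_number,
    }
```

A street whose own name contains a connector word (for example one with "at" or "and" in it) would be split incorrectly by this sketch, which is one reason such logic is safer as an opt-in step after parsing than as a separate tag.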

The near tag is used for categorical geocoder-style queries such as "book stores in brooklyn ny" and could in theory also be used for landmark-based addresses, e.g. in India, but there's not a ton of training data available on that front.

AidanWelch commented 2 months ago

I see, thank you!