openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Question - any plans for address extraction? #22

Open maccman opened 8 years ago

maccman commented 8 years ago

Are there any plans to implement address extraction? Say, extracting an array of addresses from a large body of text?

albarrentine commented 8 years ago

Excellent question. It's something I've thought a bit about, though it's not on the short-term roadmap. It would be great to have an open source version of what your phone does (extracting addresses from emails, etc.), though I believe most of those implementations are regex-based and hence somewhat limited, probably more than a little US-biased.

The NLP approach would be to treat it as an entity recognition problem and get/generate some labeled sentences, e.g.

We/O are/O located/O at/O 123/I Main/I Street/I ./O

This is using simple I/O encoding (each token is labeled "I" for "inside" an address or "O" for "outside" the address). The current address parser is trained on something similar, although in the parsing case every token is known to be part of an address and our model needs to figure out whether it's part of a street name vs. a country vs. a suburb, etc.

I personally don't know of any such "address entity recognition" data set (if you do, please share!) but have a hunch that one could be constructed. One way to accomplish this would be to insert known addresses into, say, Wikipedia articles that link to venues. Replace the venue link with its address, and it's likely you'll start to see patterns of locatives e.g. "at", "by", etc. which tend to precede mentions of places. Since it's Wikipedia, this could be replicated in basically every language (which we like in libpostal). A structured learning model similar to the one we have for the current address parser could be trained using almost the exact same code, just different labels.
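The data-generation idea above could be sketched roughly like this (the templates and addresses are made-up stand-ins; in practice the templates would come from, say, Wikipedia sentences with venue links, and the addresses from a real gazetteer):

```python
import random

# Hypothetical synthetic-data generator: take a sentence that mentions a
# venue, substitute a known address for the venue, and emit I/O labels.
TEMPLATES = [
    "The conference was held at {} last year .",
    "She works near {} downtown .",
]
ADDRESSES = [
    "123 Main Street",
    "742 Evergreen Terrace",
]

def make_example(rng):
    template = rng.choice(TEMPLATES)
    address = rng.choice(ADDRESSES)
    prefix, suffix = template.split("{}")
    tokens, labels = [], []
    for tok in prefix.split():
        tokens.append(tok); labels.append("O")
    for tok in address.split():
        tokens.append(tok); labels.append("I")
    for tok in suffix.split():
        tokens.append(tok); labels.append("O")
    return tokens, labels

toks, labs = make_example(random.Random(0))
print(list(zip(toks, labs)))
```

Since the substitution point is known, the labels come for free, and the same trick works in any language with Wikipedia coverage.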

If you know of/are using any data sets that might be useful for this task, or if I'm missing some elephant in the proverbial room, let me know.

casa87 commented 8 years ago

I would really love to see this feature :+1:

MaitlandMarshallMEX commented 7 years ago

Me too! Has there been any news on this?

SephVelut commented 7 years ago

datamade/usaddress uses parserator to kind of do this. It can parse the string "yeah I live at 1200 valley lane st I think"

albarrentine commented 7 years ago

@SephVelut not exactly. If you're willing to provide your own training data and write your own feature extraction, sure, parserator (or CRFSuite, on which parserator is based, or any other sequence modeling library/toolkit) can be trained to extract addresses from strings, but to be clear for others reading this in the future, that is not what usaddress does at the time of this writing.

Here's the actual usaddress parse for that string:

[(u'yeah', 'Recipient'),
 (u'I', 'Recipient'),
 (u'live', 'Recipient'),
 (u'at', 'Recipient'),
 (u'1200', 'AddressNumber'),
 (u'valley', 'StreetName'),
 (u'lane', 'StreetNamePostType'),
 (u'st', 'PlaceName'),
 (u'I', 'PlaceName'),
 (u'think', 'PlaceName')]

i.e. it is only able to return the labels on which it was trained. That's fine in the geocoder use case or when trying to separate a combined address field in a database/CSV because all of the text is known to be part of an address and must be labeled one component or the other, but doesn't work for web scraping, bots, etc.

What was being discussed here is a model that is specifically trained to label each word in a sentence with "is this word part of an address or not?" as shown above. Libpostal already has a decent model for the "in" text, but modeling the "out" text (non-address words and more importantly, how they transition into addresses in real text) in a principled way is quite an undertaking, especially implementing it at libpostal's scale with high accuracy in multiple languages such that it could be used on arbitrary websites.
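Once a model produces those per-token I/O labels, the extraction step itself is simple: collect the maximal runs of "I" tokens. A small sketch (the tagged sentence is invented):

```python
# Given (token, label) pairs, return each contiguous run of "I"-labeled
# tokens as one extracted address string.
def extract_addresses(tagged):
    addresses, current = [], []
    for token, label in tagged:
        if label == "I":
            current.append(token)
        elif current:
            addresses.append(" ".join(current))
            current = []
    if current:
        addresses.append(" ".join(current))
    return addresses

tagged = [("We", "O"), ("are", "O"), ("at", "O"),
          ("123", "I"), ("Main", "I"), ("Street", "I"), (".", "O")]
print(extract_addresses(tagged))  # ['123 Main Street']
```

The hard part is everything upstream of this function: producing accurate labels on arbitrary multilingual text.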

It's possible, and I've posited some ways to do it above, so if an NLP researcher wants to take that on, that's great, but as far as me implementing it, a company would need to sponsor the work.