spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License
11.41k stars 654 forks

places().address() #272

Open davidbuhler-zz opened 7 years ago

davidbuhler-zz commented 7 years ago

Entity extraction should include address extraction when the entity is a place.

I believe we can avoid needing a dictionary of street names if we pattern-match on (number)(string)(comma)(optional string or number)(city).
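As a rough illustration of the pattern described above, here is a minimal sketch in plain JavaScript, independent of compromise. The names `ADDRESS_PATTERN` and `findAddressCandidates` are hypothetical, and because no city dictionary is used, the "city" slot is approximated as one or more capitalized words, which will produce false positives on ordinary text.

```javascript
// Hypothetical sketch, not part of compromise.
// Shape: (number)(string)(comma)(optional string or number + comma)(city)
// Assumption: a "city" is one or more capitalized words, since no dictionary is used.
const ADDRESS_PATTERN =
  /\b(\d+)\s+([A-Za-z][A-Za-z .']*?),\s*(?:([A-Za-z0-9#. ]+?),\s*)?([A-Z][a-zA-Z]+(?:\s[A-Z][a-zA-Z]+)*)/g;

function findAddressCandidates(text) {
  // matchAll needs the global flag and yields one match object per candidate
  return Array.from(text.matchAll(ADDRESS_PATTERN), (m) => ({
    number: m[1],
    street: m[2].trim(),
    unit: m[3] ? m[3].trim() : null, // the optional middle segment
    city: m[4],
  }));
}
```

A call such as `findAddressCandidates('Ship to 123 Main Street, Suite 4, Springfield today.')` should return a single candidate with the street, unit, and city split out. Anything it returns is only a candidate and would still need checking against real place data.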

spencermountain commented 7 years ago

hey david, thanks for your help. it's sorta hilarious, I've been working on a new version that supports just this sort of syntax-matchy-stuff. It's not ready yet, but soon this should look like:

nlp(myText).match('#Value #Street? #City #Country?').tag('#Address')

or something like that. I'll add your street designations, and some kinda Street tag, or something like that - any thoughts? cheers.

spencermountain commented 7 years ago

how has the performance for address resolution been so far? it would be cool to bake-in the logic for pulling-out parsed numbers, postal/zip codes, etc..

davidbuhler-zz commented 7 years ago

I think the performance is tricky for all entity extraction, and I can only speculate how it works in CoreNLP and GATE without looking more deeply.

When I looked at GATE and JAPE for Place/Location address extraction, I realized there are a lot of permutations for Address Matching, and GATE really focuses on solving UK address matching.

GATE/JAPE seems to add to the Address if patterns exist, working in order of priority. Address > Object

If (PO Box) > add if City exists
If (Street Number near street suffix) > add if City exists
If (Street Number near street abbreviation) > add if City exists
If (Postal Code) > add if Province/State exists
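One possible reading of the priority rules above, sketched in plain JavaScript. This is not how GATE/JAPE or compromise implement it. The gazetteers `CITIES` and `REGIONS`, the suffix lists, the US-style postal code pattern, and the function name `firstAddressRule` are all assumptions made for illustration.

```javascript
// Hypothetical mini-gazetteers; a real system would need far larger lists.
const CITIES = ['atlanta', 'phoenix'];
const REGIONS = ['georgia', 'arizona'];

// True if any word from the list appears as a whole word in the text.
const mentionsAny = (list, text) =>
  list.some((w) => new RegExp(`\\b${w}\\b`, 'i').test(text));

// Rules are tried in priority order; each only fires if its companion
// (a city or a state/province) is also mentioned somewhere in the text.
const RULES = [
  { name: 'po-box', pattern: /\bP\.?\s?O\.?\s+Box\s+\d+/i, needs: CITIES },
  { name: 'street-suffix', pattern: /\b\d+\s+\w+(?:\s\w+)*?\s(?:Street|Avenue|Road|Boulevard)\b/i, needs: CITIES },
  { name: 'street-abbreviation', pattern: /\b\d+\s+\w+(?:\s\w+)*?\s(?:St|Ave|Rd|Blvd)\b\.?/i, needs: CITIES },
  { name: 'postal-code', pattern: /\b\d{5}(?:-\d{4})?\b/, needs: REGIONS }, // assumes US ZIP format
];

function firstAddressRule(text) {
  for (const rule of RULES) {
    const m = text.match(rule.pattern);
    if (m && mentionsAny(rule.needs, text)) {
      return { rule: rule.name, match: m[0] };
    }
  }
  return null; // no rule fired, or its companion was missing
}
```

Keeping the rules in an ordered array makes the priority explicit and lets country-specific rule sets be swapped in, which fits the point that patterns would have to be driven by country.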

Cities might need to be added as dictionaries for each State/Province mentioned. I think the patterns would have to be driven by Country, which in turn has to be driven by State/Province look-up (since most people leave the country out of context when conversing).

NLPC would need a property for the nearby tokens to perform a lookup. I think the most efficient way to address the problem is to only perform a proximity lookup on strings if a common State/Province is mentioned, but I can't think of how to flag a State/Province as a likely "Place" in a given type of context, which would speed things up quite a bit.

For example, Rule: Only perform State/Province look-up if State/Province is preceded by "at" or "in" or "from" and State/Province is capitalized.
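The example rule could be sketched roughly as follows, in plain JavaScript. The `REGIONS` set is a hypothetical stand-in for a real State/Province dictionary, `findRegionMentions` is an illustrative name, and whitespace tokenization is a simplification that only handles single-word region names.

```javascript
// Hypothetical stand-in for a State/Province dictionary.
const REGIONS = new Set(['georgia', 'arizona', 'ontario', 'texas']);
const TRIGGERS = new Set(['at', 'in', 'from']);

function findRegionMentions(text) {
  const tokens = text.split(/\s+/).filter(Boolean);
  const hits = [];
  // Start at 1: a region needs a preceding trigger word to qualify.
  for (let i = 1; i < tokens.length; i++) {
    const word = tokens[i].replace(/[.,;:!?]+$/, ''); // drop trailing punctuation
    const prev = tokens[i - 1].toLowerCase();
    const isCapitalized = /^[A-Z]/.test(word);
    // Cheap checks first; the dictionary lookup only runs when context fits.
    if (TRIGGERS.has(prev) && isCapitalized && REGIONS.has(word.toLowerCase())) {
      hits.push({ region: word, index: i });
    }
  }
  return hits;
}
```

Only tokens that pass the context check would then trigger the more expensive proximity lookup for a nearby city or street.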

playground commented 6 years ago

@spencermountain when I try "Atlanta" or "Marietta", nlp.debug() does not recognize them as a place. How can I add them to the dictionary?

'Phoenix' - TitleCase, City, Place, Singular, Noun, ProperNoun
'AZ' - Noun, Acronym, Singular, Region, Place, ProperNoun
'atlanta' - Noun, Singular
'georgia' - Region, Place, Singular, Noun, ProperNoun
'marietta' - Noun, Singular

playground commented 6 years ago

nm, got it.

let doc = nlp('Phoenix AZ atlanta georgia marietta', {Atlanta: 'Place', Marietta: 'place'});
doc.debug();

Actually is this the optimal way of doing it?

playground commented 6 years ago

@spencermountain what cities are included in the library? Where can I get that list?

spencermountain commented 6 years ago

@playground please look around before asking. it's pretty easy to find! ./data/words/places/cities.js