stuartemiddleton / geoparsepy

geoparsepy is a Python geoparsing library that extracts and disambiguates locations from text. It uses a local OpenStreetMap database, which allows very high, unlimited geoparsing throughput, unlike approaches that rely on a third-party geocoding service (e.g. the Google Geocoding API). This repository holds Python examples for using the PyPI library.

Reducing false positives #2

Closed: flppgg closed this issue 3 years ago

flppgg commented 3 years ago

This is not really an issue, but I am writing this to start a conversation and ask a few questions. First of all thanks a lot for making this public, it is a very useful tool.

Sometimes geoparsepy returns many false positives, which makes it difficult to isolate the locations we are actually interested in.

For example, take the sentence "They are a ga machine". Just a random sentence, no meaning. "ga" is picked up as a possible location and matched to a number of possible OSM IDs (osmids).

How can we filter out such false positives? I noticed this happens very often with short words of 2-3 letters. A very rough approach would be to filter out all short words that returned a match (see the sketch below), but I am sure something more nuanced is possible.
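A rough sketch of what I mean (hypothetical names, not geoparsepy's API; I am assuming matches come back as (token, list-of-osmids) pairs):

```python
# Hypothetical sketch of the rough length filter described above: drop any
# matched token shorter than 4 characters unless it is explicitly allowed.
# Variable and function names are illustrative, not part of geoparsepy.

MIN_TOKEN_LEN = 4
ALLOWED_SHORT = {'uk', 'usa'}  # short location names we still trust

def filter_short_matches(matches):
    """matches: list of (token, osmid_list) pairs; returns the filtered list."""
    return [
        (token, osmids)
        for token, osmids in matches
        if len(token) >= MIN_TOKEN_LEN or token.lower() in ALLOWED_SHORT
    ]

matches = [('ga', [123, 456]), ('london', [789])]
print(filter_short_matches(matches))  # -> [('london', [789])]
```

Of course this would also throw away genuine short names like 'ga' for Georgia, which is why I suspect something more context-aware is needed.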

Thanks again

stuartemiddleton commented 3 years ago

There are lots of locations in the world with short names that match common terms (e.g. Oklahoma == 'ok'). These are very difficult to match reliably using an entity matching approach, because it requires more information about the context in which the term is mentioned to know whether it is really a location or something else.

In geoparsepy there is a practical workaround: it supports a whitelist and a blacklist, so if you observe location terms being matched in error, or being missed, you can add them to these lists to make sure they are treated appropriately. You are essentially using relevance feedback to screen out errors for future runs.

In your example "They are a ga machine" >> bad match = 'ga'

You can add 'ga' to the blacklist. e.g.

listBlacklist = [ 'ga' ]
dict_geospatial_config['blacklist'].extend( listBlacklist )

If you have phrases you want to always be allowed then add them to the whitelist. e.g.

listWhitelist = [ 'uk' ]
dict_geospatial_config['whitelist'].extend( listWhitelist )

You can print out the default blacklist and whitelist in geoparsepy, so you stay in control of them. e.g.

print( dict_geospatial_config['whitelist'] )
print( dict_geospatial_config['blacklist'] )

Going beyond blacklists and whitelists, you would need to add an additional contextual analysis step, learning linguistic patterns around how location terms like 'ga' or 'uk' should and should not be used. That could be added, but is beyond the scope of geoparsepy. Some of the more recent deep learning NLP approaches using pre-trained word embeddings are perhaps the direction to go if this is to be considered, using the fine-tuned models either as a post-processing geoparse step (to assign confidence to geoparsepy matches) or as an ensemble geoparser (learning which geoparser to trust in different situations).
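As a rough illustration of such a post-processing step (this is not part of geoparsepy; it uses spaCy's pre-trained NER purely as an example of a contextual model), you could keep a geoparsepy candidate only if an off-the-shelf model also tags it as a place in context:

```python
# Sketch of a contextual post-filter using spaCy NER (an assumption, not a
# geoparsepy feature). Requires: pip install spacy
# and: python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load('en_core_web_sm')

def looks_like_location(sent_text, candidate_token):
    """Return True if spaCy's NER tags the candidate as a place in this sentence."""
    doc = nlp(sent_text)
    for ent in doc.ents:
        # GPE = countries/cities/states, LOC = other locations, FAC = facilities
        if ent.label_ in ('GPE', 'LOC', 'FAC') and candidate_token.lower() in ent.text.lower():
            return True
    return False

# 'ga' in "They are a ga machine" is unlikely to be tagged as a place, so that
# candidate would be dropped; 'Georgia' in a location context would be kept.
print(looks_like_location('They are a ga machine', 'ga'))               # expected: False
print(looks_like_location('She moved to Georgia last year', 'Georgia')) # expected: True
```

A fine-tuned transformer would do better than this off-the-shelf model, but the plumbing is the same: the contextual model assigns a keep/drop decision (or a confidence score) to each geoparsepy match after the geoparse step.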