stuartemiddleton / geoparsepy

geoparsepy is a Python geoparsing library that extracts and disambiguates locations from text. It uses a local OpenStreetMap database, which allows very high, unlimited geoparsing throughput, unlike approaches that rely on a third-party geocoding service (e.g. the Google Geocoding API). This repository holds Python examples for using the PyPI library.

Reducing false positives #2

Closed: flppgg closed this issue 3 years ago

flppgg commented 3 years ago

This is not really an issue, but I am writing this to start a conversation and ask a few questions. First of all thanks a lot for making this public, it is a very useful tool.

Sometimes geoparsepy returns many false positives, which makes it difficult to isolate the locations we are actually interested in.

For example, take the sentence "They are a ga machine". Just a random sentence, no meaning. "ga" is picked up as a possible location and matched to a number of possible OSM IDs (osmids).

How can we filter out such false positives? I noticed this happens very often with short words of 2-3 letters. A very rough approach would be to filter out all short words that returned a match (see the sketch below), but I am sure something more nuanced is possible.
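A rough sketch of what I mean (hypothetical names, not geoparsepy's API; I am assuming matches come back as (token, list-of-osmids) pairs):

```python
# Hypothetical sketch of the rough length filter described above: drop any
# matched token shorter than 4 characters unless it is explicitly allowed.
# Variable and function names are illustrative, not part of geoparsepy.

MIN_TOKEN_LEN = 4
ALLOWED_SHORT = {'uk', 'usa'}  # short location names we still trust

def filter_short_matches(matches):
    """matches: list of (token, osmid_list) pairs; returns the filtered list."""
    return [
        (token, osmids)
        for token, osmids in matches
        if len(token) >= MIN_TOKEN_LEN or token.lower() in ALLOWED_SHORT
    ]

matches = [('ga', [123, 456]), ('london', [789])]
print(filter_short_matches(matches))  # -> [('london', [789])]
```

Of course this would also throw away genuine short names like 'ga' for Georgia, which is why I suspect something more context-aware is needed.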

Thanks again

stuartemiddleton commented 3 years ago

There are lots of locations in the world with short names that match common terms (e.g. Oklahoma == 'ok'). These are very difficult to match reliably using an entity matching approach, because it requires more information about the context in which the term is mentioned to know whether it is really a location or something else.

In geoparsepy there is a practical workaround: it supports a whitelist and a blacklist, so if you observe location terms being matched in error, or being missed, you can add them to these lists to make sure they are treated appropriately. You are essentially using relevance feedback to screen out errors for future runs.

In your example "They are a ga machine" >> bad match = 'ga'

You can add 'ga' to the blacklist. e.g.

listBlacklist = [ 'ga' ]
dict_geospatial_config['blacklist'].extend( listBlacklist )

If you have phrases you want to always be allowed then add them to the whitelist. e.g.

listWhitelist = [ 'uk' ]
dict_geospatial_config['whitelist'].extend( listWhitelist )

You can print out the default blacklist and whitelist in geoparsepy, so you stay in control of them. e.g.

print( dict_geospatial_config['whitelist'] )
print( dict_geospatial_config['blacklist'] )

Going beyond blacklists and whitelists, you would need to add an additional contextual analysis step, learning linguistic patterns around how location terms like 'ga' or 'uk' should and should not be used. That could be added, but is beyond the scope of geoparsepy. Some of the more recent deep learning NLP approaches using pre-trained word embeddings are perhaps the direction to go if this is to be considered, using the fine-tuned models either as a post-processing geoparse step (to assign confidence to geoparsepy matches) or as an ensemble geoparser (learning which geoparser to trust in different situations).
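As a rough illustration of such a post-processing step (this is not part of geoparsepy; it uses spaCy's pre-trained NER purely as an example of a contextual model), you could keep a geoparsepy candidate only if an off-the-shelf model also tags it as a place in context:

```python
# Sketch of a contextual post-filter using spaCy NER (an assumption, not a
# geoparsepy feature). Requires: pip install spacy
# and: python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load('en_core_web_sm')

def looks_like_location(sent_text, candidate_token):
    """Return True if spaCy's NER tags the candidate as a place in this sentence."""
    doc = nlp(sent_text)
    for ent in doc.ents:
        # GPE = countries/cities/states, LOC = other locations, FAC = facilities
        if ent.label_ in ('GPE', 'LOC', 'FAC') and candidate_token.lower() in ent.text.lower():
            return True
    return False

# 'ga' in "They are a ga machine" is unlikely to be tagged as a place, so that
# candidate would be dropped; 'Georgia' in a location context would be kept.
print(looks_like_location('They are a ga machine', 'ga'))               # expected: False
print(looks_like_location('She moved to Georgia last year', 'Georgia')) # expected: True
```

A fine-tuned transformer would do better than this off-the-shelf model, but the plumbing is the same: the contextual model assigns a keep/drop decision (or a confidence score) to each geoparsepy match after the geoparse step.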