stuartemiddleton / geoparsepy

geoparsepy is a Python geoparsing library that extracts and disambiguates locations from text. It uses a local OpenStreetMap database, which allows very high, unlimited geoparsing throughput, unlike approaches that use a third-party geocoding service (e.g. the Google Geocoding API). This repository holds Python examples for using the PyPI library.

Running Geoparsepy with other languages #4

Closed datavistics closed 3 years ago

datavistics commented 3 years ago

Currently the following languages are supported:

English, French, German, Italian, Portuguese, Russian, Ukrainian. All other languages will work, but there will be no language-specific token expansion available.

I've followed the instructions and got geoparsepy working with the example.

I tried adding a sentence to your listText: u'Hola, vivo en Madrid España' ("Hi, I live in Madrid, Spain"), but it's not finding anything. The location "Madrid España" should be pretty easy to find, as it's a direct lookup.

Do you have any advice on how to approach handling other languages?

stuartemiddleton commented 3 years ago

The problem with Madrid is actually not the language or geoparsepy, but the entry in OpenStreetMap itself.

In OpenStreetMap the city of Madrid is defined both as (a) admin level 6 "Comunidad de Madrid" and (b) admin level 8 "Madrid". The entry at admin level 6 does not have an alternative name 'Madrid' in its OpenStreetMap entry, so it will not match. The database dump for global_cities only goes down to admin level 6 (to keep the size reasonable).
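The interaction between the admin-level cutoff and the name lists can be illustrated with a toy sketch (hypothetical data and ids, not the geoparsepy API): a dump cut at admin level 6 only retains the level-6 relation, whose name variants do not include the bare token 'madrid', so a lookup on 'madrid' finds nothing.

```python
# Hypothetical entries standing in for the two OSM relations; the ids are
# illustrative placeholders, not real OSM ids.
osm_entries = [
    {"osm_id": -111, "admin_level": 6, "names": ["comunidad de madrid"]},
    {"osm_id": -222, "admin_level": 8, "names": ["madrid"]},
]

def lookup(token, entries, max_admin_level=6):
    """Match a token against name variants of entries within the dump's admin cutoff."""
    return [e["osm_id"] for e in entries
            if e["admin_level"] <= max_admin_level and token in e["names"]]

print(lookup("madrid", osm_entries))                      # cutoff at level 6: no match
print(lookup("madrid", osm_entries, max_admin_level=8))   # down to level 8: city found
```

A focus area built down to admin level 8 corresponds to raising the cutoff in this sketch, which is why it captures the city entry.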

see https://www.openstreetmap.org/relation/5326784 https://www.openstreetmap.org/relation/6426653

You can of course make a new focus area for Spain, down to admin level 8, which should then capture Madrid the city. Then you should be able to match your sentence OK.

datavistics commented 3 years ago

Thanks so much for your reply and for checking that, @stuartemiddleton.

"The database dump for global_cities is only down to admin level 6 (to keep the size reasonable)"

I'm a bit surprised a world capital didn't show up, but no worries. I think your design decision is a good one. And thanks for explaining the admin levels.

What I'm most interested in is name:es and the other language name tags. When I tried the list below, which should be an exact match from here: https://www.openstreetmap.org/relation/61320, I didn't get the match back. Is this solvable?

[u'hola Nueva York, EE. UU. su factura de Bassett llamando', u'Nueva York']

Text = hola Nueva York, EE. UU. su factura de Bassett llamando
Location [index 119105 osmid (-134353,) @ 2 : 2] = york
Location [index 758448 osmid (153595296,) @ 2 : 2] = york
Location [index 758451 osmid (153968758,) @ 2 : 2] = york
Location [index 106968 osmid (-1425436,) @ 2 : 2] = york
Location [index 758452 osmid (158656063,) @ 2 : 2] = york
Location [index 758449 osmid (153924230,) @ 2 : 2] = york
Location [index 758447 osmid (153473841,) @ 2 : 2] = york
Location [index 758446 osmid (151672942,) @ 2 : 2] = york
Location [index 758455 osmid (316990182,) @ 2 : 2] = york
Location [index 758445 osmid (151651405,) @ 2 : 2] = york
Location [index 800785 osmid (20913294,) @ 2 : 2] = york
Location [index 758444 osmid (151528825,) @ 2 : 2] = york
Location [index 137743 osmid (-79510,) @ 4 : 4] = ee
Location [index 73914 osmid (-2750833,) @ 8 : 8] = su
Location [index 110803 osmid (-1070414,) @ 8 : 8] = su
Location [index 28168 osmid (-6390707,) @ 8 : 8] = su
Location [index 79839 osmid (-2390843,) @ 8 : 8] = su
Location [index 133309 osmid (-162110,) @ 10 : 10] = de
Location [index 140297 osmid (-51477,) @ 10 : 10] = de
Location [index 455946 osmid (253067120,) @ 11 : 11] = bassett
Location [index 705546 osmid (151840681,) @ 11 : 11] = bassett
Location [index 705545 osmid (151463868,) @ 11 : 11] = bassett
Disambiguated Location [index 0 osmid (-1425436,) @ 2 : 2] = York County;York : http://www.openstreetmap.org/relation/1425436
Disambiguated Location [index 2 osmid (253067120,) @ 11 : 11] = : http://www.openstreetmap.org/node/253067120

Text = Nueva York
Location [index 119105 osmid (-134353,) @ 1 : 1] = york
Location [index 758448 osmid (153595296,) @ 1 : 1] = york
Location [index 758451 osmid (153968758,) @ 1 : 1] = york
Location [index 106968 osmid (-1425436,) @ 1 : 1] = york
Location [index 758452 osmid (158656063,) @ 1 : 1] = york
Location [index 758449 osmid (153924230,) @ 1 : 1] = york
Location [index 758447 osmid (153473841,) @ 1 : 1] = york
Location [index 758446 osmid (151672942,) @ 1 : 1] = york
Location [index 758455 osmid (316990182,) @ 1 : 1] = york
Location [index 758445 osmid (151651405,) @ 1 : 1] = york
Location [index 800785 osmid (20913294,) @ 1 : 1] = york
Location [index 758444 osmid (151528825,) @ 1 : 1] = york
Disambiguated Location [index 0 osmid (-1425436,) @ 1 : 1] = York County;York : http://www.openstreetmap.org/relation/1425436

stuartemiddleton commented 3 years ago

Have you added 'es' in the 'lang_codes' of the config?

Only 'en' is used by default in the example. The idea is you specify in the config the expected language set used by your text corpus, to avoid false matches of translations for location names in languages that are very unlikely.

I checked and the global_cities database dump does have 'name:es' = 'Nueva York' so it should be picked up.

When global_cities is loaded the following OSM name variations are loaded: name:*; alt name:*; old name:*
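A small sketch (hypothetical tag data, not geoparsepy internals) of how the lang_codes setting limits which OSM name translations get indexed: only name:&lt;code&gt; tags whose code is in the configured set are added alongside the default and alternative names.

```python
# Hypothetical OSM tags for the New York City relation (illustrative only).
osm_tags = {
    "name": "New York",
    "name:es": "Nueva York",
    "name:ru": "Нью-Йорк",
    "alt_name": "NYC",
}

def indexed_names(tags, lang_codes):
    """Collect default/alt names plus translations for the configured languages."""
    names = [v for k, v in tags.items() if k in ("name", "alt_name", "old_name")]
    names += [v for k, v in tags.items()
              if k.startswith("name:") and k.split(":", 1)[1] in lang_codes]
    return names

print(indexed_names(osm_tags, ["en"]))        # 'Nueva York' is not indexed
print(indexed_names(osm_tags, ["en", "es"]))  # the Spanish translation is added
```

This is why restricting lang_codes to the languages actually present in your corpus avoids false matches against translations in unlikely languages.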

Try:

    dictGeospatialConfig = geoparsepy.geo_parse_lib.get_geoparse_config(
        lang_codes = ['en','es'],
        logger = logger,
        whitespace = '"\u201a\u201b\u201c\u201d()',
        sent_token_seps = ['\n', '\r\n', '\f', '\u2026'],
        punctuation = """,;\/:+-#~&*=!?""",
    )

You can also create language-specific stoplist, whitelist and blacklist resources for each language if you wish. See the folder geoparsepy is installed in for examples for 'en', 'ru', etc.

On a Win10 install this would be C:\Program Files\Python3\Lib\site-packages\geoparsepy*

Just copy files like corpus-buildingtype-en.txt to make a Spanish version, corpus-buildingtype-es.txt, and it will be picked up automatically when get_geoparse_config() is called. This allows manual customization to remove false positives.
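The copy step above can be scripted; a minimal sketch (the install path is an assumption from the Win10 example, so locate your own site-packages directory):

```python
import pathlib
import shutil

def clone_resource(pkg_dir, basename, src_lang, dst_lang):
    """Copy e.g. corpus-buildingtype-en.txt to corpus-buildingtype-es.txt
    as a starting point for hand translation; does nothing if the source
    is missing or the destination already exists."""
    src = pathlib.Path(pkg_dir) / f"{basename}-{src_lang}.txt"
    dst = pathlib.Path(pkg_dir) / f"{basename}-{dst_lang}.txt"
    if src.exists() and not dst.exists():
        shutil.copy(src, dst)  # then translate/edit the entries by hand
    return dst

# assumed install path from the Win10 example above
clone_resource("C:/Program Files/Python3/Lib/site-packages/geoparsepy",
               "corpus-buildingtype", "en", "es")
```

On the next call to get_geoparse_config() the new -es resource file should be picked up alongside the shipped ones.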

datavistics commented 3 years ago

Have you added 'es' in the 'lang_codes' of the config?

I did not... sorry for missing something quite basic.

  1. It looks like this is used to generate the cached_locations and the indexed_locations. At run time I will want to be able to choose the language dynamically. Is it possible to cache all languages, but specify the language when I query? In this case I would load both en and es, but only search in es when I know the language?

  2. Also, I have an NER that works really well. I was planning to just send in a list of pure entities as an input. Is that compatible with your library? Is there a better approach?

stuartemiddleton commented 3 years ago

The call to cache_preprocessed_locations() is where the index of location names is created, and this is where you need the lang codes you want to match. This will run a slow SQL query to make the location table, but it is something you can do at runtime.

You can (offline) make multiple language-specific versions of cached_locations and serialize them (e.g. Python pickle). Then simply load the one you need when you need it (serialize to disk if too large in RAM to cache them all).
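The serialization idea above is plain pickling of whatever structure cache_preprocessed_locations() returns; a minimal sketch where the cache contents are hypothetical stand-ins, not real geoparsepy objects:

```python
import pickle

def save_cache(path, cached_locations):
    # offline: write one cache file per language set
    with open(path, "wb") as f:
        pickle.dump(cached_locations, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_cache(path):
    # runtime: load only the cache for the incoming text's language
    with open(path, "rb") as f:
        return pickle.load(f)

# illustrative cache contents (name token -> osmid tuples, ids hypothetical)
save_cache("cache-en.pickle", {"york": [(-1425436,)]})
save_cache("cache-en-es.pickle", {"york": [(-1425436,)], "nueva york": [(-61320,)]})

cache = load_cache("cache-en-es.pickle")
print("nueva york" in cache)
```

Loading a prebuilt pickle avoids re-running the slow SQL query at startup, at the cost of disk space per language set.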

NER (e.g. Stanza) will have been trained with a gazetteer such as GeoNames as well as POS tags. It will learn patterns to detect LOC named entities. geoparsepy does named entity matching (against cached lists of location name variants); NER does named entity recognition (against linguistic patterns that often represent locations). Read my paper for a discussion of both approaches and their strengths and weaknesses. NER models will not disambiguate the locations to OSM entities (so it's just plain text, e.g. 'New York', not 'New York, USA'). You could run both approaches on a sentence and maybe use one to add confidence to the other - but you should read papers from the geoparsing / location extraction literature to get a feel for better approaches and their strengths/weaknesses.

datavistics commented 3 years ago

You can (offline) make multiple language-specific versions of cached_locations and serialize them (e.g. Python pickle). Then simply load the one you need when you need it (serialize to disk if too large in RAM to cache them all).

I ended up making one big cache. It wasn't too big - 2.4 GB - and it let me handle all ~270 languages. I didn't need to separate it by language, and it seemed to work pretty well.

Thanks for the explanation about your work and the approaches. That and your 2018 paper helped me understand this problem a bit better.