Localization of name suggestion

Currently the name suggestion index compiles global frequency counts of POIs for use in iD for suggestions / presets.

In reality, brand names and chain stores are not evenly distributed and differ across countries. Many stores (and not least the usage of their localised names) are geographically very confined.

To enhance the relevancy of the suggestions, it might be useful to partition the counts by country and provide suggestions based on the occurrence of the POIs in the country being edited.

As a first step on the backend, I have attempted a new branch for the project at my repo - https://github.com/hlaw/name-suggestion-index/tree/countrycode

Format changes

The branch revises the project to add a new country code level at the top of the hierarchy in name-suggestions.json. The JSON format under each country is the same as the current global file. The threshold for generating topNames.json is lowered from 50 to 5 so that names from smaller / less well mapped countries show up.

Changes made

In my setup the original getRaw.js could not finish processing the Asia extract and was killed after consuming several GB of memory, and I could not get it to work under node. I have therefore rewritten it in C++, calling libosmium directly (the same backend as osmium-node). Besides counts, coordinates for each POI are saved for further processing.

In build.js, the process now checks the country code from the coordinates using https://github.com/hlaw/codegrid-js. It then counts the POIs by country.
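
The per-country counting and thresholding described above can be sketched roughly as follows. This is a simplified illustration, not the code in the branch: `lookupCountry` is a hypothetical stand-in for the coordinate-to-country lookup (it uses two crude bounding boxes instead of a real country grid), and the record shape `{ key, value, name, lat, lon }` and the sample POIs are assumptions.

```javascript
// Hypothetical stand-in for the coordinate-to-country lookup.
// A real build would query a country grid; two bounding boxes suffice here.
function lookupCountry(lat, lon) {
  if (lat > 30 && lat < 46 && lon > 129 && lon < 146) return 'jp';
  if (lat > 21 && lat < 26 && lon > 119 && lon < 123) return 'tw';
  return 'unknown';
}

// Count POIs per country, keyed as "key/value|name" under each country code.
function countByCountry(pois) {
  const counts = {};
  for (const poi of pois) {
    const cc = lookupCountry(poi.lat, poi.lon);
    const id = `${poi.key}/${poi.value}|${poi.name}`;
    counts[cc] = counts[cc] || {};
    counts[cc][id] = (counts[cc][id] || 0) + 1;
  }
  return counts;
}

// Keep only names that reach the threshold within a country.
function applyThreshold(counts, threshold) {
  const out = {};
  for (const [cc, names] of Object.entries(counts)) {
    for (const [id, n] of Object.entries(names)) {
      if (n >= threshold) {
        out[cc] = out[cc] || {};
        out[cc][id] = n;
      }
    }
  }
  return out;
}

// Made-up sample records for illustration only.
const samplePois = [
  { key: 'amenity', value: 'cafe', name: 'スターバックス', lat: 35.68, lon: 139.76 },
  { key: 'amenity', value: 'cafe', name: 'スターバックス', lat: 34.69, lon: 135.50 },
  { key: 'amenity', value: 'cafe', name: '星巴克', lat: 25.03, lon: 121.56 },
];
const sampleCounts = countByCountry(samplePois);
console.log(JSON.stringify(applyThreshold(sampleCounts, 2)));
```

Because the threshold is applied per country, a name that is rare globally can still appear in the country where it is common.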

Sample data

The branch contains demo data based on a recent pbf extract of Asia (with 315M nodes / 10M ways). I have not downloaded the planet to test, but I would guess that the files would be 8-10 times the current size when run on the planet.

To use data from the branch, iD would need to be able to load presets / suggestions dynamically when a user moves to a different country. This would probably require a set of country specific preset files to be built before deployment. For most users this should result in smaller download size and more relevant results in suggestions. I will try to explore how this could be done in iD.

Meanwhile, as the change would break iD now, this is just posted for review. Thank you.

This is very interesting work. I'm looking into it now, will get back to you within a few days. Love your initiative.

One thing that was important for name-suggestions that I don't see in place here is canonicalization. The idea is that there are many values with slightly different names for the same place, and we keep a list of the 'correct' names (canonical.json).

For example, in your fork: [screenshot, 2014-06-23]

but on osm.org: [screenshot, 2014-06-23]

This is because we have canonical rules in place to merge all the listed names to "7-Eleven", which is the most popular version in OSM. The threshold of 50 made it possible for me to only have to look at a few hundred names and merge them, rather than many thousands.

Have you thought about how to bring canonicalization over as well? Would that be organized on a per-country basis or for the entire list? Do you have any ideas for making canonicalization more automatic? Right now it is entirely a manual process, and any smart logic to try and pick 'correct' names might help out here.

Do you know how large localized suggestions would be all together (filesize)? I see that they are pulled in as needed depending on where you browse, which is great, but the package size bloat is something to consider early on as well.

Thanks for all this great work, localization has been an area the name-suggestions have been lacking and I'm glad you're focusing on it.

Thank you for reviewing my work.

Yes, you are absolutely right that there is a need to work on canonical rules before putting this into use, now that I have lowered the threshold to 5. I have now downloaded and processed the planet file and the resulting problem is even worse. I am thinking of using edit distance or substring search to find groups of similar names automatically, manually scanning over them to see if any of these should not be grouped together, and picking the one with the highest frequency (perhaps semi-automatically). Hopefully this is still manageable work.

For non-Latin languages, while I can read Chinese and Japanese names to sort out duplicates (with some local knowledge), it may be better to leave alone other languages that I do not even recognise (the duplication situation there might not be as serious). Even for Latin names, in many cases without local knowledge I may not be sure whether two names should remain distinct (the same chain store may slightly vary its name to advertise a different service, or is KFC/Taco Bell the same as Taco Bell/KFC?), so perhaps it is better to err on the conservative side and not merge those except for obvious ones (case variation, space/hyphen, misspelling etc).
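
A rough sketch of the edit-distance idea, to make it concrete. It is only one possible approach: names are normalised (lower case, spaces and hyphens removed), then greedily attached to the most frequent name within a small Levenshtein distance. The function names, the distance limit of 1 and the sample counts are all made up for illustration, and the proposed groups would still need a manual review.

```javascript
// Classic dynamic-programming Levenshtein distance (single-row variant).
// Compares UTF-16 code units, which is good enough for a sketch.
function editDistance(a, b) {
  const prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = prev[0]; // value from the previous row, previous column
    prev[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = prev[j];
      prev[j] = Math.min(
        prev[j] + 1,                            // deletion
        prev[j - 1] + 1,                        // insertion
        diag + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
      diag = tmp;
    }
  }
  return prev[b.length];
}

// Ignore case, whitespace and hyphens when comparing names.
function normalize(name) {
  return name.toLowerCase().replace(/[\s-]+/g, '');
}

// Visit names from most to least frequent; attach each one to the first group
// whose leading (most frequent) name is within maxDistance, else start a group.
function proposeGroups(counts, maxDistance) {
  const names = Object.keys(counts).sort((a, b) => counts[b] - counts[a]);
  const groups = [];
  for (const name of names) {
    const group = groups.find(
      g => editDistance(normalize(g.canonical), normalize(name)) <= maxDistance
    );
    if (group) group.variants.push(name);
    else groups.push({ canonical: name, variants: [] });
  }
  return groups;
}

// Invented counts, for illustration only.
const sampleNameCounts = {
  '7-Eleven': 900, '7 Eleven': 40, '7-eleven': 25, '7-Elevan': 6, 'Lawson': 300,
};
const proposed = proposeGroups(sampleNameCounts, 1);
console.log(JSON.stringify(proposed, null, 2));
```

A small distance limit keeps the grouping conservative; short names in particular can be within one edit of an unrelated brand, which is why the manual scan remains necessary.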

I think a global canonicalization list applied to each country is largely ok. Global chain stores mostly use the same brand name across countries. Yet there may be some specific cases that need per-country treatment, e.g.

- Say in Japan, instead of "7-Eleven", it is a local convention to use the Japanese name. This would need a country-specific rule under the global one to use the Japanese name for the suggestions there.
- Another case is where, for example, a certain country has a distinct chain store called Seven-Eleven. Merging with 7-Eleven should be disabled only for that country.

So the format of the current canonical rules would probably need to be extended to override global behaviour for specific countries.
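
To illustrate what such an extension might look like, here is a minimal sketch. The rule format is hypothetical and is not the existing canonical.json schema: the `matches`, `countries`, `canonical` and `exclude` keys are assumptions, and `xx` is a placeholder for a country with a distinct Seven-Eleven chain.

```javascript
// Hypothetical extended rule format: a global list of variants plus optional
// per-country overrides. Not the actual canonical.json schema.
const rules = {
  '7-Eleven': {
    matches: ['7-11', '7 Eleven', 'Seven-Eleven', 'セブン-イレブン'],
    countries: {
      jp: { canonical: 'セブン-イレブン' }, // prefer the local name in Japan
      xx: { exclude: ['Seven-Eleven'] },    // placeholder country: do not merge
    },
  },
};

// Resolve a mapped name to the name that should be suggested in a country.
function canonicalize(name, countryCode, rules) {
  for (const [globalName, rule] of Object.entries(rules)) {
    const override = (rule.countries || {})[countryCode] || {};
    const excluded = override.exclude || [];
    const known = [globalName, ...rule.matches];
    if (known.includes(name) && !excluded.includes(name)) {
      return override.canonical || globalName;
    }
  }
  return name; // no rule applies; keep the name as mapped
}

console.log(canonicalize('7-11', 'us', rules));         // global rule
console.log(canonicalize('7-Eleven', 'jp', rules));     // local name preferred
console.log(canonicalize('Seven-Eleven', 'xx', rules)); // merge disabled
```

Keeping the overrides nested under the global rule means most brands need no country entries at all, and only the exceptions carry extra data.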

Some statistics now that the planet file is processed (global threshold = 5):

Total unique key/value/name combinations occurring >=5 times (before applying canonical rules): 28066
Size of name-suggestions.json (after applying the current canonical rules): 5.7M
Total size of country specific files when converted into the preset format under iD: 34.6M
Sizes of these suggestion presets for individual countries:
- 3.7M for Germany (133k after gzip)
- 2.1M for US (78k)
- 1.8M for UK (68k)
- 1.6M for Russia
- 1.3M for France
- 1M for Italy
- <1M for the rest (most are at 200-300K)
Processing time on planet (pbf): 58 minutes wall-clock time, AMD Phenom(tm) II X6 1075T, 8G RAM

The iD build process now expands the suggestions to include "parent" tag information (hence the expansion of the file size from 5.7M to 34.6M). If this "inflation" is done by the client (most of the fields are only necessary when displaying the suggestion after searching), the raw size could be made much smaller (perhaps 500k for Germany before gzip). Bearing in mind the savings from not loading the current global preset upfront (which is also several hundred K in file size), I think there is not much impact on the user experience / server bandwidth. A final resort is to raise the threshold for countries with "too many" names.
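
As a sketch of what client-side "inflation" could mean: the server ships only the compact `key/value|name` id and count, and the client fills in the rest from the parent preset when a suggestion is actually shown. The preset shape below (`icon`, `geometry`, `fields`) and the `inflate` function are assumptions for illustration, not the actual iD preset format or build code.

```javascript
// Hypothetical parent presets, keyed by "key/value".
const parentPresets = {
  'amenity/fast_food': {
    icon: 'fast-food',
    geometry: ['point', 'area'],
    fields: ['cuisine', 'opening_hours'],
  },
};

// Expand a compact "key/value|name" id into a full preset-like object on demand.
// Assumes the name itself contains no "|" character.
function inflate(id, count, parents) {
  const [presetId, name] = id.split('|');
  const [key, value] = presetId.split('/');
  const parent = parents[presetId] || {};
  return {
    ...parent,                   // inherited display information
    name,
    count,
    tags: { [key]: value, name } // tags applied when the suggestion is chosen
  };
}

console.log(inflate('amenity/fast_food|KFC', 500, parentPresets));
```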

Separately, with more local information at the backend, I am working to improve the search in iD to display more relevant suggestions that the user might want. Besides tweaking search and sorting, one valuable resource is in fact the canonical rules. Say even if Kentucky Fried Chicken is canonicalized to KFC, people typing "Kentucky" are still likely to want KFC. So KFC should probably be displayed when the query matches one of its canonical variants. Similarly, the name in another language should be recognized. The concept may be similar to the "terms" keyword field for the presets. In my working branch for name-suggestion-index I have therefore included the non-trivial canonical names (e.g. not just upper/lower case conversion) in a new field in the suggestion JSON file for use by iD. I will tackle the canonical rules to sort out the duplicates above after working on the search in iD.
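
A minimal sketch of the matching idea, assuming each suggestion carries its non-trivial variants in a field called `aliases` (the field name, the entries and the counts below are invented for illustration; this is not the iD search code):

```javascript
// Hypothetical suggestion entries; "aliases" stands in for the new field that
// carries non-trivial canonical variants. Counts are invented.
const suggestions = [
  { name: 'KFC', aliases: ['Kentucky Fried Chicken'], count: 500 },
  { name: 'Kentucky Bar', aliases: [], count: 7 },
  { name: "McDonald's", aliases: ['Mc Donalds'], count: 900 },
];

// Return suggestion names whose name or any alias contains the query
// (case-insensitive), most frequent first.
function searchSuggestions(query, entries) {
  const q = query.trim().toLowerCase();
  if (!q) return [];
  return entries
    .filter(e => [e.name, ...e.aliases].some(t => t.toLowerCase().includes(q)))
    .sort((a, b) => b.count - a.count)
    .map(e => e.name);
}

console.log(searchSuggestions('Kentucky', suggestions));
```

Here typing "Kentucky" surfaces KFC through its alias, ahead of a rarer name that matches literally.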

Hope that this could help further contribute towards user experience and tagging consistency for OSM.

For example, shop/variety_store|Bazar should not be proposed in Poland or for Polish editors, as in Polish "bazar" means "bazaar" or "marketplace" and this brand is not present in Poland.

I've added a countryCodes array property for entries that should only be shown when editing in certain countries.

  "amenity/cafe|スターバックス": {
    "count": 608,
    "countryCodes": ["jp"],
    "tags": {
      "amenity": "cafe",
      "brand": "スターバックス",
      "brand:en": "Starbucks",
      "brand:wikidata": "Q37158",
      "brand:wikipedia": "ja:スターバックス",
      "cuisine": "coffee_shop",
      "name": "スターバックス",
      "name:en": "Starbucks"
    }
  },
  "amenity/cafe|星巴克": {
    "count": 258,
    "countryCodes": ["cn", "tw"],
    "tags": {
      "amenity": "cafe",
      "brand": "星巴克",
      "brand:en": "Starbucks",
      "brand:wikidata": "Q37158",
      "brand:wikipedia": "zh:星巴克",
      "cuisine": "coffee_shop",
      "name": "星巴克",
      "name:en": "Starbucks"
    }
  },
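
For illustration, a consumer of this data might filter entries roughly like this. It is a sketch only: `suggestionsFor` is a hypothetical helper, entries without `countryCodes` are assumed to be valid everywhere, and the unrestricted `Starbucks` entry and its count are invented.

```javascript
// Return the ids of entries that may be suggested in the given country.
// Entries without countryCodes are treated as valid everywhere (an assumption).
function suggestionsFor(countryCode, index) {
  const cc = countryCode.toLowerCase();
  return Object.keys(index).filter(id => {
    const codes = index[id].countryCodes;
    return !codes || codes.includes(cc);
  });
}

const sampleIndex = {
  'amenity/cafe|スターバックス': { count: 608, countryCodes: ['jp'] },
  'amenity/cafe|星巴克': { count: 258, countryCodes: ['cn', 'tw'] },
  'amenity/cafe|Starbucks': { count: 1000 }, // invented entry, no restriction
};

console.log(suggestionsFor('JP', sampleIndex));
console.log(suggestionsFor('pl', sampleIndex));
```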
