Distribute another file that contains the localized names of everything

bhousel commented 1 year ago

I was chatting with @1ec5 about this issue of localizing the names that we use for presets. It's an issue that currently affects the flags in NSI, but would also affect some of the new categories we are considering adding, like Species (#8324) or Religions (#5960 et al)

In summary, we currently have a Display Name property for each item in NSI, and this is used as the name of the preset that gets displayed in iD or JOSM. These strings exist only in the language that we think the user would be using. We don't offer any localization of these strings.

It would be useful to let users searching for a preset type other names for it, such as the name in their own language or a common alternate name. So we'd need some other source of data for the different names an item could be known by.

Wikidata already provides this to some extent, because labels can be entered in many different languages, and the "also known as" field (aliases) is available too. There are also some properties that track the common names things are known by, like P1843 (taxon common name).
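To make the three sources concrete, here is a minimal sketch of gathering them from a single entity. It assumes the entity has already been fetched as a dict in the shape of Wikidata's entity JSON (`labels`, `aliases`, `claims`). The function name `collect_names` and the sample values are hypothetical; the sample is hand-written for illustration, not copied from Wikidata.

```python
from collections import defaultdict


def collect_names(entity):
    """Gather every known name for one Wikidata entity, grouped by language code."""
    names = defaultdict(list)

    def add(lang, text):
        # Skip empty strings and names already seen for this language.
        if text and text not in names[lang]:
            names[lang].append(text)

    # Labels: at most one preferred name per language.
    for lang, label in entity.get("labels", {}).items():
        add(lang, label["value"])

    # Aliases: the "also known as" entries, zero or more per language.
    for lang, aliases in entity.get("aliases", {}).items():
        for alias in aliases:
            add(lang, alias["value"])

    # P1843 (taxon common name) statements carry monolingual text values.
    for claim in entity.get("claims", {}).get("P1843", []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") == "value":
            value = snak["datavalue"]["value"]
            add(value["language"], value["text"])

    return dict(names)


# Hand-written sample in the shape of Wikidata's entity JSON (illustrative values).
sample_entity = {
    "labels": {
        "en": {"language": "en", "value": "Norway maple"},
        "es": {"language": "es", "value": "arce real"},
    },
    "aliases": {
        "en": [{"language": "en", "value": "Acer platanoides"}],
    },
    "claims": {
        "P1843": [
            {"mainsnak": {"snaktype": "value", "datavalue": {
                "type": "monolingualtext",
                "value": {"text": "Norway maple", "language": "en"}}}},
            {"mainsnak": {"snaktype": "value", "datavalue": {
                "type": "monolingualtext",
                "value": {"text": "érable plane", "language": "fr"}}}},
        ],
    },
}

names_by_language = collect_names(sample_entity)
```

The duplicate English common name is dropped, so each language ends up with a short list of distinct names.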

We haven't tried to tackle localization in NSI yet, but I'm wondering whether we could just gather up all these names and languages in another sidecar file and distribute it alongside the files we already gather - so that consumers that want to be more locale-aware can use this to improve their user experience.
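As a rough illustration of what a locale-aware consumer might do with such a sidecar file, the sketch below matches a search query against both the existing display name and the extra localized names. Everything here is an assumption for illustration: the function name `search_presets`, keying items by Wikidata QID, and the sidecar shape of one list of names per item. Nothing in it reflects how iD or JOSM actually search presets.

```python
def search_presets(query, display_names, localized_names):
    """Return the ids of items whose display name or localized names contain the query.

    display_names:   {item_id: display name already shipped today}
    localized_names: {item_id: [extra names for the user's language]} (hypothetical sidecar)
    """
    needle = query.casefold().strip()
    if not needle:
        return []

    matches = []
    for item_id, display_name in display_names.items():
        # An item without sidecar data is still searchable by its display name.
        candidates = [display_name, *localized_names.get(item_id, [])]
        if any(needle in name.casefold() for name in candidates):
            matches.append(item_id)
    return matches


# Illustrative data only.
display_names = {"Q26745": "Norway Maple", "Q37158": "Starbucks"}
spanish_sidecar = {"Q26745": ["arce real", "arce de Noruega"]}

found = search_presets("arce", display_names, spanish_sidecar)
```

Consumers that ignore the sidecar keep working as before, since the display names are untouched.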

Open question: would we use these gathered names as another source of alternate matchNames? I don't know, maybe.

Would it be only one NSI entry? What would be the preset’s name (since this is the name suggestion index)? If the preset is simply named Acer platanoides, no one but a botanist would find it. If we name it “Norway maple”, then only English speakers would find it, while Spanish speakers in Spain would see English all over the preset list.

Originally posted by @1ec5 in https://github.com/osmlab/name-suggestion-index/issues/8324#issuecomment-1615960848

Some examples:

Starbucks: https://www.wikidata.org/wiki/Q37158

Norway Maple: https://www.wikidata.org/wiki/Q26745

1ec5 commented 1 year ago

For context, relying on Wikidata labels and properties would be somewhat unconventional for an OSM-related software project compared to the more common approach of soliciting project-specific translations on a system like Transifex or Translatewiki.net. But there is some prior art, such as the highway shield legend in ZeLonewolf/openstreetmap-americana#632.

For NSI, the biggest advantage to relying on Wikidata would be reducing what would otherwise be a very significant burden on volunteer translators. Besides, most of these translations would go to waste, never seen by anyone. Moreover, Wikidata items are supposed to correspond one-for-one with NSI entries, so we’re leaving a lot of valid translations on the table at the moment. (Sometimes they don’t correspond one-to-one, but that’s a bigger problem that these labels would surface, justifiably in my opinion.)

One thing to watch out for is that Wikidata has a different naming convention for labels than we do for presets. For example, Wikidata expects labels to be capitalized only when necessary, so that a data consumer can insert “smoke tree” in a sentence instead of a more jarring “Smoke tree”. By contrast, in the default American English localization, we currently prefer title case: openstreetmap/id-tagging-schema#473. (Some other languages like French and Spanish prefer sentence case.) NSI will need to recase the labels itself to keep people from seeing the wrong case and annoying the Wikidata community with “tagging for the editor” edits, as the Americana project initially did after landing its Wikidata-powered legend.
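A minimal sketch of that recasing step might look like the following. The set of title-case locales is an assumption (only the American English convention comes from the discussion above), `recase_label` is a hypothetical name, and the title casing is deliberately naive: it capitalizes every word and does not keep short words such as "of" in lowercase.

```python
# Assumption: which locales want title case is a project decision.
TITLE_CASE_LOCALES = {"en", "en-US"}


def _capitalize_first(text):
    # Uppercase only the first character so proper nouns later in the label survive.
    return text[:1].upper() + text[1:]


def recase_label(label, locale):
    """Recase a Wikidata-style label ("smoke tree") for display as a preset name."""
    if locale in TITLE_CASE_LOCALES:
        return " ".join(_capitalize_first(word) for word in label.split(" "))
    # Sentence case for everything else, e.g. French and Spanish.
    return _capitalize_first(label)


english = recase_label("smoke tree", "en-US")      # "Smoke Tree"
french = recase_label("arbre à perruque", "fr")    # "Arbre à perruque"
```

Because the recasing happens at build time, the labels on Wikidata can stay in their own convention.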

LaoshuBaby commented 1 year ago

Would that mean we need to maintain a list of "what languages are commonly used in what countries/regions"? Would it introduce too many breaking changes?

1ec5 commented 1 year ago

I’m not sure why such a list would be necessary. The build script would pull in all the labels that Wikidata has for a given operator or flag’s item, then produce a separate sidecar file for each language. It would be up to the client to choose the file appropriate to the language, similar to how interface localization works today.
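As a sketch of that build step, assuming the labels have already been pulled from Wikidata into a dict keyed by item: the code below inverts it into one mapping per language and writes one JSON file for each. The function names, the `names.<lang>.json` file naming, and keying by QID are all hypothetical, and the sample labels are illustrative.

```python
import json
from pathlib import Path


def group_by_language(labels_by_item):
    """Invert {item_id: {lang: label}} into {lang: {item_id: label}}."""
    by_language = {}
    for item_id, labels in labels_by_item.items():
        for lang, label in labels.items():
            by_language.setdefault(lang, {})[item_id] = label
    return by_language


def write_sidecars(labels_by_item, out_dir):
    """Write one sidecar file per language and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for lang, names in sorted(group_by_language(labels_by_item).items()):
        path = out_dir / f"names.{lang}.json"  # hypothetical file naming
        path.write_text(
            json.dumps(names, ensure_ascii=False, indent=2, sort_keys=True),
            encoding="utf-8",
        )
        written.append(path)
    return written


# Illustrative input only.
labels_by_item = {
    "Q37158": {"en": "Starbucks", "ja": "スターバックス"},
    "Q26745": {"en": "Norway maple"},
}
grouped = group_by_language(labels_by_item)
```

A client would then load only the file for its interface language, so no list of languages per country is needed on the NSI side.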

osmlab / name-suggestion-index

Distribute another file that contains the localized names of everything #8381