swiftlang / swift-foundation

The Foundation project
Apache License 2.0

Swedish å, ä, ö are treated as diacritics #567

Open jhansbo opened 6 months ago

jhansbo commented 6 months ago

The Scandinavian languages and the Finnish language, by contrast, treat the characters with diacritics å, ä, and ö as distinct letters of the alphabet, and sort them after z. Usually ä (a-umlaut) and ö (o-umlaut) [used in Swedish and Finnish] are sorted as equivalent to æ (ash) and ø (o-slash) [used in Danish and Norwegian]. Also, aa, when used as an alternative spelling to å, is sorted as such. Other letters modified by diacritics are treated as variants of the underlying letter, with the exception that ü is frequently sorted as y.

    import Foundation

    let symbol = "The Swedish letters Å, Ä, Ö"
    let string = "a"
    // Search without a locale, ignoring case and diacritics.
    let symbolRange = symbol.range(of: string, options: [.caseInsensitive, .diacriticInsensitive])

    if symbolRange != nil {
        print("Found '\(string)' in '\(symbol)'")
    } else {
        print("'\(string)' not found in '\(symbol)'")
    }

Prints: Found 'a' in 'The Swedish letters Å, Ä, Ö'

Should print: 'a' not found in 'The Swedish letters Å, Ä, Ö'

Replacing the search string with "o" shows the same issue.
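To make the expected result concrete, here is a minimal sketch of a Swedish-aware match. The names `swedishLetters`, `swedishFolded`, and `swedishContains` are hypothetical helpers written for illustration; they are not part of Foundation, and this is not how Foundation implements `.diacriticInsensitive`. The sketch only assumes that å, ä, and ö must survive folding while other diacritics are stripped.

```swift
import Foundation

// Hypothetical helper data, not a Foundation API: the letters that Swedish
// treats as distinct letters rather than as accented variants of a and o.
let swedishLetters: Set<String> = ["å", "ä", "ö"]

// Folds case and diacritics character by character, but leaves å, ä, ö intact.
func swedishFolded(_ text: String) -> String {
    var result = ""
    for character in text {
        let lowered = String(character).lowercased()
        if swedishLetters.contains(lowered) {
            // Keep the distinct Swedish letter (lowercased only).
            result += lowered
        } else {
            // Strip diacritics from everything else, e.g. é becomes e.
            result += lowered.folding(options: .diacriticInsensitive, locale: nil)
        }
    }
    return result
}

// Returns true if `needle` occurs in `haystack` after Swedish-aware folding.
func swedishContains(_ haystack: String, _ needle: String) -> Bool {
    swedishFolded(haystack).contains(swedishFolded(needle))
}

let swedishSample = "The Swedish letters Å, Ä, Ö"
print(swedishContains(swedishSample, "a"))  // false: Å and Ä are not folded to a
print(swedishContains("Café", "e"))         // true: é is still treated as a diacritic
```

A per-character fold like this loses the mapping back to ranges in the original string, so it only answers "does it match", not "where".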

hassila commented 6 months ago

Right. As a native Swedish speaker, I find the current behavior very strange: matching with diacritic insensitivity gives a result that is really incorrect for Swedish.

jhansbo commented 6 months ago

Also note that sorting will be incorrect. A, B, C, D, ..., X, Y, Z, Å, Ä, Ö is the correct sorting order for the Swedish alphabet. In a Swedish dictionary é is sorted along with e and ü is sorted along with u (both are true diacritics), but å and ä are not sorted along with a, and ö is not sorted along with o.

As mentioned, this is also a problem for Norwegian and Danish. It's peculiar that only Å and Æ are considered diacritics (the Danish equivalents of Å and Ä) but Ø is not (the Danish equivalent of Ö).
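The sorting point above can be checked with a small comparison between a locale-aware sort and Swift's default string ordering. This is a sketch: the word list and the `sv_SE` locale identifier are assumptions chosen for illustration, and no output is asserted here because the result depends on the collation data of the Foundation build in use.

```swift
import Foundation

// Sample words (an assumption for illustration), including the three
// letters that should sort after z in Swedish.
let words = ["zebra", "ö", "apa", "å", "ä", "orm"]
let swedish = Locale(identifier: "sv_SE")

// Locale-aware comparison: under Swedish rules å, ä, ö are expected
// to come after z, in that order.
let localeAware = words.sorted {
    $0.compare($1, options: [], range: nil, locale: swedish) == .orderedAscending
}

// Swift's default String ordering, which takes no locale into account.
let plain = words.sorted()

print("Swedish locale:", localeAware)
print("Default order: ", plain)
```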

itingliu commented 4 months ago

This API, range(of:options:), isn't locale- or language-aware. While these letters are distinct letters in Swedish, they are indeed diacritics in other languages, so it's challenging to make that distinction here.

That being said, I would definitely expect the localized version of this API, e.g. range(of: string, options: [.caseInsensitive, .diacriticInsensitive], locale: Locale(languageCode: .swedish)), to return what you described, but it currently doesn't. Would you agree that we should track that issue instead?
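For reference, a minimal sketch that puts the two calls discussed above side by side. `Locale(identifier: "sv_SE")` is an assumption used here in place of `Locale(languageCode: .swedish)` so the snippet does not depend on newer OS availability. No output is asserted, since whether the localized call returns nil is exactly what this issue is about.

```swift
import Foundation

let text = "The Swedish letters Å, Ä, Ö"
let options: String.CompareOptions = [.caseInsensitive, .diacriticInsensitive]

// Non-localized search: reported above to match "a" against Å or Ä.
let unlocalized = text.range(of: "a", options: options)

// Localized search with a Swedish locale: the expectation in this thread
// is that this returns nil, because å and ä are distinct letters in Swedish.
let localized = text.range(of: "a", options: options, locale: Locale(identifier: "sv_SE"))

print("match without locale:     ", unlocalized != nil)
print("match with Swedish locale:", localized != nil)
```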

hassila commented 4 months ago

It seems there are many more languages that treat them as separate letters.

See e.g.

https://en.wikipedia.org/wiki/Å#:~:text=It%20is%20a%20separate%20letter,Pamirian%20languages%2C%20and%20Greenlandic%20alphabets.

But as there is no single correct answer, moving this case to be for the Swedish locale would be ok I think. (Although I think the locale-unaware default is debatable, I guess it's been that way for some time...)