Closed: jerstlouis closed this issue 8 months ago.
I looked into this a bit more and found the following resources on character folding:
Normalization followed by stripping some characters (e.g., non-spacing marks) will not take care of several of these types of character folding, such as replacing ø with o, matching equivalent katakana and hiragana (e.g., か and カ), or matching æ to ae.
However, according to that proposed report (first link, from 2002), it seems that nothing has yet been standardized in Unicode to define character folding (beyond the decomposition mappings and stripping combining marks); this is only at a technical report stage. There are 40 different types of character folding enumerated in the rows of the table in 4.2 Specification of folding operations, so the complexity of implementing all those cases of character folding is daunting. The last column states that the data files specifying the mappings are TBD, specifically for:
For several of the rows in that table, the compatibility decomposition (NFKD) (as opposed to the canonical decomposition (NFD)) could be used, which is available from UnicodeData.txt. It would be good to clarify which type of decomposition is expected to be performed, NFD or NFKD. I had so far assumed NFD, as that seems to be what is most commonly used and talked about for accent removal.
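To illustrate the NFD/NFKD difference, here is a sketch using Python's standard `unicodedata` module: NFD applies only canonical decompositions, NFKD additionally applies compatibility decompositions (such as ligatures), and neither decomposes ø.

```python
import unicodedata

# NFD applies canonical decompositions only; NFKD additionally applies
# compatibility decompositions (ligatures, fractions, etc.).
print(ascii(unicodedata.normalize("NFD", "\u00e9")))   # é -> 'e\u0301'
print(unicodedata.normalize("NFKD", "\ufb01"))         # ﬁ ligature -> 'fi' (NFKD only)
print(ascii(unicodedata.normalize("NFD", "\ufb01")))   # unchanged under NFD
print(ascii(unicodedata.normalize("NFKD", "\u00f8")))  # ø: unchanged under both
```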
If the diacritics are not included in the folding, the mention of diacritics should be removed from the spec at the beginning of the section. In this report, it is confusing that the first row is called accent folding while the next row, diacritic folding, uses a TBD file called AccentFolding.txt. Are diacritics considered accents or not? I'm confused! (I think accents are a particular type of diacritic.) Perhaps another function could be defined for wider character folding, including diacritic folding, kana folding, letterforms folding, compatibility decomposition folding, and space folding?
There are references to an FTP directory ftp://ftp.unicode.org/Public/UNIDATA/ that no longer exists, where an AccentFolding.txt might once have existed. I also noticed https://www.unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt, which lists other equivalent characters, but no Latin characters.
There are some example foldings here (including folding ø into o):
https://stackoverflow.com/questions/3686020/case-sensitive-accent-folding-in-javascript
But since it seems that Unicode does not standardize diacritic folding, if we want CQL2 `ACCENTI()` to fold diacritics, we would need to provide an explicit list of what gets folded into what.
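A minimal sketch of what such an explicit list could look like (the table contents and the `fold_extra` name are hypothetical illustrations, not taken from any Unicode data file):

```python
# Hypothetical supplemental fold table -- Unicode does not standardize these
# foldings, so a spec would have to enumerate them explicitly.
EXTRA_FOLDS = {
    "\u00f8": "o",   # ø
    "\u00d8": "O",   # Ø
    "\u00e6": "ae",  # æ
    "\u00c6": "AE",  # Æ
    "\u0153": "oe",  # œ
    "\u0152": "OE",  # Œ
}

def fold_extra(s: str) -> str:
    # Replace each listed character; leave everything else untouched.
    return "".join(EXTRA_FOLDS.get(c, c) for c in s)

print(fold_extra("k\u00f8benhavn"))  # kobenhavn
```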
Ahh!! Things are getting even more interesting. I found a newer Draft report from 2004: http://www.unicode.org/reports/tr30/tr30-4.html
which was actually withdrawn. ([116-C8] Consensus: Withdraw Draft Unicode Technical Report # 30: Unicode Character Foldings with the understanding that it could be revived in the future.)
In that report, in addition to CaseFolding and the folding that can be performed based on the canonical (NFD) or compatibility (NFKD) decomposition, it introduces multiple types of character folding that are qualified as provisional; for these, there are working links to these 9 data files:
There is also this file:
https://www.unicode.org/reports/tr30/datafiles/Foldings.txt
which summarizes what to do for each type of folding and references those files.
I think none of this, however, will match æ to ae, or œ to oe, since they do not have a decomposition and are not included in any of these files:

    00E6;LATIN SMALL LETTER AE;Ll;0;L;;;;;N;LATIN SMALL LETTER A E;;00C6;;00C6
    0153;LATIN SMALL LIGATURE OE;Ll;0;L;;;;;N;LATIN SMALL LETTER O E;;0152;;0152
So we would need to clarify:

- which decomposition (NFD or NFKD) we expect `ACCENTI()` to perform
- which characters get stripped after decomposition (only the `Mn` non-spacing marks?)

As I mention in #847, it is quite the treasure hunt to figure this stuff out, so it would be good to provide the guidance in the spec :)
@jerstlouis @pvretano
I cannot say that I really understand the nuances (or why the normalized ø is still a single char), but I will just change the tests from "købnhavn" to "Chișinău". I will do this in #857.
I think we will have to accept that this is a complex topic and that there will be (edge?) cases where the current `accenti` requirements may not lead to the expected result. And I don't think that we should do more than we already have in #857 on the subject. If there is a clear description somewhere that we can reference, we should do that and add it to the bibliography. But we should not add more text on this to the CQL2 standard.
For `københavn`, the NFD decomposition is still `københavn`, without the stroke on the `o` being separated out (e.g., try here: https://dencode.com/en/string/unicode-normalization), so no non-spacing mark gets stripped and an `accenti()` comparison with `kobenhavn` should be false.

(NOTE: a different Unicode character sequence of one or multiple codepoints could still be rendered exactly the same visually -- that is the idea of a canonical decomposition, so that makes it somewhat difficult to test all this.)
Field 5 (the 6th field) is the decomposition mapping; it is empty, meaning this character has no decomposition.
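For example, splitting the U+00E6 record above on `;` (assuming the semicolon-separated UnicodeData.txt record format):

```python
# One record from UnicodeData.txt; fields are separated by semicolons.
record = "00E6;LATIN SMALL LETTER AE;Ll;0;L;;;;;N;LATIN SMALL LETTER A E;;00C6;;00C6"
fields = record.split(";")
print(repr(fields[5]))  # '' -- the decomposition mapping (field 5) is empty
```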
`ACCENTI` should NOT match it with `kobenhavn`, as seems to be expected in several places in Table 9. That would be consistent with the expected result in the last entry of Table 8:

`CASEI(name) IN (casei('Kiev'), casei('kobenhavn'), casei('Berlin'), casei('athens'), casei('foo'))` = 3

where it only expects Kiev, Berlin and Athens to match.
Also, as noted in https://github.com/opengeospatial/ogcapi-features/issues/847#issuecomment-1634591254 , Requirement 9 is missing a clause about actually stripping the accents (non-spacing marks) after the normalization. Normalization by itself does not remove accents; it only ensures a particular way to encode them.
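A sketch of the intended two-step behavior (NFD normalization followed by stripping category-`Mn` marks), using Python's `unicodedata`; the function name `accenti` is only illustrative:

```python
import unicodedata

def accenti(s: str) -> str:
    # Step 1: NFD normalization separates base characters from combining marks.
    nfd = unicodedata.normalize("NFD", s)
    # Step 2: strip the non-spacing marks (general category Mn).
    return "".join(c for c in nfd if unicodedata.category(c) != "Mn")

print(accenti("Chi\u0219in\u0103u"))      # Chisinau
print(accenti("k\u00f8benhavn"))          # københavn (ø has no decomposition)
```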
@cportele @pvretano @pcdion
Some tests that should be added:

- `CASEI()`, testing full case folding from one character to multiple characters

Though such names may not exist in the Natural Earth test dataset, a literal-literal comparison would be useful. We will try to come up with examples.
Here are some tests for ACCENTI:

- `ACCENTI('Ḕ') = ACCENTI('E')` should be true (U+1E15 -> U+0113 U+0300 -> U+0065 U+0304 U+0300; the combining marks are non-spacing marks (`Mn`) and would be stripped anyways)
- A counter-example of a canonical ordering test (that doesn't matter for ACCENTI): é̂ (U+0065 U+0302 U+0301) is not equivalent to ế (U+0065 U+0301 U+0302), because the circumflex and acute accent are both of combining class 230 and therefore should not be re-ordered.

NOTE: This has me wondering whether there should be another function that compares strings using Unicode normalization WITHOUT stripping accents, or even whether that behavior should be expected in general when ACCENTI() is NOT used?
That is, the following really are equivalent strings as far as Unicode is concerned, irrespective of ignoring accents, even though the UTF-8 encoding is different:

I think there should at least be a permission to recognize these strings as equivalent in Basic CQL2.
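Both points can be demonstrated with `unicodedata` (a sketch): canonically equivalent spellings compare equal after normalization, while two marks of the same combining class keep their order and therefore remain distinct:

```python
import unicodedata

# 'é' as one codepoint (U+00E9) vs. base letter + combining acute (U+0065 U+0301):
# canonically equivalent, so they compare equal after normalization.
print(unicodedata.normalize("NFC", "\u00e9") == unicodedata.normalize("NFC", "e\u0301"))  # True

# Circumflex and acute accents are both combining class 230, so canonical
# ordering never swaps them: the two orders are NOT equivalent.
print(unicodedata.combining("\u0301"), unicodedata.combining("\u0302"))  # 230 230
print(unicodedata.normalize("NFD", "e\u0302\u0301")
      == unicodedata.normalize("NFD", "e\u0301\u0302"))  # False
```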
Here are some tests for CASEI:

- `CASEI('İ') = CASEI('i')` should be FALSE (only true with the special case folding used for Turkish languages, 'T' in CaseFolding.txt)
- `CASEI('İ') = CASEI('i̇')` should be true (unlike the above, this lowercase i̇ on the right-hand side differs from i, with the dot slightly lower, and includes the extra U+0307 codepoint. İi̇i Unicode is so much fun!)
- `CASEI('ǰ') = CASEI('ǰ')` should be true
- `CASEI('ΰ') = CASEI('ΰ')` should be true
- `CASEI('և') = CASEI('եւ')` should be true
- `CASEI('ῼ') = CASEI('ωι')` should be true
- `CASEI('ῼ') = CASEI('ῳ')` -- Scratching what I wrote here earlier. Because CASEI is applied on both sides, the full case folding will make this equivalent (U+03C9 U+03B9). In an implementation where `CASEI()` returns the case-folded string, `CASEI('ῼ') = 'ῳ'` will be false, however, while `CASEI('ῼ') = 'ωι'` will be true.

NOTE: the Unicode normalization tests can be found at https://www.unicode.org/Public/UCD/latest/ucd/NormalizationTest.txt
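For reference, Python's `str.casefold()` implements Unicode full case folding, so it can be used to sanity-check the expectations above (a sketch, not a normative CASEI implementation):

```python
# str.casefold() applies Unicode full case folding (the C and F entries
# in CaseFolding.txt, without the Turkish 'T' special cases).
print("\u0130".casefold() == "i")                   # False: İ folds to i + U+0307
print("\u0130".casefold() == "i\u0307".casefold())  # True
print("\u1ffc".casefold() == "\u03c9\u03b9")        # True: ῼ folds to ωι
print("\u1ffc".casefold() == "\u1ff3".casefold())   # True: ῳ also folds to ωι
print("\u0587".casefold() == "\u0565\u0582")        # True: և folds to եւ
```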