Closed: jerstlouis closed this issue 8 months ago.
I looked into this a bit more and found the following resources on character folding:
Normalization followed by stripping some characters (e.g., non-spacing marks) will not take care of several of these types of character folding, such as replacing ø with o, matching equivalent katakana and hiragana (e.g., か and カ), or matching æ to ae.
However, according to that proposed report (first link, from 2002), it seems that nothing has yet been standardized in Unicode to define character folding (beyond the decomposition mappings and stripping combining marks); this is only at a technical report stage. There are 40 different types of character folding enumerated in the rows of the table in 4.2 Specification of folding operations, so the complexity of implementing all those cases of character folding is daunting. The last column states that the data files specifying the mappings are TBD, specifically for:
For several of the rows in that table, the compatibility decomposition (NFKD) (as opposed to the canonical decomposition (NFD)) could be used, which is available from UnicodeData.txt. It would be good to clarify which type of decomposition is expected to be performed, NFD or NFKD. I had so far assumed NFD, as that seems to be what is most commonly used and talked about for accent removal.
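To illustrate the NFD/NFKD difference, here is a sketch using Python's standard `unicodedata` module: NFD applies only canonical decompositions, NFKD additionally applies compatibility decompositions (such as ligatures), and neither decomposes ø.

```python
import unicodedata

# NFD applies canonical decompositions only; NFKD additionally applies
# compatibility decompositions (ligatures, fractions, etc.).
print(ascii(unicodedata.normalize("NFD", "\u00e9")))   # é -> 'e\u0301'
print(unicodedata.normalize("NFKD", "\ufb01"))         # ﬁ ligature -> 'fi' (NFKD only)
print(ascii(unicodedata.normalize("NFD", "\ufb01")))   # unchanged under NFD
print(ascii(unicodedata.normalize("NFKD", "\u00f8")))  # ø: unchanged under both
```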
If the diacritics are not included in the folding, the mention of diacritics should be removed from the spec at the beginning of the section. In this report, it is confusing that the first row is called accent folding while the next row, diacritic folding, uses a TBD file called AccentFolding.txt. Are diacritics considered accents or not? I'm confused! (I think accents are a particular type of diacritic.) Perhaps another function could be defined for wider character folding, including diacritic folding, kana folding, letterforms folding, compatibility decomposition folding, and space folding?
There are references to an FTP directory ftp://ftp.unicode.org/Public/UNIDATA/ that no longer exists, where an AccentFolding.txt might once have existed. I also noticed https://www.unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt, which lists other equivalent characters, but no Latin characters.
There are some example foldings here (including folding ø into o):
https://stackoverflow.com/questions/3686020/case-sensitive-accent-folding-in-javascript
But since it seems that Unicode does not standardize diacritic folding, if we want CQL2 `ACCENTI()` to fold diacritics, we would need to provide an explicit list of what gets folded into what.
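A minimal sketch of what such an explicit list could look like (the table contents and the `fold_extra` name are hypothetical illustrations, not taken from any Unicode data file):

```python
# Hypothetical supplemental fold table -- Unicode does not standardize these
# foldings, so a spec would have to enumerate them explicitly.
EXTRA_FOLDS = {
    "\u00f8": "o",   # ø
    "\u00d8": "O",   # Ø
    "\u00e6": "ae",  # æ
    "\u00c6": "AE",  # Æ
    "\u0153": "oe",  # œ
    "\u0152": "OE",  # Œ
}

def fold_extra(s: str) -> str:
    # Replace each listed character; leave everything else untouched.
    return "".join(EXTRA_FOLDS.get(c, c) for c in s)

print(fold_extra("k\u00f8benhavn"))  # kobenhavn
```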
Ahh!! Things are getting even more interesting. I found a newer Draft report from 2004: http://www.unicode.org/reports/tr30/tr30-4.html
which was actually withdrawn. ([116-C8] Consensus: Withdraw Draft Unicode Technical Report # 30: Unicode Character Foldings with the understanding that it could be revived in the future.)
In that report, in addition to CaseFolding and the folding that can be performed based on the canonical (NFD) or compatibility (NFKD) decomposition, it introduces multiple types of character folding that are qualified as provisional; for these, there are working links to these 9 data files:
There is also this file:
https://www.unicode.org/reports/tr30/datafiles/Foldings.txt
which summarizes what to do for each type of folding and references those files.
I think none of this, however, will match æ to ae, or œ to oe, since they do not have a decomposition and are not included in any of these files:

    00E6;LATIN SMALL LETTER AE;Ll;0;L;;;;;N;LATIN SMALL LETTER A E;;00C6;;00C6
    0153;LATIN SMALL LIGATURE OE;Ll;0;L;;;;;N;LATIN SMALL LETTER O E;;0152;;0152
So we would need to clarify:

- which decomposition (NFD or NFKD) we expect `ACCENTI()` to perform
- which characters get stripped after decomposition (only the `Mn` non-spacing marks?)

As I mention in #847, it is quite the treasure hunt to figure this stuff out, so it would be good to provide the guidance in the spec :)
@jerstlouis @pvretano
I cannot say that I really understand the nuances (or why the normalized ø is still a single char), but I will just change the tests from "købnhavn" to "Chișinău". I will do this in #857.
I think we will have to accept that this is a complex topic and that there will be (edge?) cases where the current `accenti` requirements may not lead to the expected result. And I don't think that we should do more than we already have in #857 on the subject. If there is a clear description somewhere that we can reference, we should do that and add it to the bibliography. But we should not add more text on this to the CQL2 standard.
For `københavn`, the NFD decomposition is still `københavn`, without the stroke on the `o` being separated out (e.g., try here: https://dencode.com/en/string/unicode-normalization), so no non-spacing mark gets stripped and an `accenti()` comparison with `kobenhavn` should be false.

(NOTE: a different Unicode character sequence of one or multiple codepoints could still be rendered exactly the same visually -- that is the idea of a canonical decomposition, so that makes it somewhat difficult to test all this.)
Field 5 (the 6th field) is the decomposition mapping; it is empty, meaning this character has no decomposition.
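For example, splitting the U+00E6 record above on `;` (assuming the semicolon-separated UnicodeData.txt record format):

```python
# One record from UnicodeData.txt; fields are separated by semicolons.
record = "00E6;LATIN SMALL LETTER AE;Ll;0;L;;;;;N;LATIN SMALL LETTER A E;;00C6;;00C6"
fields = record.split(";")
print(repr(fields[5]))  # '' -- the decomposition mapping (field 5) is empty
```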
`ACCENTI` should NOT match it with `kobenhavn`, as seems to be expected in several places in Table 9. That would be consistent with the expected result in the last entry of Table 8:

`CASEI(name) IN (casei('Kiev'), casei('kobenhavn'), casei('Berlin'), casei('athens'), casei('foo'))` = 3

where it only expects Kiev, Berlin and Athens to match.
Also, as noted in https://github.com/opengeospatial/ogcapi-features/issues/847#issuecomment-1634591254 , Requirement 9 is missing a clause about actually stripping the accents (non-spacing marks) after the normalization. Normalization by itself does not remove accents; it only ensures a particular way to encode them.
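A sketch of the intended two-step behavior (NFD normalization followed by stripping category-`Mn` marks), using Python's `unicodedata`; the function name `accenti` is only illustrative:

```python
import unicodedata

def accenti(s: str) -> str:
    # Step 1: NFD normalization separates base characters from combining marks.
    nfd = unicodedata.normalize("NFD", s)
    # Step 2: strip the non-spacing marks (general category Mn).
    return "".join(c for c in nfd if unicodedata.category(c) != "Mn")

print(accenti("Chi\u0219in\u0103u"))      # Chisinau
print(accenti("k\u00f8benhavn"))          # københavn (ø has no decomposition)
```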
@cportele @pvretano @pcdion
Some tests that should be added:

- `CASEI()`, testing full case folding from one character to multiple characters

Though such names may not exist in the Natural Earth test dataset, a literal-literal comparison would be useful. We will try to come up with examples.
Here are some tests for ACCENTI:

- `ACCENTI('Ḕ') = ACCENTI('E')` should be true (U+1E15 -> U+0113 U+0300 -> U+0065 U+0304 U+0300; the combining marks are non-spacing marks (`Mn`) and would be stripped anyways)
- A counter-example of a canonical ordering test (that doesn't matter for ACCENTI): é̂ (U+0065 U+0302 U+0301) is not equivalent to ế (U+0065 U+0301 U+0302), because the circumflex and acute accent are both of combining class 230 and therefore should not be re-ordered.

NOTE: This has me wondering whether there should be another function that compares strings using Unicode normalization WITHOUT stripping accents, or even whether that behavior should be expected in general when ACCENTI() is NOT used?
That is, the following really are equivalent strings as far as Unicode is concerned, irrespective of ignoring accents, even though the UTF-8 encoding is different:

I think there should at least be a permission to recognize these strings as equivalent in Basic CQL2.
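Both points can be demonstrated with `unicodedata` (a sketch): canonically equivalent spellings compare equal after normalization, while two marks of the same combining class keep their order and therefore remain distinct:

```python
import unicodedata

# 'é' as one codepoint (U+00E9) vs. base letter + combining acute (U+0065 U+0301):
# canonically equivalent, so they compare equal after normalization.
print(unicodedata.normalize("NFC", "\u00e9") == unicodedata.normalize("NFC", "e\u0301"))  # True

# Circumflex and acute accents are both combining class 230, so canonical
# ordering never swaps them: the two orders are NOT equivalent.
print(unicodedata.combining("\u0301"), unicodedata.combining("\u0302"))  # 230 230
print(unicodedata.normalize("NFD", "e\u0302\u0301")
      == unicodedata.normalize("NFD", "e\u0301\u0302"))  # False
```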
Here are some tests for CASEI:

- `CASEI('İ') = CASEI('i')` should be FALSE (only true with the special case folding used for Turkish languages, 'T' in CaseFolding.txt)
- `CASEI('İ') = CASEI('i̇')` should be true (unlike the above, this lowercase i̇ on the right-hand side differs from i, with the dot slightly lower, and includes the extra U+0307 codepoint. İi̇i Unicode is so much fun!)
- `CASEI('ǰ') = CASEI('ǰ')` should be true
- `CASEI('ΰ') = CASEI('ΰ')` should be true
- `CASEI('և') = CASEI('եւ')` should be true
- `CASEI('ῼ') = CASEI('ωι')` should be true
- `CASEI('ῼ') = CASEI('ῳ')` -- Scratching what I wrote here earlier. Because CASEI is applied on both sides, the full case folding will make this equivalent (U+03C9 U+03B9). In an implementation where `CASEI()` returns the case-folded string, `CASEI('ῼ') = 'ῳ'` will be false, however, while `CASEI('ῼ') = 'ωι'` will be true.

NOTE: the Unicode normalization tests can be found at https://www.unicode.org/Public/UCD/latest/ucd/NormalizationTest.txt
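For reference, Python's `str.casefold()` implements Unicode full case folding, so it can be used to sanity-check the expectations above (a sketch, not a normative CASEI implementation):

```python
# str.casefold() applies Unicode full case folding (the C and F entries
# in CaseFolding.txt, without the Turkish 'T' special cases).
print("\u0130".casefold() == "i")                   # False: İ folds to i + U+0307
print("\u0130".casefold() == "i\u0307".casefold())  # True
print("\u1ffc".casefold() == "\u03c9\u03b9")        # True: ῼ folds to ωι
print("\u1ffc".casefold() == "\u1ff3".casefold())   # True: ῳ also folds to ωι
print("\u0587".casefold() == "\u0565\u0582")        # True: և folds to եւ
```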