opengeospatial / ogcapi-features

An open standard for querying geospatial information on the web.
https://ogcapi.ogc.org/features

CQL2: (AT) Table 9 wrongly expects københavn matching kobenhavn #850

Closed (jerstlouis closed this issue 8 months ago)

jerstlouis commented 1 year ago

The NFD decomposition of københavn is still københavn; the stroke on the ø does not get separated out (e.g., try it here: https://dencode.com/en/string/unicode-normalization). No non-spacing mark gets stripped, so an accenti() comparison with kobenhavn should evaluate to false.

(NOTE: a different Unicode character sequence of one or more codepoints could still be rendered exactly the same visually -- that is the idea of a canonical decomposition -- which makes all of this somewhat difficult to test.)

00F8;LATIN SMALL LETTER O WITH STROKE;Ll;0;L;;;;;N;LATIN SMALL LETTER O SLASH;;00D8;;00D8

Field 5 (the 6th field) is the decomposition mapping; here it is empty, meaning this character has no decomposition.

ACCENTI should NOT match it with kobenhavn, as seems to be expected in several places in Table 9.
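A quick way to confirm this is with Python's unicodedata module (just a sketch to illustrate the point, not anything the spec mandates):

```python
import unicodedata

def nfd_strip_marks(s: str) -> str:
    # NFD-normalize, then drop the non-spacing marks (general category Mn)
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

print(unicodedata.decomposition("\u00F8"))           # '' -> U+00F8 has no decomposition
print(nfd_strip_marks("københavn"))                  # 'københavn' -- the ø is untouched
print(nfd_strip_marks("københavn") == "kobenhavn")   # False
```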

That would be consistent with the expected result in the last entry of Table 8:

CASEI(name) IN (casei('Kiev'), casei('kobenhavn'), casei('Berlin'), casei('athens'), casei('foo')) = 3

where it only expects Kiev, Berlin and Athens to match.

Also, as noted in https://github.com/opengeospatial/ogcapi-features/issues/847#issuecomment-1634591254, Requirement 9 is missing a clause about actually stripping the accents (non-spacing marks) after the normalization. Normalization by itself does not remove accents; it only ensures that they are encoded in a particular way.
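To make the missing step concrete, here is a minimal sketch (my own illustration, not the wording of Requirement 9) of normalizing and then stripping the marks:

```python
import unicodedata

s = "\u00E9"                                   # 'é' (precomposed, U+00E9)
nfd = unicodedata.normalize("NFD", s)          # 'e' + U+0301 COMBINING ACUTE ACCENT
print(len(s), len(nfd))                        # 1 2 -- the accent is still there after NFD
print("".join(c for c in nfd
              if unicodedata.category(c) != "Mn"))   # 'e' -- only after stripping the marks
```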

@cportele @pvretano @pcdion

Some tests that should be added:

Though such names may not exist in the Natural Earth test dataset, a literal-literal comparison would be useful. We will try to come up with examples.

Here are some tests for ACCENTI:

NOTE: This has me wondering whether there should be another function that compares strings using Unicode normalization WITHOUT stripping accents, or even whether that behavior should be expected in general when ACCENTI() is NOT used?

That is, the following really are equivalent strings as far as Unicode is concerned, irrespective of ignoring accents, even though their UTF-8 encodings differ:

I think there should at least be a permission to recognize these strings as equivalent in Basic CQL2.
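For illustration (the pair below is my own example, not one from the tests above), a precomposed and a decomposed spelling of the same string are canonically equivalent and compare equal after normalization, without ignoring any accents:

```python
import unicodedata

a = "Z\u00FCrich"    # 'Zürich' with precomposed ü (U+00FC)
b = "Zu\u0308rich"   # 'Zürich' with 'u' followed by U+0308 COMBINING DIAERESIS
print(a == b)                                  # False: different codepoints (and UTF-8 bytes)
print(unicodedata.normalize("NFC", a) ==
      unicodedata.normalize("NFC", b))         # True: canonically equivalent
```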

Here are some tests for CASEI:

NOTE: the Unicode normalization tests can be found at https://www.unicode.org/Public/UCD/latest/ucd/NormalizationTest.txt

jerstlouis commented 1 year ago

I looked into this a bit more and found the following resources on character folding:

Normalization followed by stripping some characters (e.g., non-spacing marks) will not take care of several of these types of character folding, such as replacing ø with o, matching equivalent katakana and hiragana characters, or matching æ to ae.

However, according to that proposed report (first link, from 2002), nothing has yet been standardized in Unicode to define character folding (beyond the decomposition mappings and the stripping of combining marks); this is only at a technical report stage. There are 40 different types of character folding enumerated in the rows of the table in 4.2 Specification of folding operations, so the complexity of implementing all those cases of character folding is daunting. The last column states that the data files specifying the mappings are TBD, specifically for:

For several of the rows in that table, the compatibility decomposition (NFKD), as opposed to the canonical decomposition (NFD), could be used; that is available from UnicodeData.txt. It would be good to clarify which type of decomposition is expected to be performed, NFD or NFKD. I had so far assumed NFD, as that seems to be what is most commonly used and discussed for accent removal.
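A small sketch of the difference (the characters chosen are just illustrative):

```python
import unicodedata

for s in ["\u00E9",   # é: decomposes under both NFD and NFKD
          "\uFB01",   # ﬁ (ligature): only NFKD folds it to 'fi'
          "\u00F8"]:  # ø: neither form decomposes it
    print(repr(s),
          repr(unicodedata.normalize("NFD", s)),
          repr(unicodedata.normalize("NFKD", s)))
```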

If diacritics are not included in the folding, the mention of diacritics should be removed from the beginning of the section in the spec. In this report, it is confusing that the first row is called accent folding while a separate diacritic folding row exists that uses a TBD file called AccentFolding.txt. Are diacritics considered accents or not? I'm confused! (I think accents are a particular type of diacritic.) Perhaps another function could be defined for wider character folding, including diacritic folding, kana folding, letterform folding, compatibility decomposition folding and space folding?

There are references to an FTP directory ftp://ftp.unicode.org/Public/UNIDATA/ that no longer exists, where an AccentFolding.txt might once have existed. I also noticed https://www.unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt, which lists other equivalent characters, but no Latin characters.

There are some example foldings here (including folding ø into o):

https://stackoverflow.com/questions/3686020/case-sensitive-accent-folding-in-javascript

But since it seems that Unicode does not standardize diacritic folding, if we want CQL2 ACCENTI() to fold diacritics, we would need to provide an explicit list of what gets folded into what.
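For what it's worth, a minimal sketch of what such an explicit list could look like on top of NFD and mark stripping (the mappings below are examples only, not a proposed normative list):

```python
import unicodedata

# Illustrative fold table only -- CQL2 would have to define the actual list.
FOLD = {"\u00F8": "o", "\u00D8": "O",      # ø, Ø
        "\u00E6": "ae", "\u00C6": "AE",    # æ, Æ
        "\u0153": "oe", "\u0152": "OE"}    # œ, Œ

def accent_fold(s: str) -> str:
    # NFD-normalize, strip non-spacing marks, then apply the explicit fold table
    nfd = unicodedata.normalize("NFD", s)
    no_marks = "".join(c for c in nfd if unicodedata.category(c) != "Mn")
    return "".join(FOLD.get(c, c) for c in no_marks)

print(accent_fold("københavn"))                  # 'kobenhavn'
print(accent_fold("københavn") == "kobenhavn")   # True
```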

jerstlouis commented 1 year ago

Ahh!! Things are getting even more interesting. I found a newer Draft report from 2004: http://www.unicode.org/reports/tr30/tr30-4.html

which was actually withdrawn. ([116-C8] Consensus: Withdraw Draft Unicode Technical Report # 30: Unicode Character Foldings with the understanding that it could be revived in the future.)

In that report, in addition to CaseFolding and the folding that can be performed based on the canonical (NFD) or compatibility (NFKD) decomposition, multiple types of character folding are introduced that are qualified as provisional; for these, there are working links to these 9 data files:

There is also this file:

https://www.unicode.org/reports/tr30/datafiles/Foldings.txt

which summarizes what to do for each type of folding and references those files.

I think none of this, however, will match æ to ae or œ to oe, since those characters do not have a decomposition and are not included in any of these files:

00E6;LATIN SMALL LETTER AE;Ll;0;L;;;;;N;LATIN SMALL LETTER A E;;00C6;;00C6
0153;LATIN SMALL LIGATURE OE;Ll;0;L;;;;;N;LATIN SMALL LETTER O E;;0152;;0152
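A quick check (again just a sketch with Python's unicodedata) confirms that neither NFD nor NFKD decomposes them:

```python
import unicodedata

for s in ["\u00E6", "\u0153"]:   # æ, œ
    print(repr(s),
          repr(unicodedata.decomposition(s)),      # '' -> field 5 is empty
          repr(unicodedata.normalize("NFKD", s)))  # unchanged even under NFKD
```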

So we would need to clarify:

As I mention in #847, it is quite the treasure hunt to figure this stuff out, so it would be good to provide the guidance in the spec :)

cportele commented 8 months ago

@jerstlouis @pvretano

I cannot say that I really understand the nuances (or why the normalized ø is still a single character), but I will just change the tests from "københavn" to "Chișinău". I will do this in #857.

I think we will have to accept that this is a complex topic and that there will be (edge?) cases where the current accenti requirements may not lead to the expected result. And I don't think that we should do more than we already have in #857 on the subject. If there is a clear description somewhere that we can reference we should do that and add it to the bibliography. But we should not add more text on this to the CQL2 standard.