unicode-org / unicodetools

home of unicodetools and https://util.unicode.org JSPs

do we need CollatorType.cldrWithoutFFFx? #794

Open markusicu opened 6 months ago

markusicu commented 6 months ago

WriteCollationData.getCollator(type) (issue #793 would move this function to class UCA) works with three types. One of them is cldrWithoutFFFx, which builds a CLDR collator except that it leaves U+FFFE and U+FFFF with their DUCET mappings rather than their CLDR tailorings.

Strangely, FractionalUCA.java works with such a collator, even though it writes "SPECIAL MAX/MIN COLLATION ELEMENTS" for these noncharacters, corresponding to the CLDR tailorings.

This type is also used for the UCA.Main option testCompatibilityCharacters.

Why? It seems confusing to have this third type, especially to get something different from what we actually output. Try to remove it and only use either a DUCET collator or a CLDR collator.

If we do need this option and keep it, then at least consider changing buildCldrCollator(boolean) to buildCldrCollator(enum type) for readability.
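A minimal sketch of that readability point, not the actual unicodetools code. The class `CollatorTypeSketch`, the placeholder `SketchCollator`, and the parameter name `addFFFx` are hypothetical, and the meaning given to the boolean here is an assumption. Only the names `CollatorType`, its three values, and `buildCldrCollator` come from this issue.

```java
final class CollatorTypeSketch {
    // Same three values as the enum quoted in this issue.
    enum CollatorType { ducet, cldr, cldrWithoutFFFx }

    /** Hypothetical placeholder for a built collator; it only records the options used. */
    static final class SketchCollator {
        final boolean cldrRootMappings;
        final boolean tailoredFFFx;

        SketchCollator(boolean cldrRootMappings, boolean tailoredFFFx) {
            this.cldrRootMappings = cldrRootMappings;
            this.tailoredFFFx = tailoredFFFx;
        }
    }

    // Boolean form: a call such as buildCldrCollator(false) does not say what is switched off.
    // ASSUMPTION: the flag is taken to mean "tailor U+FFFE/U+FFFF as in CLDR".
    static SketchCollator buildCldrCollator(boolean addFFFx) {
        return new SketchCollator(true, addFFFx);
    }

    // Enum form: the call site names the variant it wants.
    static SketchCollator buildCldrCollator(CollatorType type) {
        switch (type) {
            case cldr:
                return new SketchCollator(true, true);
            case cldrWithoutFFFx:
                return new SketchCollator(true, false);
            default:
                throw new IllegalArgumentException("not a CLDR collator type: " + type);
        }
    }
}
```

With the enum form, `buildCldrCollator(CollatorType.cldrWithoutFFFx)` reads unambiguously where `buildCldrCollator(false)` does not. If the third type is removed as proposed above, the parameter could go away entirely.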

@macchiati FYI

markusicu commented 2 months ago

@macchiati do we need the cldrWithoutFFFx option?

macchiati commented 2 months ago

Hmmm. As I recall, the FFFE and FFFF are to allow users to have minimum and maximum collation elements. As long as we continue to keep those in the CLDR data, I think we are ok.
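For readers unfamiliar with the minimum/maximum idea, here is a toy illustration. It is not the unicodetools or ICU implementation: the class `MinMaxWeightSketch` and its methods are hypothetical, and plain code point order stands in for real UCA weights. It only assumes what the CLDR tailoring is meant to provide, namely that U+FFFE sorts below and U+FFFF above every other character.

```java
import java.util.Comparator;

final class MinMaxWeightSketch {
    /** Toy "primary weight": U+FFFE is the minimum, U+FFFF the maximum. */
    static int toyWeight(int codePoint) {
        if (codePoint == 0xFFFE) return -1;                 // below every other code point
        if (codePoint == 0xFFFF) return Integer.MAX_VALUE;  // above every other code point
        return codePoint;                                   // placeholder, NOT real UCA weights
    }

    /** Compares code point by code point using the toy weights; a proper prefix sorts first. */
    static final Comparator<String> TOY_ORDER = (a, b) -> {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            int cmp = Integer.compare(toyWeight(ca), toyWeight(cb));
            if (cmp != 0) return cmp;
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // At least one string is exhausted; the one with text left over sorts later.
        return Integer.compare(a.length() - i, b.length() - j);
    };

    /** Prefix test written as a range check: prefix <= s <= prefix + U+FFFF. */
    static boolean hasPrefixByRange(String prefix, String s) {
        return TOY_ORDER.compare(prefix, s) <= 0
                && TOY_ORDER.compare(s, prefix + "\uFFFF") <= 0;
    }
}
```

With a maximum weight available, `hasPrefixByRange("Sch", "Schmidt")` holds while `hasPrefixByRange("Sch", "Sci")` does not. Without the special handling of U+FFFF, a supplementary character after the prefix would compare above the upper bound in code point order.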

markusicu commented 2 months ago

> Hmmm. As I recall, the FFFE and FFFF are to allow users to have minimum and maximum collation elements. As long as we continue to keep those in the CLDR data, I think we are ok.

Of course we are going to keep them in CLDR. --> https://www.unicode.org/reports/tr35/tr35-collation.html#tailored_noncharacter_weights

It (still) makes sense that we have two choices for collators, but why three? class UCA -->

    public enum CollatorType {
        ducet,
        cldr,
        cldrWithoutFFFx
    }

macchiati commented 2 months ago

I don't recall any reason to keep the "without" type.

markusicu commented 2 months ago

Thanks. Setting priority=high because the question is resolved, and it looks like the code change will be easy.