unicode-org / unicodetools

home of unicodetools and https://util.unicode.org JSPs
https://util.unicode.org

Old files (3.0 and earlier) are not parsed correctly #330

Open macchiati opened 2 years ago

macchiati commented 2 years ago

The PropList.txt file used to have a different structure, e.g.:

Property dump for: 0x10000001 (Zero-width)

200B..200F  (5 chars)
202A..202E  (5 chars)
206A..206F  (6 chars)
FEFF

Property dump for: 0x10000004 (White space)

0000
0009..000D  (5 chars)
0020
00A0
2000..200F  (16 chars)
2028..202E  (7 chars)
206A..206F  (6 chars)
3000
FEFF

The IndexUnicodeProperties code doesn't handle that correctly. It needs to detect the older style and read it as if it were the modern style:

0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
0085          ; White_Space # Cc       <control-0085>
00A0          ; White_Space # Zs       NO-BREAK SPACE
1680          ; White_Space # Zs       OGHAM SPACE MARK
2000..200A    ; White_Space # Zs  [11] EN QUAD..HAIR SPACE
2028          ; White_Space # Zl       LINE SEPARATOR
2029          ; White_Space # Zp       PARAGRAPH SEPARATOR
202F          ; White_Space # Zs       NARROW NO-BREAK SPACE
205F          ; White_Space # Zs       MEDIUM MATHEMATICAL SPACE
3000          ; White_Space # Zs       IDEOGRAPHIC SPACE
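
As a rough illustration of that conversion, here is a minimal Python sketch. It is not the IndexUnicodeProperties code (which is Java). The function names `normalize_name` and `convert_old_proplist`, the regular expressions, and the normalization rule (title-case each word and join with underscores) are all assumptions made for this example. It also does not reproduce the trailing `# Zs ...` comments of the modern file.

```python
import re

# Old-style header, e.g. "Property dump for: 0x10000004 (White space)".
HEADER = re.compile(r"^Property dump for:\s*\S+\s*\((?P<name>[^)]*)\)")
# Old-style data line, e.g. "0020" or "0009..000D  (5 chars)".
DATA = re.compile(r"^(?P<range>[0-9A-F]{4,6}(?:\.\.[0-9A-F]{4,6})?)\b")


def normalize_name(raw):
    """Assumed normalization: 'White space' -> 'White_Space'."""
    words = re.split(r"[\s_-]+", raw.strip())
    return "_".join(w[:1].upper() + w[1:] for w in words if w)


def convert_old_proplist(lines):
    """Yield modern-style 'range ; Property' lines from old-style dump lines."""
    current = None  # property name taken from the most recent header
    for line in lines:
        line = line.strip()
        header = HEADER.match(line)
        if header:
            current = normalize_name(header.group("name"))
            continue
        data = DATA.match(line)
        if data and current is not None:
            # Pad the code point (range) to 14 columns, as in the modern file.
            yield f"{data.group('range'):<14}; {current}"
```

Feeding it the old-style lines, for example `for row in convert_old_proplist(open(path)): print(row)` with a hypothetical `path`, yields one modern-style line per code point or range. Blank lines and anything before the first header are skipped.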

That is, when "Property dump for:" is detected, pick up the part in parentheses, normalize it, and treat it as the second field (the property name). There are some more oddities that should be investigated:

ASCII_Hex_Digit can't load in 2.0.0.0   ASCII_Hex_Digit( from: 2.0.0-Update/PropList.txt)
ASCII_Hex_Digit can't load in 2.1.0.0   ASCII_Hex_Digit( from: 2.0.0-Update/PropList.txt)
ASCII_Hex_Digit can't load in 3.0.0.0   ASCII_Hex_Digit( from: 3.0.0-Update/PropList.txt)
ASCII_Hex_Digit can't load in 3.2.0.0   ASCII_Hex_Digit( from: 3.2.0-Update/PropList.txt)
ASCII_Hex_Digit can't load in 4.0.0.0   ASCII_Hex_Digit( from: 4.0.0-Update/PropList.txt)
ASCII_Hex_Digit can't load in 4.1.0.0   ASCII_Hex_Digit( from: 4.1.0-Update/PropList.txt)
Basic_Emoji can't load in 8.0.0.0   Basic_Emoji( from: /Users/markdavis/github/unicodetools/unicodetools/data/emoji/3.0/emoji-sequences.txt)
Basic_Emoji can't load in 9.0.0.0   Basic_Emoji( from: /Users/markdavis/github/unicodetools/unicodetools/data/emoji/4.0/emoji-sequences.txt)
Creating File: /Users/markdavis/github/cldr-staging/docs/charts/42/tsv/locale-growth.tsv
FC_NFKC_Closure can't load in 3.2.0.0   FC_NFKC_Closure( from: 3.2.0-Update/DerivedNormalizationProps.txt)
FC_NFKC_Closure can't load in 4.0.0.0   FC_NFKC_Closure( from: 4.0.0-Update/DerivedNormalizationProps.txt)
Identifier_Status can't load in 6.3.0.0 Identifier_Status( from: /Users/markdavis/github/unicodetools/unicodetools/data/security/6.3.0/IdentifierStatus.txt)
Identifier_Status can't load in 7.0.0.0 Identifier_Status( from: /Users/markdavis/github/unicodetools/unicodetools/data/security/7.0.0/IdentifierStatus.txt)
Identifier_Status can't load in 8.0.0.0 Identifier_Status( from: /Users/markdavis/github/unicodetools/unicodetools/data/security/8.0.0/IdentifierStatus.txt)
Identifier_Type can't load in 6.3.0.0   Identifier_Type( from: /Users/markdavis/github/unicodetools/unicodetools/data/security/6.3.0/IdentifierType.txt)
Identifier_Type can't load in 7.0.0.0   Identifier_Type( from: /Users/markdavis/github/unicodetools/unicodetools/data/security/7.0.0/IdentifierType.txt)
Identifier_Type can't load in 8.0.0.0   Identifier_Type( from: /Users/markdavis/github/unicodetools/unicodetools/data/security/8.0.0/IdentifierType.txt)

markusicu commented 2 years ago

What do you/we need this for?

macchiati commented 2 years ago

I happened across this when looking at the growth of properties over time. So not a big priority.