unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Sentence segmentation is incorrect #4038

Closed eggrobin closed 1 year ago

eggrobin commented 1 year ago

Note that although I noticed this while working on the 15.1 update, this issue is about 15.0 segmentation as implemented in ICU4X with 15.0 property assignments, compared with the segmentation defined by the 15.0 standard (not that anything changed for sentence segmentation in 15.1).

I generated some tests using the ICU4C monkeys (at https://github.com/eggrobin/icu/commit/b1612851e4e715c37279a74bfcd97d4f2056fd0c, with seed 1729). The following test fails:

÷ FF1A × 000D ÷ 0ED8 × 14BE5 × 274B × 1EF9 × 1D4D2 × FE58 × FE55 × 2992 × 29D9 × 0028 × 2024 × 0D42 × 110BD ÷ E7BC × 2024 × 2024 × FF1A × FEFF × 1113E × 118BE × 11C9E × 1D4F1 × 00A0 × ECB6 × 057F × 0965 × 2028 ÷ FF63 × 2024 × E0184 ÷ 11B74 × 9FF3 × 0891 × 2007 × 1EFD × 2C2A × 2009 × FE5D × 2062 × 13435 × FF1A × 5F5A × 13435 × FF1A × 1A59 × 2006 × 12919 × 002E × FF0E × 1C41 × 061D × 2007 × 000A ÷ 1343C × 13437 × 111BF × 003A × 007D × FE55 × 205F × 104A ÷ 0272 × 002E × 002C × 11238 × 1C7F ÷ 1C53 × 1714 × 000D ÷ 1344B × 1B5E ÷ 1D60D × 0C6A × 2006 × 2024 × 1BCA0 ÷ 13D3 × 02AF × F288 × 2172 × 115C3 ÷ A034 × FF0D × 10F55 × 0FB0 ÷ 2C1C × 13A1 × 16B37 × 2024 × 0022 × 200B × 206C × 002E × 0A75 × 1DA3B ÷ 91C8 × A6F7 × 2047 ÷ 2ED9 × FE57 ÷ 01DA × 14B81 × 1736 ÷ 00F0 × FF5B × FF1F × 07F8 × 0195 × 2024 ÷ 83C7 × 000A ÷ 13CD × 1F56 × 0085 ÷ 20E6 × 01D6 × 1DA53 × 298C × 2049 × 1802 × 060D × 000A ÷ 1BCA3 × FE13 × 0A3F × FE52 × 24E6 × 2000 × 0DD9 × 2C76 × FF0E × 000A ÷ 002E × 13435 × 002E × 000D ÷ 206F × 1D77F × 002E × 1C58 × 1E005 × FE52 × FF0D × A6F3 × 275E × 202C × 1935 × 206D × 000D ÷ FE52 × FE43 ÷ 612C × 13438 × 2028 ÷ 3000 × 003A × FF0E × 206E ÷ 052A × 002E × 1945 ÷ EE47 × 3000 × 1090 × A60F ÷ 28F5 × 1712 × 104CF × 11EF6 × 0661 × 759B × 0962 × E76E × 13432 × FF0E × 000A ÷ 2000 × 16AF5 × 1DE9 × 2E04 × FF0C × 3000 × 000D ÷ FE63 × 206C × 000A ÷ 06F6 × 000D ÷ FF0E × 3001 × 16F71 × 0414 × FD3E × 0665 × FF09 × 0027 × 12870 × 000D ÷ 0ED8 × 000D ÷ 115CB × 1033 × A950 × 2063 × 300F ÷ 773F × 0009 × 2029 ÷ 0890 × 2002 × FE0A × 2C8A × 09ED × FE5D × 2029 ÷ 15234 × 2E3C ÷ A9D7 × 061C × 000A ÷ 0496 × 002E × 3000 ÷ 4130 × 2008 × 002E × 1A86 × 000D ÷ 112A9 ÷ 5F2F × 0029 × 2024 ÷ 1F2E × 12CB4 × 118AC × 24D2 × 000D ÷ 1D402 × 11BBA × 2028 ÷ 2C99 × 0C66 × 11048 × 10A57 × FF0E × 0456 × FF0E × 2000 × 205F × 202A ÷ 1516B × 3001 × 111D3 × EEEC × 1045 × 0085 ÷ 1BCA0 × A9B9 × 119D3 × 5F09 × 20DD × 0085 ÷ 2C03 × 00AD × 158F0 × 2C85 × 1F678 × FE52 × E7D3 × 27EB × 16A65 × 2008 × 11EC1 × FE58 × 2000 × 11956 
× 15B30 × 10B77 × 2C7B × 2007 × 1D179 × 1808 × 205F × 002E × 0DE6 × 9B18 × 2024 × 000A ÷ 0E52 × 2006 × FE11 × 0FB2 × 00AD × 1EFB × 0085 ÷ 1261 × 2024 ÷ 5437 × 2024 × 000A ÷ 1F176 × 118B1 × 000A ÷ FE50 × 2007 × FE13 × 150E6 × 2005 × 13439 × 1D46C × 8E9F × A8D2 × FE10 × 1DF11 × 1E947 × 10A8 × 13F97 × 2E98 × 1FBF7 × 0C67 × 12E45 × 491F × 2002 × 9C12 × 002E × A92D × 0085 ÷ 1808 × 5CB0 × 08E2 × 1D6AA × 23DC × 10CE8 × 00AB × 115C3 × 2029 ÷ 10449 × 0F9F × 04EA × 1D7B3 × 115CA × 002C × 03F9 × 10F56 × 0670 ÷ 5BC5 × FE32 × FE32 × 7EEA × A68B × 09D6 × 07AF × 2002 × 11B79 × 2CB4 × 1D6BC × 00DE × 1498B × 2E00 × 2067 × 104A4 × 206B × 2308 × 76CB × 10413 × 200A × 115E5 × 002E × FF64 × 16A61 × 276C × 2001 × 1D76A × 000A ÷ 3000 × 1D5A1 × FE11 × 774F × 2024 × 1BC9D × 1934 ÷ 1D6F9 × 11136 × 1BA7 × 002E × 0029 ÷ 9616 × 1D4D2 × 1F15E × 002E ÷ 228B × 2000 × 1C7F × 07F8 × FF0E ÷ 96EF × 2009 × FF64 × 1E05F × 301D × 000D ÷ 0890 × 114D5 × 11DA8 × 2DEF × 2024 × FFFB × 118D7 × 1DAA6 × E051 × 1802 × 0CC4 × 06D4 × FE3A ÷ 67DB × 2014 × 129E × 2024 × 00A0 × 2029 ÷ 11DA6 × 2028 ÷ 301A × 115D1 × 16A6E × A69F × 11F36 ÷ F866 × 2029 ÷ 1144B × 055D × 002E × 1D174 ÷ 15C6C × 0241 × A8CF × FE51 × 7188 × 2028 ÷ 2007 × 016E × 9047 × 891E × 0208 × 298E × 206F × 11A34 × 2C40 × 336E × 0020 × 1043 × 0603 × 2006 × 1E04 × 2024 × AA5F ÷ 16B55 × 2028 ÷ 096F × 000D ÷ A92F ÷ 1E2F5 × 11237 × 111B5 × 110BF × 10F89 × 203C ÷ 11DA5 × 200F × 2028 ÷ 1DF5 × 104B6 × 16A60 × 2006 × 104B × FE52 × FE52 × 2069 × 3001 × 1680 × FE10 × 2018 × 203D ÷ 0665 × A692 × 07F8 × 3000 × 2028 ÷ 44CB × 2024 × 002E × 202F × 1F27 × EF0E × 000A ÷ 104B × FF0E ÷  # 🐒
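
For readers unfamiliar with the notation: the test line above follows the convention of the Unicode break test files, where `÷` marks a boundary, `×` marks the absence of a boundary, code points are written in hexadecimal, and everything after `#` is a comment. A minimal sketch of a parser for such a line, not taken from ICU4C or ICU4X (the function name `parse_break_test` is hypothetical), could look like this:

```python
def parse_break_test(line):
    """Split a break-test line into code points and boundary flags.

    Returns (code_points, breaks). breaks[i] is True if there is a boundary
    before code_points[i]; the last entry is the boundary at the end of text.
    """
    data = line.split("#", 1)[0]  # drop the trailing comment, if any
    code_points, breaks = [], []
    for token in data.split():  # split() also tolerates a wrapped line
        if token == "÷":
            breaks.append(True)
        elif token == "×":
            breaks.append(False)
        else:
            code_points.append(int(token, 16))
    if len(breaks) != len(code_points) + 1:
        raise ValueError("expected a marker before, between and after code points")
    return code_points, breaks


# Shortened illustration using the first three code points of the failing test.
code_points, breaks = parse_break_test("÷ FF1A × 000D ÷ 0ED8 ÷  # 🐒")
print([f"{cp:04X}" for cp in code_points], breaks)
```

The sketch only checks that the number of markers matches the number of code points; it does not verify that markers and code points strictly alternate.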

The error messages of the existing tests are not particularly helpful when dealing with large random sequences like this one, so I tried printing something closer to the output of the ICU4C monkey tests; see below (output from https://github.com/eggrobin/icu4x/commit/e6ac4ee3b8b3327ca992aa20a73957827becb6b1).
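
As a rough illustration of this kind of output, here is a sketch that is not the code from the linked commit. It assumes that the `A` column holds the boundaries actually produced and the `E` column the expected ones, which is how the table below reads; the function name `format_comparison` is hypothetical.

```python
def format_comparison(code_points, properties, actual, expected):
    """Return table lines comparing actual and expected boundaries.

    actual[i] and expected[i] say whether a boundary precedes code_points[i].
    """
    def mark(is_break):
        return "÷" if is_break else "×"

    lines = ["  | A | E | Code pt. | Sentence_Break | Literal"]
    for cp, prop, a, e in zip(code_points, properties, actual, expected):
        flag = "  " if a == e else "😭"  # highlight rows where the two disagree
        lines.append(
            f"{flag}| {mark(a)} | {mark(e)} | {format(cp, '04X'):>8} | {prop:>14} | {chr(cp)}"
        )
    return lines


# Three rows modelled on one of the disagreements in the table.
for row in format_comparison(
    [0xFE52, 0xFE43, 0x612C],
    ["ATerm", "Close", "OLetter"],
    [True, True, False],   # assumed meaning: boundaries actually produced
    [True, False, True],   # assumed meaning: boundaries expected
):
    print(row)
```

Printing the literal character directly means that separators such as U+2028 or U+000A break the table layout, which is why blank lines appear inside the table.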

Note that SentenceBreak(14) is SContinue; see #4037.

At a glance, it looks like rules SB9 and SB11 are improperly applied.
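
For reference, UAX #29 states these rules as SB9: `SATerm Close* × (Close | Sp | ParaSep)`, SB10: `SATerm Close* Sp* × (Sp | ParaSep)`, and SB11: `SATerm Close* Sp* ParaSep? ÷`. The following is a deliberately simplified sketch of SB9 to SB11 only, operating on property names rather than on text. It assumes that `Extend` and `Format` have already been removed (SB5), it ignores SB6 to SB8a (which take precedence in a full implementation), and it omits the optional trailing `ParaSep` of SB11, since SB4 already breaks there. The function name `sb9_to_sb11` is hypothetical and this is not how ICU4X implements segmentation.

```python
SA_TERM = {"STerm", "ATerm"}
PARA_SEP = {"Sep", "CR", "LF"}


def sb9_to_sb11(props, i):
    """Decide the boundary before props[i] using SB9, SB10 and SB11 only.

    Returns "×" (no break), "÷" (break), or None if the left context is not
    SATerm Close* Sp*, in which case none of the three rules applies.
    """
    j = i - 1
    spaces = 0
    while j >= 0 and props[j] == "Sp":  # skip Sp* going left
        spaces += 1
        j -= 1
    while j >= 0 and props[j] == "Close":  # skip Close* going left
        j -= 1
    if j < 0 or props[j] not in SA_TERM:
        return None
    following = props[i] if i < len(props) else None
    # SB9: SATerm Close* × (Close | Sp | ParaSep)
    if spaces == 0 and (following in ("Close", "Sp") or following in PARA_SEP):
        return "×"
    # SB10: SATerm Close* Sp* × (Sp | ParaSep)
    if following == "Sp" or following in PARA_SEP:
        return "×"
    # SB11: SATerm Close* Sp* ÷
    return "÷"


# ATerm, Close, OLetter: the pattern of the rows around FE52, FE43, 612C.
props = ["ATerm", "Close", "OLetter"]
print(sb9_to_sb11(props, 1), sb9_to_sb11(props, 2))
```

Under these rules the boundary belongs after the `Close`, not before it, which matches the `E` column of the table for the `ATerm Close OLetter` and `STerm Close OLetter` rows.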

  | A | E | Code pt. | Sentence_Break | Literal
  | ÷ | ÷ |     FF1A | SentenceBreak(14) | :
  | × | × |     000D |             CR |
  | ÷ | ÷ |     0ED8 |        Numeric | ໘
  | × | × |    14BE5 |          Other | 𔯥
  | × | × |     274B |          Other | ❋
  | × | × |     1EF9 |          Lower | ỹ
  | × | × |    1D4D2 |          Upper | 𝓒
  | × | × |     FE58 | SentenceBreak(14) | ﹘
  | × | × |     FE55 | SentenceBreak(14) | ﹕
  | × | × |     2992 |          Close | ⦒
  | × | × |     29D9 |          Close | ⧙
  | × | × |     0028 |          Close | (
  | × | × |     2024 |          ATerm | ․
  | × | × |     0D42 |         Extend | ൂ
  | × | × |    110BD |         Format | 𑂽
  | ÷ | ÷ |     E7BC |          Other | 
  | × | × |     2024 |          ATerm | ․
  | × | × |     2024 |          ATerm | ․
  | × | × |     FF1A | SentenceBreak(14) | :
  | × | × |     FEFF |         Format | 
  | × | × |    1113E |        Numeric | 𑄾
  | × | × |    118BE |          Upper | 𑢾
  | × | × |    11C9E |         Extend | 𑲞
  | × | × |    1D4F1 |          Lower | 𝓱
  | × | × |     00A0 |             Sp |  
  | × | × |     ECB6 |          Other | 
  | × | × |     057F |          Lower | տ
  | × | × |     0965 |          STerm | ॥
  | × | × |     2028 |            Sep | 

  | ÷ | ÷ |     FF63 |          Close | 」
  | × | × |     2024 |          ATerm | ․
  | × | × |    E0184 |         Extend | 󠆄
  | ÷ | ÷ |    11B74 |          Other | 𑭴
  | × | × |     9FF3 |        OLetter | 鿳
  | × | × |     0891 |         Format | ࢑
  | × | × |     2007 |             Sp |  
  | × | × |     1EFD |          Lower | ỽ
  | × | × |     2C2A |          Upper | Ⱚ
  | × | × |     2009 |             Sp |  
  | × | × |     FE5D |          Close | ﹝
  | × | × |     2062 |         Format | ⁢
  | × | × |    13435 |         Format | 𓐵
  | × | × |     FF1A | SentenceBreak(14) | :
  | × | × |     5F5A |        OLetter | 彚
  | × | × |    13435 |         Format | 𓐵
  | × | × |     FF1A | SentenceBreak(14) | :
  | × | × |     1A59 |         Extend | ᩙ
  | × | × |     2006 |             Sp |  
  | × | × |    12919 |          Other | 𒤙
  | × | × |     002E |          ATerm | .
  | × | × |     FF0E |          ATerm | .
  | × | × |     1C41 |        Numeric | ᱁
  | × | × |     061D |          STerm | ؝
  | × | × |     2007 |             Sp |  
  | × | × |     000A |             LF |

  | ÷ | ÷ |    1343C |         Format | 𓐼
  | × | × |    13437 |         Format | 𓐷
  | × | × |    111BF |         Extend | 𑆿
  | × | × |     003A | SentenceBreak(14) | :
  | × | × |     007D |          Close | }
  | × | × |     FE55 | SentenceBreak(14) | ﹕
  | × | × |     205F |             Sp |  
  | × | × |     104A |          STerm | ၊
  | ÷ | ÷ |     0272 |          Lower | ɲ
  | × | × |     002E |          ATerm | .
  | × | × |     002C | SentenceBreak(14) | ,
  | × | × |    11238 |          STerm | 𑈸
  | × | × |     1C7F |          STerm | ᱿
  | ÷ | ÷ |     1C53 |        Numeric | ᱓
  | × | × |     1714 |         Extend | ᜔
  | × | × |     000D |             CR |
  | ÷ | ÷ |    1344B |         Extend | 𓑋
  | × | × |     1B5E |          STerm | ᭞
  | ÷ | ÷ |    1D60D |          Upper | 𝘍
  | × | × |     0C6A |        Numeric | ౪
  | × | × |     2006 |             Sp |  
  | × | × |     2024 |          ATerm | ․
  | × | × |    1BCA0 |         Format | 𛲠
  | ÷ | ÷ |     13D3 |          Upper | Ꮣ
  | × | × |     02AF |          Lower | ʯ
  | × | × |     F288 |          Other | 
  | × | × |     2172 |          Lower | ⅲ
  | × | × |    115C3 |          STerm | 𑗃
  | ÷ | ÷ |     A034 |        OLetter | ꀴ
  | × | × |     FF0D | SentenceBreak(14) | -
  | × | × |    10F55 |          STerm | 𐽕
  | × | × |     0FB0 |         Extend | ྰ
  | ÷ | ÷ |     2C1C |          Upper | Ⱌ
  | × | × |     13A1 |          Upper | Ꭱ
  | × | × |    16B37 |          STerm | 𖬷
  | × | × |     2024 |          ATerm | ․
  | × | × |     0022 |          Close | "
  | × | × |     200B |         Format | ​
  | × | × |     206C |         Format | 
  | × | × |     002E |          ATerm | .
  | × | × |     0A75 |         Extend | ੵ
  | × | × |    1DA3B |         Extend | 𝨻
  | ÷ | ÷ |     91C8 |        OLetter | 釈
  | × | × |     A6F7 |          STerm | ꛷
  | × | × |     2047 |          STerm | ⁇
  | ÷ | ÷ |     2ED9 |          Other | ⻙
  | × | × |     FE57 |          STerm | ﹗
  | ÷ | ÷ |     01DA |          Lower | ǚ
  | × | × |    14B81 |          Other | 𔮁
  | × | × |     1736 |          STerm | ᜶
  | ÷ | ÷ |     00F0 |          Lower | ð
  | × | × |     FF5B |          Close | {
  | × | × |     FF1F |          STerm | ?
  | × | × |     07F8 | SentenceBreak(14) | ߸
  | × | × |     0195 |          Lower | ƕ
  | × | × |     2024 |          ATerm | ․
  | ÷ | ÷ |     83C7 |        OLetter | 菇
  | × | × |     000A |             LF |

  | ÷ | ÷ |     13CD |          Upper | Ꮝ
  | × | × |     1F56 |          Lower | ὖ
  | × | × |     0085 |            Sep |
  | ÷ | ÷ |     20E6 |         Extend | ⃦
  | × | × |     01D6 |          Lower | ǖ
  | × | × |    1DA53 |         Extend | 𝩓
  | × | × |     298C |          Close | ⦌
  | × | × |     2049 |          STerm | ⁉
  | × | × |     1802 | SentenceBreak(14) | ᠂
  | × | × |     060D | SentenceBreak(14) | ؍
  | × | × |     000A |             LF |

  | ÷ | ÷ |    1BCA3 |         Format | 𛲣
  | × | × |     FE13 | SentenceBreak(14) | ︓
  | × | × |     0A3F |         Extend | ਿ
  | × | × |     FE52 |          ATerm | ﹒
  | × | × |     24E6 |          Lower | ⓦ
  | × | × |     2000 |             Sp |  
  | × | × |     0DD9 |         Extend | ෙ
  | × | × |     2C76 |          Lower | ⱶ
  | × | × |     FF0E |          ATerm | .
  | × | × |     000A |             LF |

  | ÷ | ÷ |     002E |          ATerm | .
  | × | × |    13435 |         Format | 𓐵
  | × | × |     002E |          ATerm | .
  | × | × |     000D |             CR |
  | ÷ | ÷ |     206F |         Format | 
  | × | × |    1D77F |          Lower | 𝝿
  | × | × |     002E |          ATerm | .
  | × | × |     1C58 |        Numeric | ᱘
  | × | × |    1E005 |         Extend | 𞀅
  | × | × |     FE52 |          ATerm | ﹒
  | × | × |     FF0D | SentenceBreak(14) | -
  | × | × |     A6F3 |          STerm | ꛳
  | × | × |     275E |          Close | ❞
  | × | × |     202C |         Format | ‬
  | × | × |     1935 |         Extend | ᤵ
  | × | × |     206D |         Format | 
  | × | × |     000D |             CR |
  | ÷ | ÷ |     FE52 |          ATerm | ﹒
😭| ÷ | × |     FE43 |          Close | ﹃
😭| × | ÷ |     612C |        OLetter | 愬
  | × | × |    13438 |         Format | 𓐸
  | × | × |     2028 |            Sep | 

  | ÷ | ÷ |     3000 |             Sp |  
  | × | × |     003A | SentenceBreak(14) | :
  | × | × |     FF0E |          ATerm | .
  | × | × |     206E |         Format | 
  | ÷ | ÷ |     052A |          Upper | Ԫ
  | × | × |     002E |          ATerm | .
  | × | × |     1945 |          STerm | ᥅
  | ÷ | ÷ |     EE47 |          Other | 
  | × | × |     3000 |             Sp |  
  | × | × |     1090 |        Numeric | ႐
  | × | × |     A60F |          STerm | ꘏
  | ÷ | ÷ |     28F5 |          Other | ⣵
  | × | × |     1712 |         Extend | ᜒ
  | × | × |    104CF |          Upper | 𐓏
  | × | × |    11EF6 |         Extend | 𑻶
  | × | × |     0661 |        Numeric | ١
  | × | × |     759B |        OLetter | 疛
  | × | × |     0962 |         Extend | ॢ
  | × | × |     E76E |          Other | 
  | × | × |    13432 |         Format | 𓐲
  | × | × |     FF0E |          ATerm | .
  | × | × |     000A |             LF |

  | ÷ | ÷ |     2000 |             Sp |  
  | × | × |    16AF5 |          STerm | 𖫵
  | × | × |     1DE9 |         Extend | ᷩ
  | × | × |     2E04 |          Close | ⸄
  | × | × |     FF0C | SentenceBreak(14) | ,
  | × | × |     3000 |             Sp |  
  | × | × |     000D |             CR |
  | ÷ | ÷ |     FE63 | SentenceBreak(14) | ﹣
  | × | × |     206C |         Format | 
  | × | × |     000A |             LF |

  | ÷ | ÷ |     06F6 |        Numeric | ۶
  | × | × |     000D |             CR |
  | ÷ | ÷ |     FF0E |          ATerm | .
  | × | × |     3001 | SentenceBreak(14) | 、
  | × | × |    16F71 |         Extend | 𖽱
  | × | × |     0414 |          Upper | Д
  | × | × |     FD3E |          Close | ﴾
  | × | × |     0665 |        Numeric | ٥
  | × | × |     FF09 |          Close | )
  | × | × |     0027 |          Close | '
  | × | × |    12870 |          Other | 𒡰
  | × | × |     000D |             CR |
  | ÷ | ÷ |     0ED8 |        Numeric | ໘
  | × | × |     000D |             CR |
  | ÷ | ÷ |    115CB |          STerm | 𑗋
  | × | × |     1033 |         Extend | ဳ
  | × | × |     A950 |         Extend | ꥐ
  | × | × |     2063 |         Format | ⁣
😭| ÷ | × |     300F |          Close | 』
😭| × | ÷ |     773F |        OLetter | 眿
  | × | × |     0009 |             Sp |
  | × | × |     2029 |            Sep | 

  | ÷ | ÷ |     0890 |         Format | ࢐
  | × | × |     2002 |             Sp |  
  | × | × |     FE0A |         Extend | ︊
  | × | × |     2C8A |          Upper | Ⲋ
  | × | × |     09ED |        Numeric | ৭
  | × | × |     FE5D |          Close | ﹝
  | × | × |     2029 |            Sep | 

  | ÷ | ÷ |    15234 |          Other | 𕈴
  | × | × |     2E3C |          STerm | ⸼
  | ÷ | ÷ |     A9D7 |        Numeric | ꧗
  | × | × |     061C |         Format | ؜
  | × | × |     000A |             LF |

  | ÷ | ÷ |     0496 |          Upper | Җ
  | × | × |     002E |          ATerm | .
  | × | × |     3000 |             Sp |  
  | ÷ | ÷ |     4130 |        OLetter | 䄰
  | × | × |     2008 |             Sp |  
  | × | × |     002E |          ATerm | .
  | × | × |     1A86 |        Numeric | ᪆
  | × | × |     000D |             CR |
  | ÷ | ÷ |    112A9 |          STerm | 𑊩
  | ÷ | ÷ |     5F2F |        OLetter | 弯
  | × | × |     0029 |          Close | )
  | × | × |     2024 |          ATerm | ․
  | ÷ | ÷ |     1F2E |          Upper | Ἦ
  | × | × |    12CB4 |          Other | 𒲴
  | × | × |    118AC |          Upper | 𑢬
  | × | × |     24D2 |          Lower | ⓒ
  | × | × |     000D |             CR |
  | ÷ | ÷ |    1D402 |          Upper | 𝐂
  | × | × |    11BBA |          Other | 𑮺
  | × | × |     2028 |            Sep | 

  | ÷ | ÷ |     2C99 |          Lower | ⲙ
  | × | × |     0C66 |        Numeric | ౦
  | × | × |    11048 |          STerm | 𑁈
  | × | × |    10A57 |          STerm | 𐩗
  | × | × |     FF0E |          ATerm | .
  | × | × |     0456 |          Lower | і
  | × | × |     FF0E |          ATerm | .
  | × | × |     2000 |             Sp |  
  | × | × |     205F |             Sp |  
  | × | × |     202A |         Format | ‪
  | ÷ | ÷ |    1516B |          Other | 𕅫
  | × | × |     3001 | SentenceBreak(14) | 、
  | × | × |    111D3 |        Numeric | 𑇓
  | × | × |     EEEC |          Other | 
  | × | × |     1045 |        Numeric | ၅
  | × | × |     0085 |            Sep |
  | ÷ | ÷ |    1BCA0 |         Format | 𛲠
  | × | × |     A9B9 |         Extend | ꦹ
  | × | × |    119D3 |         Extend | 𑧓
  | × | × |     5F09 |        OLetter | 弉
  | × | × |     20DD |         Extend | ⃝
  | × | × |     0085 |            Sep |
  | ÷ | ÷ |     2C03 |          Upper | Ⰳ
  | × | × |     00AD |         Format | ­
  | × | × |    158F0 |          Other | 𕣰
  | × | × |     2C85 |          Lower | ⲅ
  | × | × |    1F678 |          Close | 🙸
  | × | × |     FE52 |          ATerm | ﹒
😭| ÷ | × |     E7D3 |          Other | 
  | × | × |     27EB |          Close | ⟫
  | × | × |    16A65 |        Numeric | 𖩥
  | × | × |     2008 |             Sp |  
  | × | × |    11EC1 |          Other | 𑻁
  | × | × |     FE58 | SentenceBreak(14) | ﹘
  | × | × |     2000 |             Sp |  
  | × | × |    11956 |        Numeric | 𑥖
  | × | × |    15B30 |          Other | 𕬰
  | × | × |    10B77 |          Other | 𐭷
  | × | × |     2C7B |          Lower | ⱻ
  | × | × |     2007 |             Sp |  
  | × | × |    1D179 |         Format | 𝅹
  | × | × |     1808 | SentenceBreak(14) | ᠈
  | × | × |     205F |             Sp |  
  | × | × |     002E |          ATerm | .
  | × | × |     0DE6 |        Numeric | ෦
  | × | × |     9B18 |        OLetter | 鬘
  | × | × |     2024 |          ATerm | ․
  | × | × |     000A |             LF |

  | ÷ | ÷ |     0E52 |        Numeric | ๒
  | × | × |     2006 |             Sp |  
  | × | × |     FE11 | SentenceBreak(14) | ︑
  | × | × |     0FB2 |         Extend | ྲ
  | × | × |     00AD |         Format | ­
  | × | × |     1EFB |          Lower | ỻ
  | × | × |     0085 |            Sep |
  | ÷ | ÷ |     1261 |        OLetter | ቡ
  | × | × |     2024 |          ATerm | ․
  | ÷ | ÷ |     5437 |        OLetter | 吷
  | × | × |     2024 |          ATerm | ․
  | × | × |     000A |             LF |

  | ÷ | ÷ |    1F176 |          Upper | 🅶
  | × | × |    118B1 |          Upper | 𑢱
  | × | × |     000A |             LF |

  | ÷ | ÷ |     FE50 | SentenceBreak(14) | ﹐
  | × | × |     2007 |             Sp |  
  | × | × |     FE13 | SentenceBreak(14) | ︓
  | × | × |    150E6 |          Other | 𕃦
  | × | × |     2005 |             Sp |  
  | × | × |    13439 |         Format | 𓐹
  | × | × |    1D46C |          Upper | 𝑬
  | × | × |     8E9F |        OLetter | 躟
  | × | × |     A8D2 |        Numeric | ꣒
  | × | × |     FE10 | SentenceBreak(14) | ︐
  | × | × |    1DF11 |          Lower | 𝼑
  | × | × |    1E947 |         Extend | 𞥇
  | × | × |     10A8 |          Upper | Ⴈ
  | × | × |    13F97 |          Other | 𓾗
  | × | × |     2E98 |          Other | ⺘
  | × | × |    1FBF7 |        Numeric | 🯷
  | × | × |     0C67 |        Numeric | ౧
  | × | × |    12E45 |          Other | 𒹅
  | × | × |     491F |        OLetter | 䤟
  | × | × |     2002 |             Sp |  
  | × | × |     9C12 |        OLetter | 鰒
  | × | × |     002E |          ATerm | .
  | × | × |     A92D |         Extend | ꤭
  | × | × |     0085 |            Sep |
  | ÷ | ÷ |     1808 | SentenceBreak(14) | ᠈
  | × | × |     5CB0 |        OLetter | 岰
  | × | × |     08E2 |         Format | ࣢
  | × | × |    1D6AA |          Upper | 𝚪
  | × | × |     23DC |          Other | ⏜
  | × | × |    10CE8 |          Lower | 𐳨
  | × | × |     00AB |          Close | «
  | × | × |    115C3 |          STerm | 𑗃
  | × | × |     2029 |            Sep | 

  | ÷ | ÷ |    10449 |          Lower | 𐑉
  | × | × |     0F9F |         Extend | ྟ
  | × | × |     04EA |          Upper | Ӫ
  | × | × |    1D7B3 |          Lower | 𝞳
  | × | × |    115CA |          STerm | 𑗊
  | × | × |     002C | SentenceBreak(14) | ,
  | × | × |     03F9 |          Upper | Ϲ
  | × | × |    10F56 |          STerm | 𐽖
  | × | × |     0670 |         Extend | ٰ
  | ÷ | ÷ |     5BC5 |        OLetter | 寅
  | × | × |     FE32 | SentenceBreak(14) | ︲
  | × | × |     FE32 | SentenceBreak(14) | ︲
  | × | × |     7EEA |        OLetter | 绪
  | × | × |     A68B |          Lower | ꚋ
  | × | × |     09D6 |          Other | ৖
  | × | × |     07AF |         Extend | ޯ
  | × | × |     2002 |             Sp |  
  | × | × |    11B79 |          Other | 𑭹
  | × | × |     2CB4 |          Upper | Ⲵ
  | × | × |    1D6BC |          Upper | 𝚼
  | × | × |     00DE |          Upper | Þ
  | × | × |    1498B |          Other | 𔦋
  | × | × |     2E00 |          Close | ⸀
  | × | × |     2067 |         Format | ⁧
  | × | × |    104A4 |        Numeric | 𐒤
  | × | × |     206B |         Format | 
  | × | × |     2308 |          Close | ⌈
  | × | × |     76CB |        OLetter | 盋
  | × | × |    10413 |          Upper | 𐐓
  | × | × |     200A |             Sp |  
  | × | × |    115E5 |          Other | 𑗥
  | × | × |     002E |          ATerm | .
  | × | × |     FF64 | SentenceBreak(14) | 、
  | × | × |    16A61 |        Numeric | 𖩡
  | × | × |     276C |          Close | ❬
  | × | × |     2001 |             Sp |  
  | × | × |    1D76A |          Upper | 𝝪
  | × | × |     000A |             LF |

  | ÷ | ÷ |     3000 |             Sp |  
  | × | × |    1D5A1 |          Upper | 𝖡
  | × | × |     FE11 | SentenceBreak(14) | ︑
  | × | × |     774F |        OLetter | 睏
  | × | × |     2024 |          ATerm | ․
  | × | × |    1BC9D |         Extend | 𛲝
  | × | × |     1934 |         Extend | ᤴ
  | ÷ | ÷ |    1D6F9 |          Upper | 𝛹
  | × | × |    11136 |        Numeric | 𑄶
  | × | × |     1BA7 |         Extend | ᮧ
  | × | × |     002E |          ATerm | .
😭| ÷ | × |     0029 |          Close | )
😭| × | ÷ |     9616 |        OLetter | 阖
  | × | × |    1D4D2 |          Upper | 𝓒
  | × | × |    1F15E |          Upper | 🅞
  | × | × |     002E |          ATerm | .
  | ÷ | ÷ |     228B |          Other | ⊋
  | × | × |     2000 |             Sp |  
  | × | × |     1C7F |          STerm | ᱿
  | × | × |     07F8 | SentenceBreak(14) | ߸
  | × | × |     FF0E |          ATerm | .
  | ÷ | ÷ |     96EF |        OLetter | 雯
  | × | × |     2009 |             Sp |  
  | × | × |     FF64 | SentenceBreak(14) | 、
  | × | × |    1E05F |          Lower | 𞁟
  | × | × |     301D |          Close | 〝
  | × | × |     000D |             CR |
  | ÷ | ÷ |     0890 |         Format | ࢐
  | × | × |    114D5 |        Numeric | 𑓕
  | × | × |    11DA8 |        Numeric | 𑶨
  | × | × |     2DEF |         Extend | ⷯ
  | × | × |     2024 |          ATerm | ․
  | × | × |     FFFB |         Format | 
  | × | × |    118D7 |          Lower | 𑣗
  | × | × |    1DAA6 |         Extend | 𝪦
  | × | × |     E051 |          Other | 
  | × | × |     1802 | SentenceBreak(14) | ᠂
  | × | × |     0CC4 |         Extend | ೄ
  | × | × |     06D4 |          STerm | ۔
😭| ÷ | × |     FE3A |          Close | ︺
😭| × | ÷ |     67DB |        OLetter | 柛
  | × | × |     2014 | SentenceBreak(14) | —
  | × | × |     129E |        OLetter | ኞ
  | × | × |     2024 |          ATerm | ․
  | × | × |     00A0 |             Sp |  
  | × | × |     2029 |            Sep | 

  | ÷ | ÷ |    11DA6 |        Numeric | 𑶦
  | × | × |     2028 |            Sep | 

  | ÷ | ÷ |     301A |          Close | 〚
  | × | × |    115D1 |          STerm | 𑗑
  | × | × |    16A6E |          STerm | 𖩮
  | × | × |     A69F |         Extend | ꚟ
  | × | × |    11F36 |         Extend | 𑼶
  | ÷ | ÷ |     F866 |          Other | 
  | × | × |     2029 |            Sep | 

  | ÷ | ÷ |    1144B |          STerm | 𑑋
  | × | × |     055D | SentenceBreak(14) | ՝
  | × | × |     002E |          ATerm | .
  | × | × |    1D174 |         Format | 𝅴
  | ÷ | ÷ |    15C6C |          Other | 𕱬
  | × | × |     0241 |          Upper | Ɂ
  | × | × |     A8CF |          STerm | ꣏
  | × | × |     FE51 | SentenceBreak(14) | ﹑
  | × | × |     7188 |        OLetter | 熈
  | × | × |     2028 |            Sep | 

  | ÷ | ÷ |     2007 |             Sp |  
  | × | × |     016E |          Upper | Ů
  | × | × |     9047 |        OLetter | 遇
  | × | × |     891E |        OLetter | 褞
  | × | × |     0208 |          Upper | Ȉ
  | × | × |     298E |          Close | ⦎
  | × | × |     206F |         Format | 
  | × | × |    11A34 |         Extend | 𑨴
  | × | × |     2C40 |          Lower | ⱀ
  | × | × |     336E |          Other | ㍮
  | × | × |     0020 |             Sp |
  | × | × |     1043 |        Numeric | ၃
  | × | × |     0603 |         Format | ؃
  | × | × |     2006 |             Sp |  
  | × | × |     1E04 |          Upper | Ḅ
  | × | × |     2024 |          ATerm | ․
  | × | × |     AA5F |          STerm | ꩟
  | ÷ | ÷ |    16B55 |        Numeric | 𖭕
  | × | × |     2028 |            Sep | 

  | ÷ | ÷ |     096F |        Numeric | ९
  | × | × |     000D |             CR |
  | ÷ | ÷ |     A92F |          STerm | ꤯
  | ÷ | ÷ |    1E2F5 |        Numeric | 𞋵
  | × | × |    11237 |         Extend | 𑈷
  | × | × |    111B5 |         Extend | 𑆵
  | × | × |    110BF |          STerm | 𑂿
  | × | × |    10F89 |          STerm | 𐾉
  | × | × |     203C |          STerm | ‼
  | ÷ | ÷ |    11DA5 |        Numeric | 𑶥
  | × | × |     200F |         Format | ‏
  | × | × |     2028 |            Sep | 

  | ÷ | ÷ |     1DF5 |         Extend | ᷵
  | × | × |    104B6 |          Upper | 𐒶
  | × | × |    16A60 |        Numeric | 𖩠
  | × | × |     2006 |             Sp |  
  | × | × |     104B |          STerm | ။
  | × | × |     FE52 |          ATerm | ﹒
  | × | × |     FE52 |          ATerm | ﹒
  | × | × |     2069 |         Format | ⁩
  | × | × |     3001 | SentenceBreak(14) | 、
  | × | × |     1680 |             Sp |  
  | × | × |     FE10 | SentenceBreak(14) | ︐
  | × | × |     2018 |          Close | ‘
  | × | × |     203D |          STerm | ‽
  | ÷ | ÷ |     0665 |        Numeric | ٥
  | × | × |     A692 |          Upper | Ꚓ
  | × | × |     07F8 | SentenceBreak(14) | ߸
  | × | × |     3000 |             Sp |  
  | × | × |     2028 |            Sep | 

  | ÷ | ÷ |     44CB |        OLetter | 䓋
  | × | × |     2024 |          ATerm | ․
  | × | × |     002E |          ATerm | .
  | × | × |     202F |             Sp |  
  | × | × |     1F27 |          Lower | ἧ
  | × | × |     EF0E |          Other | 
  | × | × |     000A |             LF |

  | ÷ | ÷ |     104B |          STerm | ။
  | × | × |     FF0E |          ATerm | .

aethanyc commented 1 year ago

cc @makotokato

eggrobin commented 1 year ago

(FYI, it seems I have made progress on a fix.)