open-i18n / rust-unic

UNIC: Unicode and Internationalization Crates for Rust
https://crates.io/crates/unic

WB3d: Keep horizontal whitespace together. #269

Closed: scottmcm closed this issue 5 years ago

scottmcm commented 5 years ago

From the examples in https://docs.rs/unic-segment/0.9.0/unic_segment/, it appears that this crate (like unicode-segmentation) treats the position between two consecutive spaces as a word boundary:

assert_eq!(
    WordBounds::new("The quick (\"brown\")  fox").collect::<Vec<&str>>(),
    &["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", " ", "fox"]
);

However, WB3d says "Keep horizontal whitespace together.", with the rule "WSegSpace × WSegSpace". The test file https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/WordBreakTest.txt confirms that there should be no break between consecutive spaces:

÷ 0020 × 0020 ÷ #  ÷ [0.2] SPACE (WSegSpace) × [3.4] SPACE (WSegSpace) ÷ [0.3]
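
For reference, here is a minimal sketch of how a line like the one above can be turned into the segments it expects. In the test file, `÷` marks a break opportunity, `×` marks a position with no break, the tokens in between are hexadecimal code points, and everything after `#` is a comment. The function name `parse_break_test_line` is hypothetical and not part of unic-segment or unicode-segmentation.

```rust
/// Parses one data line of WordBreakTest.txt into the segments it expects.
/// Hypothetical helper for illustration; not an API of any crate.
/// Returns `None` for comment-only lines or tokens that are not valid
/// hexadecimal scalar values.
fn parse_break_test_line(line: &str) -> Option<Vec<String>> {
    // Everything after the first '#' is a human-readable comment.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }

    let mut segments = Vec::new();
    let mut current = String::new();
    for token in data.split_whitespace() {
        match token {
            // Break opportunity: close the segment collected so far.
            "÷" => {
                if !current.is_empty() {
                    segments.push(std::mem::take(&mut current));
                }
            }
            // No break: keep appending to the current segment.
            "×" => {}
            // Anything else is expected to be a hexadecimal code point.
            hex => {
                let cp = u32::from_str_radix(hex, 16).ok()?;
                current.push(char::from_u32(cp)?);
            }
        }
    }
    // Tolerate a line that does not end with a final '÷'.
    if !current.is_empty() {
        segments.push(current);
    }
    Some(segments)
}
```

Applied to the line quoted above, this yields a single segment containing both spaces, which is what WB3d requires.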

Is this a bug, or am I misunderstanding something?

scottmcm commented 5 years ago

Ah, this is a duplicate of https://github.com/open-i18n/rust-unic/issues/259: the rule was introduced in the re-issue of UAX #29 for Unicode 11 (http://www.unicode.org/reports/tr29/tr29-33.html#Modifications).
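
Since WB3d is new in Unicode 11, output from a segmenter that follows the older rules could in principle be post-processed to match it. The sketch below is only an illustration under stated assumptions: `is_wseg_space` and `merge_wseg_space` are hypothetical helpers, not part of either crate, the code point list is my reading of the WSegSpace entries in WordBreakProperty.txt, and the segments are assumed to be consecutive slices that together cover `text` exactly.

```rust
/// Returns true for code points assumed to have Word_Break = WSegSpace
/// (space separators other than the non-breaking ones U+00A0, U+2007, U+202F).
fn is_wseg_space(c: char) -> bool {
    matches!(
        c,
        '\u{0020}'
            | '\u{1680}'
            | '\u{2000}'..='\u{2006}'
            | '\u{2008}'..='\u{200A}'
            | '\u{205F}'
            | '\u{3000}'
    )
}

/// Merges adjacent segments that consist only of WSegSpace characters.
/// Assumes `segments` are consecutive slices that concatenate to `text`.
fn merge_wseg_space<'a>(text: &'a str, segments: &[&str]) -> Vec<&'a str> {
    let mut out = Vec::new();
    let mut pos = 0; // byte offset of the current segment within `text`
    let mut run_start: Option<usize> = None; // start of a pending whitespace run

    for seg in segments {
        let end = pos + seg.len();
        let is_space = !seg.is_empty() && seg.chars().all(is_wseg_space);
        match (is_space, run_start) {
            // First whitespace segment of a run: remember where it starts.
            (true, None) => run_start = Some(pos),
            // Run continues: nothing to emit yet.
            (true, Some(_)) => {}
            // Run ended: emit it as one segment, then the current segment.
            (false, Some(start)) => {
                out.push(&text[start..pos]);
                run_start = None;
                out.push(&text[pos..end]);
            }
            // Ordinary segment outside a run.
            (false, None) => out.push(&text[pos..end]),
        }
        pos = end;
    }
    // Flush a whitespace run that reaches the end of the text.
    if let Some(start) = run_start {
        out.push(&text[start..pos]);
    }
    out
}
```

For the example from the crate documentation, this merges the two single-space segments before "fox" into one two-space segment and leaves the other segments unchanged. A segment such as a space followed by a combining mark is not purely WSegSpace, so it is left alone, which is consistent with WB4 being applied after WB3d.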