iscc / iscc-specs

ISCC: International Standard Content Code
http://iscc.codes

Text normalization should not concatenate words separated by LF/CR #27

Closed titusz closed 5 years ago

titusz commented 6 years ago

Currently, text_normalize("Hello\nWorld") yields "HelloWorld". Line feeds (LF) and carriage returns (CR) are filtered out because they are Unicode characters in the "Other, Control" (Cc) category, so the words on either side of them get concatenated. Text normalization should instead preserve word boundaries by putting a space where such a character separated two words.
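To illustrate the difference, here is a minimal sketch in Python. It is not the actual text_normalize implementation from iscc-specs; both function names are hypothetical, and the first one only assumes that the current behavior amounts to dropping every character in the Cc category. The second shows one possible fix: map control characters to spaces, then collapse runs of whitespace.

```python
import unicodedata


def normalize_dropping_controls(text):
    """Hypothetical stand-in for the reported behavior:
    every character in category Cc is removed outright."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


def normalize_keeping_boundaries(text):
    """Hypothetical fix: turn control characters (LF, CR, TAB, ...)
    into spaces first, then collapse whitespace runs to a single space."""
    spaced = "".join(
        " " if unicodedata.category(ch) == "Cc" else ch for ch in text
    )
    # str.split() without arguments splits on any whitespace run
    # and discards leading/trailing whitespace.
    return " ".join(spaced.split())


print(normalize_dropping_controls("Hello\nWorld"))     # HelloWorld
print(normalize_keeping_boundaries("Hello\nWorld"))    # Hello World
print(normalize_keeping_boundaries("Hello\r\nWorld"))  # Hello World
```

Collapsing whitespace after the substitution keeps a CR+LF pair from producing two spaces. Note that this sketch treats every Cc character as a word separator, which may be broader than what the specification should ultimately require.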

See also: http://www.unicode.org/reports/tr29/tr29-29.html#Word_Boundaries