superseriousbusiness / gotosocial

Fast, fun, small ActivityPub server.
https://docs.gotosocial.org
GNU Affero General Public License v3.0

[bug] Whole-word filters only work correctly with ASCII text #3299

Open VyrCossont opened 1 month ago

VyrCossont commented 1 month ago

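Go's regexp package documents \b as an ASCII-only word boundary: only [0-9A-Za-z_] count as word characters. Our whole-word filters rely on \b, so they can miss words that start or end with a non-ASCII letter, and they can match inside longer words that contain one.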
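To make the failure concrete, here is a minimal sketch. It assumes a whole-word filter is compiled roughly as \bword\b. The helper name asciiWholeWord is hypothetical and is not taken from the GoToSocial codebase.

```go
package main

import (
	"fmt"
	"regexp"
)

// asciiWholeWord is a hypothetical helper that builds a case-insensitive
// whole-word pattern the naive way, using Go's ASCII-only \b.
func asciiWholeWord(word string) *regexp.Regexp {
	return regexp.MustCompile(`(?i)\b` + regexp.QuoteMeta(word) + `\b`)
}

func main() {
	// 'é' is not an ASCII word character, so \b sees a boundary between
	// "caf" and "é" and the filter matches inside the longer word.
	fmt.Println(asciiWholeWord("caf").MatchString("café")) // true

	// At the start of the text, 'é' is treated as a non-word character,
	// so there is no boundary before it and the word itself is missed.
	fmt.Println(asciiWholeWord("éclair").MatchString("éclair")) // false

	// Pure ASCII text behaves as expected.
	fmt.Println(asciiWholeWord("caf").MatchString("cafe")) // false
}
```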

UTS #18 (Unicode Regular Expressions, formerly UTR #18) has some guidance for this. We might be able to achieve what they call "Level 1" or "Level 2" word boundary support by replacing \b with patterns built from the Unicode character classes that Go's regexp can match on. "Level 3" might be too much work:

Semantic analysis may be required for correct word-break in languages that don't require spaces, such as Thai, Japanese, Chinese or Korean. This can require fairly sophisticated support if Level 3 word boundary detection is required, and usually requires drawing on platform OS services.
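As a rough illustration of what a Level 1 style replacement could look like, the sketch below swaps \b for explicit "start of text or non-word character" and "end of text or non-word character" groups. It is only an approximation under stated assumptions: the word-character set uses the general categories L, M, Nd and Pc as a stand-in for the UTS #18 definition, and the names wordChars and unicodeWholeWord are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// wordChars approximates a Unicode word-character set with general
// categories that Go's regexp supports: letters, marks, decimal digits
// and connector punctuation (which includes '_'). This is an assumption,
// not the exact UTS #18 definition.
const wordChars = `\p{L}\p{M}\p{Nd}\p{Pc}`

// unicodeWholeWord is a hypothetical helper. Go's regexp has no
// lookaround, so the boundaries are expressed as groups that either
// anchor to the text edge or consume one neighbouring non-word character.
// The word itself is available as capture group 1.
func unicodeWholeWord(word string) *regexp.Regexp {
	return regexp.MustCompile(
		`(?i)(?:^|[^` + wordChars + `])(` + regexp.QuoteMeta(word) + `)(?:$|[^` + wordChars + `])`,
	)
}

func main() {
	fmt.Println(unicodeWholeWord("caf").MatchString("café"))         // false
	fmt.Println(unicodeWholeWord("éclair").MatchString("un éclair")) // true
	fmt.Println(unicodeWholeWord("caf").MatchString("a caf, please")) // true
}
```

Because the boundary groups consume a neighbouring character, repeated words separated by a single character (for example "caf caf") would only be found once by FindAll. That does not matter when the filter only needs a yes/no answer from MatchString, but it would matter for highlighting every occurrence.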

Discovered while investigating #3128.

tsmethurst commented 1 month ago

Well that's unfortunate. Thanks for investigating + writing this up!