superseriousbusiness / gotosocial

Fast, fun, small ActivityPub server.
https://docs.gotosocial.org
GNU Affero General Public License v3.0

[bug] Whole-word filters only work correctly with ASCII text #3299

Open VyrCossont opened 1 week ago

VyrCossont commented 1 week ago

Go's regexp package documents \b as an ASCII-only word boundary: non-ASCII letters count as non-word characters. As a result, our whole-word filters can miss or falsely match words that contain non-ASCII text.
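A minimal sketch of the behavior, for anyone who wants to reproduce it. The helper name `asciiWholeWord` and the way it builds the pattern are assumptions for illustration only, not necessarily how GoToSocial constructs its filter regexes:

```go
package main

import (
	"fmt"
	"regexp"
)

// asciiWholeWord is a hypothetical helper: it wraps a phrase in \b anchors,
// which Go's regexp package treats as ASCII word boundaries.
func asciiWholeWord(phrase string) *regexp.Regexp {
	return regexp.MustCompile(`(?i)\b` + regexp.QuoteMeta(phrase) + `\b`)
}

func main() {
	text := "une école ici"

	// Pure ASCII word: works as expected.
	fmt.Println(asciiWholeWord("cafe").MatchString("a cafe nearby")) // true

	// Missed match: "é" is not an ASCII word character, so there is
	// no \b between the preceding space and "é".
	fmt.Println(asciiWholeWord("école").MatchString(text)) // false

	// False match: \b sees a boundary between "é" and "c",
	// so "cole" matches as if it were a whole word.
	fmt.Println(asciiWholeWord("cole").MatchString(text)) // true
}
```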

UTR #18 has some guidance for this. We might be able to reach what it calls "Level 1" or "Level 2" word boundary support by replacing \b with more comprehensive patterns built from the Unicode features that Go can match on. "Level 3" might be too much work:

"Semantic analysis may be required for correct word-break in languages that don't require spaces, such as Thai, Japanese, Chinese or Korean. This can require fairly sophisticated support if Level 3 word boundary detection is required, and usually requires drawing on platform OS services."
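As a rough sketch of what a \b replacement could look like: Go's regexp has no lookaround, so one option is to match an explicit non-word character (or start/end of text) on each side of the phrase. The `wordChar` class below is an assumption that only approximates the simple word-character definition in UTR #18, and `unicodeWholeWord` is a hypothetical name, not an existing GoToSocial function:

```go
package main

import (
	"fmt"
	"regexp"
)

// wordChar is an assumed approximation of a Unicode word character:
// letters, marks, numbers, and connector punctuation (e.g. "_").
const wordChar = `\p{L}\p{M}\p{N}\p{Pc}`

// unicodeWholeWord is a hypothetical helper. It requires start of text or a
// non-word character before the phrase, and end of text or a non-word
// character after it. The phrase itself is captured in group 1.
func unicodeWholeWord(phrase string) *regexp.Regexp {
	return regexp.MustCompile(
		`(?i)(?:^|[^` + wordChar + `])(` + regexp.QuoteMeta(phrase) + `)(?:$|[^` + wordChar + `])`,
	)
}

func main() {
	text := "une école ici"

	fmt.Println(unicodeWholeWord("école").MatchString(text))     // true
	fmt.Println(unicodeWholeWord("cole").MatchString(text))      // false
	fmt.Println(unicodeWholeWord("école").MatchString("écoles")) // false
}
```

One caveat with this approach: the boundary characters are consumed by the match, so back-to-back occurrences separated by a single character may not all be found by FindAll-style calls. It also does nothing for languages without spaces between words, which is the "Level 3" problem quoted above.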

Discovered while investigating #3128.

tsmethurst commented 6 days ago

Well that's unfortunate. Thanks for investigating + writing this up!