superseriousbusiness / gotosocial

Fast, fun, small ActivityPub server.
https://docs.gotosocial.org
GNU Affero General Public License v3.0

[bugfix] Use better plaintext representation of status for filtering #3301

Closed tsmethurst closed 2 months ago

tsmethurst commented 2 months ago

Description

This pull request updates our filtering logic so that it no longer uses our SanitizeToPlaintext function to reduce status HTML content to plaintext. Instead it uses https://github.com/k3a/html2text, which doesn't cause weird line concatenation and properly extracts links, mentions, and hashtags from the text.

To avoid re-parsing a status from HTML every time we want to filter it, a TTLCache that stores the parsed-to-text version of each status has been added to the converter.

This also includes some minor fixes to our filter regexes, so that the whole word match also accepts whitespace and start/end of line as word boundaries.

closes https://github.com/superseriousbusiness/gotosocial/issues/3298
closes https://github.com/superseriousbusiness/gotosocial/issues/3128

Checklist

Please put an x inside each checkbox to indicate that you've read and followed it: [ ] -> [x]

If this is a documentation change, only the first checkbox must be filled (you can delete the others if you want).