superseriousbusiness / gotosocial

Fast, fun, small ActivityPub server.
https://docs.gotosocial.org
GNU Affero General Public License v3.0

[bug] `SanitizeToPlaintext` concatenates text in unexpected ways #3298

VyrCossont closed 1 week ago

VyrCossont commented 2 weeks ago

For example, `<p>as</p><p>df</p>` sanitizes to `asdf`. This would cause a false positive for a filter with the keyword `asdf`, and a false negative for a whole-word filter with the keyword `df`. I'd expect output more like `as\ndf\n`.
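To make the two failure modes concrete, here is a minimal sketch. The helpers `matchesSubstring` and `matchesWholeWord` are hypothetical stand-ins written for this illustration, not GoToSocial's actual filter-matching code; the whole-word check is assumed to behave roughly like a `\b`-delimited regular expression.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// matchesSubstring reports whether keyword appears anywhere in text.
// Hypothetical helper, not GoToSocial's real filter logic.
func matchesSubstring(text, keyword string) bool {
	return strings.Contains(text, keyword)
}

// matchesWholeWord reports whether keyword appears in text with a
// word boundary on both sides. Hypothetical helper as well.
func matchesWholeWord(text, keyword string) bool {
	re := regexp.MustCompile(`\b` + regexp.QuoteMeta(keyword) + `\b`)
	return re.MatchString(text)
}

func main() {
	current := "asdf"      // reported output for <p>as</p><p>df</p>
	expected := "as\ndf\n" // output suggested in this issue

	fmt.Println(matchesSubstring(current, "asdf"))  // true  (false positive)
	fmt.Println(matchesWholeWord(current, "df"))    // false (false negative)
	fmt.Println(matchesSubstring(expected, "asdf")) // false
	fmt.Println(matchesWholeWord(expected, "df"))   // true
}
```

With the newline-separated output, both filters behave the way a reader of the original HTML would expect.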

Likewise, `<br>` tags are dropped, not converted to `\n`. Guessing the same holds for `<wbr>`, and any other tags closely equivalent to characters. (Fun exercise: what would we expect from `<hr>`?)
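A rough sketch of the behaviour I have in mind, using only the standard library. This is not how BlueMonday works and not a proposed patch: `toPlaintext` and the `blockTags` set are hypothetical names made up for this example, and the scanner is deliberately naive (it does not handle comments, `<script>` contents, or a `>` inside an attribute value). It leaves `<wbr>` and `<hr>` as dropped tags, since the right output for those is still an open question.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// blockTags is a hypothetical, incomplete set of elements whose closing
// tag should end a line in the plaintext output.
var blockTags = map[string]bool{
	"p": true, "div": true, "li": true, "blockquote": true,
}

// toPlaintext strips tags from in, emitting "\n" for <br> and after the
// closing tag of each element in blockTags. Illustrative sketch only.
func toPlaintext(in string) string {
	var out strings.Builder
	for len(in) > 0 {
		lt := strings.IndexByte(in, '<')
		if lt < 0 {
			// No more tags: the rest is text.
			out.WriteString(html.UnescapeString(in))
			break
		}
		out.WriteString(html.UnescapeString(in[:lt]))

		gt := strings.IndexByte(in[lt:], '>')
		if gt < 0 {
			// Unterminated tag: drop the remainder.
			break
		}
		tag := in[lt+1 : lt+gt]
		in = in[lt+gt+1:]

		closing := strings.HasPrefix(tag, "/")
		name := strings.ToLower(strings.Trim(tag, "/ "))
		if i := strings.IndexAny(name, " \t\n/"); i >= 0 {
			name = name[:i] // discard attributes
		}

		if name == "br" || (closing && blockTags[name]) {
			out.WriteByte('\n')
		}
	}
	return out.String()
}

func main() {
	fmt.Printf("%q\n", toPlaintext("<p>as</p><p>df</p>")) // "as\ndf\n"
	fmt.Printf("%q\n", toPlaintext("as<br>df"))           // "as\ndf"
}
```

A real fix would presumably want a proper HTML tokenizer rather than string scanning; the point here is only the mapping from tags to characters.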

Not familiar with the BlueMonday sanitizer we use, so not sure how hard this would be to fix.

Discovered while investigating #3128.

tsmethurst commented 1 week ago

Mmm, maybe instead of BlueMonday we can use some other method of converting HTML to plaintext for the purposes of matching against filters. In the frontend we now use https://www.npmjs.com/package/html-to-text for converting the HTML representation of statuses into text for showing in certain places. If there's a Go equivalent we could look at that, perhaps.

tsmethurst commented 1 week ago

Something like this -- https://pkg.go.dev/github.com/k3a/html2text -- or this -- https://github.com/jaytaylor/html2text -- perhaps? But then if these operations are expensive (and I'd imagine they're not as cheap as BlueMonday) we probably also want to be storing those results in the `text` field of the `*gtsmodel.Status` model, i.e., in the database. Not 100% sure.