superseriousbusiness / gotosocial

Fast, fun, small ActivityPub server.
https://docs.gotosocial.org
GNU Affero General Public License v3.0

[bug] Filters not always filtering (boosts) #3128

Open mirabilos opened 1 month ago

mirabilos commented 1 month ago

Describe the bug with a clear and concise description of what the bug is.

I’m seeing posts containing a filtered hashtag in FediText and Phanpy, but not in Semaphore. Vyr asked me to report it as server issue. Details below.

What's your GoToSocial Version?

0.16.0-rc3+git-db80361 🦥

GoToSocial Arch

amd64 binary

What happened?

No response

What you expected to happen?

No response

How to reproduce it?

No response

Anything else we need to know?

The post in question is:

gotosocial=> \pset null '\\N'
Null display is "\N".
gotosocial=> \pset pager 0
Pager usage is off.
gotosocial=> SELECT * FROM statuses WHERE id='01J3AEHK78GWZB0F6SDQFY0SFE';
             id             |       created_at       |          updated_at           |                                  uri                                  |                           url                           |                                                                                                        content                                                                                                         |         attachments          |             tags             | mentions | emojis | local |         account_id         |                account_uri                | in_reply_to_id | in_reply_to_uri | in_reply_to_account_id | boost_of_id | boost_of_account_id | content_warning | visibility | sensitive | language | created_with_application_id | activity_streams_type | text | federated | boostable | replyable | likeable | pinned_at |          fetched_at           | poll_id | thread_id 
----------------------------+------------------------+-------------------------------+-----------------------------------------------------------------------+---------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+------------------------------+----------+--------+-------+----------------------------+-------------------------------------------+----------------+-----------------+------------------------+-------------+---------------------+-----------------+------------+-----------+----------+-----------------------------+-----------------------+------+-----------+-----------+-----------+----------+-----------+-------------------------------+---------+-----------
 01J3AEHK78GWZB0F6SDQFY0SFE | 2024-07-21 10:49:05+00 | 2024-07-21 21:47:42.978644+00 | https://dresden.network/users/AusderPampa/statuses/112824087059235666 | https://dresden.network/@AusderPampa/112824087059235666 | <p>Der Hund weiß wie Erfrischung geht.</p><p><a href="https://dresden.network/tags/dogsofmastodon" class="mention hashtag" rel="tag nofollow noreferrer noopener" target="_blank">#<span>dogsofmastodon</span></a></p> | {01J3B0G5HJNZTNV00BZ512FT04} | {01H7XJZY83AR4H0A2V3RCH9SXF} | {}       | {}     | f     | 01CZYDR081QN3ZJ2HYW2YB2E0A | https://dresden.network/users/AusderPampa | \N             | \N              | \N                     | \N          | \N                  | \N              | public     | f         | de       | \N                          | Note                  |      | t         | t         | t         | t        | \N        | 2024-07-21 21:47:42.978637+00 | \N      | \N
(1 row)

I’m not following the poster directly, I got it as boost into my home timeline:

gotosocial=> SELECT * FROM statuses WHERE boost_of_id='01J3AEHK78GWZB0F6SDQFY0SFE';
             id             |       created_at       |       updated_at       |                                       uri                                       | url |                                                                                                        content                                                                                                         | attachments | tags | mentions | emojis | local |         account_id         |                account_uri                 | in_reply_to_id | in_reply_to_uri | in_reply_to_account_id |        boost_of_id         |    boost_of_account_id     | content_warning | visibility | sensitive | language | created_with_application_id | activity_streams_type | text | federated | boostable | replyable | likeable | pinned_at | fetched_at | poll_id | thread_id 
----------------------------+------------------------+------------------------+---------------------------------------------------------------------------------+-----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+------+----------+--------+-------+----------------------------+--------------------------------------------+----------------+-----------------+------------------------+----------------------------+----------------------------+-----------------+------------+-----------+----------+-----------------------------+-----------------------+------+-----------+-----------+-----------+----------+-----------+------------+---------+-----------
 01J3B0G2184SYD017E929YAX2Y | 2024-07-21 16:02:49+00 | 2024-07-21 16:02:49+00 | https://mastodon.social/users/WanabeSelkie/statuses/112825320680331834/activity | \N  | <p>Der Hund weiß wie Erfrischung geht.</p><p><a href="https://dresden.network/tags/dogsofmastodon" class="mention hashtag" rel="tag nofollow noreferrer noopener" target="_blank">#<span>dogsofmastodon</span></a></p> | {}          | {}   | {}       | {}     | f     | 01H8ARHCYFZE29XWTM3PSN2VQ7 | https://mastodon.social/users/WanabeSelkie | \N             | \N              | \N                     | 01J3AEHK78GWZB0F6SDQFY0SFE | 01CZYDR081QN3ZJ2HYW2YB2E0A | \N              | public     | f         | de       | \N                          | Note                  |      | t         | t         | t         | t        | \N        | \N         | \N      | \N
(1 row)

I do follow the person who boosted it.

I’ve set a filter for the whole word on the home timeline only.

Semaphore filter setup dialogue

gotosocial=> SELECT * FROM filters WHERE title='#dogsofmastodon';
             id             |          created_at           |          updated_at           | expires_at |         account_id         |      title      | action | context_home | context_notifications | context_public | context_thread | context_account 
----------------------------+-------------------------------+-------------------------------+------------+----------------------------+-----------------+--------+--------------+-----------------------+----------------+----------------+-----------------
 01HTFR665X440K76C9MZZXQVXM | 2024-04-02 15:52:32.701876+00 | 2024-04-02 15:52:32.701876+00 | \N         | 01GS55SW7VWYN3BFRYCX5NW526 | #dogsofmastodon | warn   | t            | f                     | f              | f              | f
(1 row)

I do know that this kind of filter usually works: I follow a person directly who posts cat pictures and sometimes dog pictures, and when I go to their account profile in Semaphore and FediText, I see the dog-related posts, but they do not show up on the home timeline in either client, so I assume it has to do with the post being boosted while I don’t follow the authoring account.

This reminds me of https://github.com/superseriousbusiness/gotosocial/issues/2911, which was pretty much the same but for v1 filters, with the same effect (boosted posts visible in FediText but not Semaphore). So FediText and GoToSocial probably have the same or a related bug (probable, since it has pretty much been written by the same developer, who does amazing work; thank you, Vyr).

Another thing I’ve noted (from v1 filters) is that I have to put the entire hashtag, including the octothorpe (#), into the filter; otherwise it tends not to filter out everything. I noticed the extra <span> after the # in the DB HTML, but that’s not a problem for the other posts. (As a user, I would expect filters to look at the equivalent of what Python BeautifulSoup’s get_text() method returns, i.e. a concatenation of the content of all text nodes in the document, without any element or attribute nodes.)
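A minimal sketch of that expectation, using only Python's standard library rather than BeautifulSoup itself. The names `TextExtractor` and `get_text` are illustrative and are not taken from GoToSocial or any client; the HTML is the `content` value from the first query above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only text nodes, ignoring all elements and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of character data between tags.
        self.chunks.append(data)


def get_text(html: str) -> str:
    """Concatenate the content of all text nodes (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


content = (
    '<p>Der Hund weiß wie Erfrischung geht.</p><p>'
    '<a href="https://dresden.network/tags/dogsofmastodon" '
    'class="mention hashtag" rel="tag nofollow noreferrer noopener" '
    'target="_blank">#<span>dogsofmastodon</span></a></p>'
)
plain = get_text(content)
print(plain)
```

In this flattened text the `#` and the tag name are adjacent again, so a keyword such as `#dogsofmastodon` would be found as a contiguous string despite the `<span>`. Note that plain concatenation puts no separator between the two paragraphs.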

VyrCossont commented 1 month ago

@mirabilos thanks for collecting all the details. Investigating.

VyrCossont commented 1 month ago

We can rule out whole-word filter anchoring and HTML to plain text conversion as possibilities, but I'll add the tests I wrote while investigating to enforce that filters starting with # work as expected.
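To illustrate why whole-word anchoring is worth testing for keywords that start with `#`, here is a hedged sketch in Python. It is not GtS's actual matching code; `keyword_pattern` is a hypothetical helper that applies a word-boundary anchor only on the side of the keyword that begins or ends with a word character, which is one common way to handle this.

```python
import re


def keyword_pattern(keyword: str, whole_word: bool) -> "re.Pattern[str]":
    """Build a case-insensitive pattern for a filter keyword (illustrative)."""
    escaped = re.escape(keyword)
    if whole_word:
        # \b only makes sense next to a word character; "#" is not one,
        # so anchoring blindly on both sides would miss most hashtags.
        prefix = r"\b" if re.match(r"\w", keyword) else ""
        suffix = r"\b" if re.search(r"\w\Z", keyword) else ""
        escaped = prefix + escaped + suffix
    return re.compile(escaped, re.IGNORECASE)


text = "Der Hund weiß wie Erfrischung geht.#dogsofmastodon"

# Naive anchoring on both sides: no word boundary exists between "." and "#".
naive = re.compile(r"\b" + re.escape("#dogsofmastodon") + r"\b", re.IGNORECASE)

print(bool(naive.search(text)))
print(bool(keyword_pattern("#dogsofmastodon", True).search(text)))
```

With the naive pattern the hashtag is missed whenever the `#` follows a space or punctuation, while the edge-aware pattern still rejects longer tags such as `#dogsofmastodonfans`.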

It looked like the real problem might be the boost: Mastodon's API propagates the filtered value (the list of matched filters and which keywords matched) or the timeline removal of an original post to any boosts of that post, but GtS's API only propagates the timeline removal, so boosts of posts that would have been shown with a filter warning (non-empty filtered) don't necessarily have a filtered value themselves. Fortunately, this is a one-line change to match the Mastodon behavior. I'm also adding tests to ensure that filters match Mastodon API behavior for boosts.
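The difference can be sketched roughly as follows. This is a simplified Python model, not GtS or Mastodon code: the `Status` and `FilterResult` types, the `apply_warn_filters` function, and the plain substring matching are all assumptions made for illustration, and the boost wrapper is assumed to carry no text of its own.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FilterResult:
    filter_title: str
    keyword_matches: List[str]


@dataclass
class Status:
    id: str
    text: str
    boost_of: Optional["Status"] = None
    filtered: List[FilterResult] = field(default_factory=list)


def match_filters(text: str, filters: Dict[str, List[str]]) -> List[FilterResult]:
    """Simplified matching: case-insensitive substring search per keyword."""
    lowered = text.lower()
    results = []
    for title, keywords in filters.items():
        hits = [kw for kw in keywords if kw.lower() in lowered]
        if hits:
            results.append(FilterResult(title, hits))
    return results


def apply_warn_filters(status: Status, filters: Dict[str, List[str]],
                       propagate: bool) -> None:
    """Fill in `filtered`; optionally copy the original's result to the boost."""
    status.filtered = match_filters(status.text, filters)
    if status.boost_of is not None:
        original = status.boost_of
        original.filtered = match_filters(original.text, filters)
        if propagate:
            # The behavior described for Mastodon's API: the boost wrapper
            # carries the same filter matches as the boosted post.
            status.filtered = list(original.filtered)


filters = {"#dogsofmastodon": ["#dogsofmastodon"]}
original = Status("01J3AEHK78GWZB0F6SDQFY0SFE",
                  "Der Hund weiß wie Erfrischung geht. #dogsofmastodon")
boost = Status("01J3B0G2184SYD017E929YAX2Y", "", boost_of=original)

apply_warn_filters(boost, filters, propagate=False)
without_propagation = list(boost.filtered)

apply_warn_filters(boost, filters, propagate=True)
with_propagation = list(boost.filtered)
```

Without propagation, a client that only inspects the boost wrapper's filtered value sees nothing to warn about; with propagation, the wrapper reports the same match as the boosted post.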

However, here's where it gets weird:

  1. Boosts as stored in GtS's DB and served by the GtS API have a non-empty content, duplicating the original post's, and correspondingly have the same filtered as the original post anyway (outside of the aforementioned tests)! This happens even when the actual Announce activity doesn't have any content, which is the normal case. I'm not sure why we're doing this, since it duplicates information and takes up extra DB space, but it happens in EnrichAnnounce (remote) and StatusToBoost (local). Mastodon and Akkoma do not appear to do this; don't know about other implementations.
  2. Phanpy is another filters v2 client, and Phanpy does work as expected, successfully filtering posts boosted into my home timeline.

So given that (1) we're sending the right information by accident and (2) Phanpy works but Feditext doesn't, what we have here is probably actually a Feditext bug, although it's uncovered some surprising GtS behavior.

mirabilos commented 1 month ago

Vyr Cossont dixit:

  2. Phanpy is another filters v2 client, and Phanpy does work as expected, successfully filtering posts boosted into my home timeline.

Interesting, because it doesn’t do that for me.

I created the filter using the v1 API, when it first became available, in case that’s relevant.

VyrCossont commented 1 month ago

Hmm. I don't have an explanation for that. v1 filters have always been represented as v2 filters internally to GtS, so I don't think that part is relevant.

VyrCossont commented 1 month ago

@mirabilos I have not been able to replicate this on main. Do you have any other examples you can share?

mirabilos commented 1 month ago

Vyr Cossont dixit:

@mirabilos I have not been able to replicate this on main. Do you have any other examples you can share?

Not right now… but you were able to reproduce it on 0.16?

Should I upgrade to main?

I guess I’ll need to retest, anyway… it MIGHT also have been fixed in latest FediText, for some reason?

VyrCossont commented 1 month ago

I'm pretty sure the Feditext bug was the sole cause of the behavior you experienced in Feditext. I'm less sure about what you reported with Phanpy — it's definitely not filtering boosts?

mirabilos commented 1 month ago

Vyr Cossont dixit:

I'm less sure about what you reported with Phanpy — it's definitely not filtering boosts?

Unsure about all of them, but I saw the post in question containing the filtered word in Phanpy.

mirabilos commented 1 month ago

Hmm. The current status (with GtS 0.16.0-rc3 and the latest FediText) is this: of the boosted posts where I don’t follow the original poster, those matching a filter without # (e.g. I blocked just the word Trump) are reliably filtered out (and shown with that struck-through eye), while those matching a filter with # (e.g. #dogs) are not filtered.

mirabilos commented 3 weeks ago

OK, now it gets crazy.

Still on 0.16.0-rc3, on the home timeline, I see several posts tagged #TLDI despite having that filtered, both from accounts I directly follow (this is new) and from boosts (this is not new). In the middle of them, there is one marked with the slashed-through eye, which is no different from the others: a line of text and four or so hashtags.

It appears that filtering does not always work reliably.

In Semaphore, none of the posts are showing.