Searching "activitypub" on Mastodon returns irrelevant results from flipboard.com

filippodb commented 3 months ago

Steps to reproduce the problem

1. Search: "activitypub"

Expected behaviour

A full list of English articles about ActivityPub

Actual behaviour

random unrelated messages from flipboard.com

Detailed description

On mastodon.uno and mastodon.social, searching for:

activitypub language:EN

returns countless unrelated posts from flipboard.com.

Mastodon instance

Mastodon.social

Mastodon version

v4.3

Browser name and version

Brave

Operating system

Fedora Linux

Technical details

It also happens when searching in other languages:

activitypub language:IT

renchap commented 3 months ago

Thanks for your report, I have been able to reproduce it. I suspect this comes from the link included in the message, as it contains "activitypub". Maybe our tokeniser when indexing into ES is splitting the URLs into separate tokens?

For example, one such link is https://techspot.com/news/102598-avast-free-antivirus-testing-features-learning-about-six.html?utm_source=flipboard&utm_medium=activitypub, and Flipboard adds the UTM parameters to every link.
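To illustrate the suspicion, here is a minimal Ruby sketch of what a word-splitting tokenizer would do to that link. `rough_tokens` is a hypothetical helper that only approximates word tokenization (lowercase, then keep runs of letters, digits, and underscores); it is not Elasticsearch's actual algorithm, so the exact token boundaries may differ.

```ruby
# Hypothetical helper: a rough stand-in for a word tokenizer.
# It lowercases the text and keeps runs of letters, digits, and underscores.
# This is an approximation, NOT what Elasticsearch really runs.
def rough_tokens(text)
  text.downcase.scan(/[[:alnum:]_]+/)
end

url = 'https://techspot.com/news/102598-avast-free-antivirus-testing-features-' \
      'learning-about-six.html?utm_source=flipboard&utm_medium=activitypub'

tokens = rough_tokens(url)

# The value of utm_medium ends up as a standalone token, so a search for
# "activitypub" can match any post that merely links through Flipboard.
puts tokens.include?('activitypub')
puts tokens.last(4).inspect
```

If the real tokenizer behaves anything like this, every Flipboard post carrying `utm_medium=activitypub` would match the query regardless of what the post is about.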

jasonculverhouse commented 3 months ago

I think you will have to strip URLs from the plain text if you don't want them to be stemmed.

Note that this will also happen if you search for amp. You will get every result that has more than one query parameter, because the separators are encoded as &amp; in the text. The standard tokenizer is going to index all of these under amp.

https://github.com/mastodon/mastodon/blob/3a7aec2807089a004db90851c66db0a007a18a48/app/chewy/statuses_index.rb/#L30-L41

I would think that one could remove the URLs from the searchable_text that is indexed in the :stemmed field.

    field(:text, type: 'text', analyzer: 'verbatim', value: ->(status) { status.searchable_text }) { field(:stemmed, type: 'text', analyzer: 'content') }
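A minimal sketch of that idea in plain Ruby, outside of Chewy. `URL_PATTERN` and `strip_urls` are hypothetical names, the regex is deliberately simple (anything starting with `http://` or `https://` up to the next whitespace), and how this would be wired into the index definition is an untested assumption.

```ruby
# Hypothetical pattern: matches http(s) URLs up to the next whitespace.
# A production version would likely need to handle trailing punctuation.
URL_PATTERN = %r{https?://\S+}

# Hypothetical helper: drop URLs and collapse the leftover spaces.
def strip_urls(text)
  text.gsub(URL_PATTERN, ' ').squeeze(' ').strip
end

post = 'Avast is testing new features ' \
       'https://example.com/article.html?utm_source=flipboard&utm_medium=activitypub ' \
       'via Flipboard'

puts strip_urls(post)

# Assumption, not tested against Mastodon: the stemmed field could be fed
# from something like strip_urls(status.searchable_text) instead of the raw text.
```

The trade-off is that words appearing only inside a URL would no longer be findable through the stemmed field, which seems to be the desired outcome here.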

There is also the html_strip character filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html

Strips HTML elements from a text and replaces HTML entities with their decoded value (e.g, replaces &amp; with &).

Might help?
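A small Ruby sketch of why decoding entities first could matter for the amp case. Both helpers are hypothetical: `rough_tokens` only approximates a word tokenizer, and `decode_amp` is a one-entity stand-in for what a real HTML-stripping filter does.

```ruby
# Hypothetical helper: rough stand-in for a word tokenizer
# (NOT Elasticsearch's actual algorithm).
def rough_tokens(text)
  text.downcase.scan(/[[:alnum:]_]+/)
end

# Hypothetical helper: decodes only &amp;, as a stand-in for full
# HTML entity decoding.
def decode_amp(text)
  text.gsub('&amp;', '&')
end

encoded = 'https://example.com/page?a=1&amp;b=2'

# Without decoding, the entity leaks an "amp" token into the index.
puts rough_tokens(encoded).include?('amp')

# After decoding, the separator is a bare & and produces no token.
puts rough_tokens(decode_amp(encoded)).include?('amp')
```

Decoding would only address the amp matches; the query parameter names and values in the URL would still be tokenized unless URLs are stripped as well.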
