marsara9 / lemmy-search

An enhanced search engine just for Lemmy/Fediverse
https://www.search-lemmy.com
GNU Affero General Public License v3.0

Make this opt-in #9

Open mxamber opened 1 year ago

mxamber commented 1 year ago

Every few weeks, someone else has the glorious idea of indexing the entire fediverse, and every few weeks, we have to debate the issue over again.

Lack of searchability is a feature for many people (and in fact intended as such in Mastodon and its derivatives). People are migrating to the fediverse to escape corporate ecosystems where their data is harvested all the time. For many of them, the lack of global search is also an anti-harassment feature: it prevents the Twitter-esque harassment wherein people search for marginalised communities to torpedo their activism, or just their plain survival, by means of trolling, doxxing, etc. Several people have tried to spin up search engines for the fediverse, and that has almost always ended with such instances being blocked widely across the network (and many users don't care to differentiate between a search engine and a scraper when one can easily be used as the other).

Of course, if a server wants to be searchable, that's their decision to make (and it makes sense for Lemmy and Kbin, being Redditlikes). But it's not a decision anyone can make for the entirety of the network. On those grounds, any such mechanism absolutely has to operate on an opt-in basis.

marsara9 commented 1 year ago

The problem is that, because of the nature of the Fediverse, I can flag certain instances as not being indexable (e.g. via robots.txt), but since those same posts may be shared with other instances that are searchable, there's nothing preventing those posts from still being found. Flagging would only prevent users from choosing a Mastodon server as their preferred server.

Even with the primitive search capability that exists in Lemmy today, I can already find posts that were made from Mastodon. So even today the only way to prevent Mastodon content from being indexed is to only have Mastodon servers federate with other Mastodon servers.

I'll leave this issue open for now, but I don't see any way to prevent Mastodon content from showing up here unless every Mastodon server chooses to only federate with other Mastodon servers.

marsara9 commented 1 year ago

I'll be testing this tonight, but I did add: https://github.com/marsara9/lemmy-search/pull/10. So if a given instance disallows lemmy-search's user-agent (e.g. in its robots.txt), then the crawler won't crawl that website.
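For anyone who wants to see what such a check could look like, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not the code from the PR: the user-agent token `lemmy-search`, the helper name `is_crawl_allowed`, and the example hostname are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent token; the token the real crawler sends may differ.
CRAWLER_USER_AGENT = "lemmy-search"


def is_crawl_allowed(robots_txt: str, url: str,
                     user_agent: str = CRAWLER_USER_AGENT) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as an iterable of lines,
    # so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# An instance that opts out of this crawler only.
OPTED_OUT = """\
User-agent: lemmy-search
Disallow: /
"""

# An instance that allows every crawler.
OPEN_TO_ALL = """\
User-agent: *
Allow: /
"""

blocked = is_crawl_allowed(OPTED_OUT, "https://instance-a.example/post/1")
allowed = is_crawl_allowed(OPEN_TO_ALL, "https://instance-a.example/post/1")
```

An instance admin would only need the two-line `OPTED_OUT` rule in their robots.txt for a crawler that honours it to skip the site.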

Now, this still doesn't prevent the content from being indexed. It only prevents it from being indexed via that particular instance. So if A and B are federating with each other, where A is blocking the crawler but B isn't, then the crawler can still find the content on B and index it there. What this will do is keep said instance out of the preferred-instance dropdown, so users will have to do some extra work if they want to open said link on A rather than B.
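To make the A/B scenario concrete, here is a toy model of it in Python. It is only a sketch of the reasoning above, not the crawler's actual data model: the `Instance` class and both helper functions are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical stand-in for a federated server."""
    name: str
    blocks_crawler: bool                 # True if the instance opted out
    post_ids: set = field(default_factory=set)  # posts visible on it


def crawlable_sources(post_id: str, instances: list) -> list:
    """Names of instances where the crawler may still pick up the post."""
    return [
        inst.name
        for inst in instances
        if post_id in inst.post_ids and not inst.blocks_crawler
    ]


def preferred_instance_options(instances: list) -> list:
    """Instances that would be offered in the preferred-instance dropdown."""
    return [inst.name for inst in instances if not inst.blocks_crawler]


# A blocks the crawler, B does not; the same post is federated to both.
a = Instance("A", blocks_crawler=True, post_ids={"post-1"})
b = Instance("B", blocks_crawler=False, post_ids={"post-1"})

sources = crawlable_sources("post-1", [a, b])
dropdown = preferred_instance_options([a, b])
```

Here `sources` contains only B, so the post is still indexed through B's copy, while `dropdown` omits A.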