tedivm / fedimapper

An API for the Fediverse - The Software behind the Fediverse Almanac
https://www.fediversealmanac.com
MIT License

Scraper should be opt-in #7

Closed · l3gacyb3ta closed this issue 1 year ago

l3gacyb3ta commented 1 year ago

The fediverse has a unique culture around scrapers, and I believe you should probably make this opt-in if you don't want to start getting bashed all over fedi.

nahga commented 1 year ago

This is a bad idea overall. Nominating yourself to curate and collect data that no one asked you to collect will probably not go over well.

tedivm commented 1 year ago

I actually was asked about this. This project started because of an abuse issue multiple instances were having where their blocks were being evaded by proxy instances, so I wrote them a tool to help them identify those systems. I've worked with multiple instance admins throughout this process to ensure that user privacy is respected, and have made sure that this system can easily be opted out of by instances who don't want to be crawled.

I'm taking all feedback seriously and have made several changes based on that feedback. This tool is meant to help instance admins out, but if it causes damage in the process it'll get taken down.

tedivm commented 1 year ago

Here's some context on the original abuse issue that led to this project being created.

nahga commented 1 year ago

Regardless of intention, all of this should be opt-in by default. Not the other way around.

l3gacyb3ta commented 1 year ago

Yeah, generally people hate these tools, even when they're made with good or neutral intentions. I'm also very wary of tools like this.

mothdotmonster commented 1 year ago

+1 on making this opt-in.

Freeplayg commented 1 year ago

PLEASE. Opt-in.

Instances have already blocked Hachyderm for this.

VyrCossont commented 1 year ago

I actually was asked about this. This project started because of an abuse issue multiple instances were having where their blocks were being evaded by proxy instances, so I wrote them a tool to help them identify those systems.

@tedivm If opting out of this tool does anything meaningful, what happens when the AP proxy software starts opting out? Conversely, why collect so much information that isn't relevant to identifying proxies by counting subdomains?

l3gacyb3ta commented 1 year ago

Using your own service, we can see that mastinator.com is widely blocked for exactly the things you are trying to do here (scraping, violating consent, going against the culture).

Seirdy commented 1 year ago

Like the KF scraper, this publicizes which instances block a given instance; doing so makes that data much more easily accessible than it was before. This encourages targeted retaliation, and thus needs to be opt-in. Your good intentions don't erase this problem and the real harm it causes.

I can believe you when you say this was meant to address real issues around blocklist-evasion, but the data it currently exposes will merely replace one threat with another. I suggest deleting existing data, turning it off, and asking Fedi for feedback before starting a project like this again. I do think it's possible to publish very limited aggregate data that doesn't enable targeted harassment, but this isn't how it's done.

bgcarlisle commented 1 year ago

You NEED to make this opt-in only

Your opt-out method is also inadequate

Not everyone who's running a Masto instance can alter their robots.txt, thanks to limitations from hosting providers

w3bb commented 1 year ago

Not everyone who's running a Masto instance can alter their robots.txt, thanks to limitations from hosting providers

So then complain to the hosting providers. Why is it everybody else's problem?

l3gacyb3ta commented 1 year ago

So then complain to the hosting providers. Why is it everybody else's problem?

Because everyone else is writing scrapers...

tedivm commented 1 year ago

Like the KF scraper, this publicizes which instances block a given instance; doing so makes that data much more easily accessible than it was before. This encourages targeted retaliation, and thus needs to be opt-in. Your good intentions don't erase this problem and the real harm it causes.

I have removed those endpoints from the service.

w3bb commented 1 year ago

Because everyone else is writing scrapers...

There is a social contract on the internet that the site owner puts a robots.txt file at the root to dictate what bots do. The internet (and, in some cases, even the law) is built around this contract. This is how the internet has worked for ages. You have a means of preventing /any/ bot from spidering your site and collecting information.

It takes one line in robots.txt, a block on the user agent, or a flip of a switch to stop advertising the information. I believe that last option should be possible even on a shared host running Mastodon, and I'd advise it if you're concerned about this information being public. There are people who do not care about robots.txt and will collect the information if it's available, and security through hoping-no-bad-people-will-ever-collect-this-information-I-plaster-all-over-the-place is a bad model.
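For anyone who wants the robots.txt route spelled out, a minimal file would look something like the sketch below. The specific user-agent token here is an assumption for illustration only; the project's README ("Block this bot") documents the actual string the crawler sends, and disallowing all agents works without knowing it.

```
# Option 1: disallow every crawler that honors robots.txt
User-agent: *
Disallow: /

# Option 2: disallow only this crawler
# (the "fedimapper" token is illustrative; check the README's "Block this bot" section)
User-agent: fedimapper
Disallow: /
```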

l3gacyb3ta commented 1 year ago

There is a social contract on the internet that the site owner puts a robots.txt file at the root to dictate what bots do.

There's also a social contract on fedi not to build scrapers lol

w3bb commented 1 year ago

There's also a social contract on fedi not to build [spiders] lol

I don't think so. There are very popular sites like fediverse.observer that spider instances and collect similar information. The sentiment that something like this is bad is not a common one in my experience; I think it's a vocal minority.

l3gacyb3ta commented 1 year ago

Well then we can disagree, but this thread suggests otherwise

bgcarlisle commented 1 year ago

The fact remains, I'm an instance admin and I CAN'T opt out

This thing NEEDS to be shut down until it is opt-in only

w3bb commented 1 year ago

I'll also add that nodeinfo is designed for tools like this.

NodeInfo is an effort to create a standardized way of exposing metadata about a server running one of the distributed social networks. The two key goals are being able to get better insights into the user base of distributed social networking and the ability to build tools that allow users to choose the best fitting software and server for their needs.
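For context on what nodeinfo actually exposes: discovery starts at a server's /.well-known/nodeinfo document, which links to a versioned schema document containing software and usage metadata. The sketch below is a minimal illustration of that flow, not fedimapper's actual code, and the hostname is a placeholder:

```python
import json
import urllib.request


def fetch_nodeinfo(host: str) -> dict:
    """Fetch a server's nodeinfo metadata via .well-known discovery."""
    # Step 1: the well-known document lists links to the versioned nodeinfo schemas.
    with urllib.request.urlopen(f"https://{host}/.well-known/nodeinfo", timeout=10) as resp:
        links = json.load(resp)["links"]
    # Step 2: follow the first advertised link to the actual metadata document.
    with urllib.request.urlopen(links[0]["href"], timeout=10) as resp:
        return json.load(resp)


# Example usage (placeholder hostname):
# info = fetch_nodeinfo("example.social")
# print(info["software"]["name"], info["software"]["version"])
# print(info.get("openRegistrations"), info.get("usage", {}).get("users", {}).get("total"))
```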

w3bb commented 1 year ago

The fact remains, I'm an instance admin and I CAN'T opt out

This thing NEEDS to be shut down until it is opt-in only

Ways of doing so, even under shared hosting, are possible as I explained earlier.

Seirdy commented 1 year ago

Since the creator's fedi account has been suspended, they might not have seen my reply so I'll copy it here since it's relevant:

Listen. People do not trust you right now. You potentially hold a ton of data and people feel unsafe merely knowing you have it. Tons of people have archived the data you exposed already and are literally going through it right now.

You need to over-correct fast, and that means shutting this down.

ThatOneCalculator commented 1 year ago

Opt in or get out.

w3bb commented 1 year ago

(This is the documentation: https://github.com/tedivm/fedimapper#block-this-bot)

I'm referring to what I said about not exposing the blocklist itself. I believe that is an option. On mastodon.social, for example, you can see the obscured domain names on the about page.

w3bb commented 1 year ago

Opt in or get out.

robots.txt allowing bots is opting in. This project respects robots.txt.
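For what it's worth, "respects robots.txt" in practice usually means a check like the one below before each fetch. This is a minimal sketch of the standard approach using Python's urllib.robotparser, not necessarily how fedimapper implements it, and the user-agent string is a placeholder:

```python
from urllib import robotparser


def allowed_to_crawl(host: str, path: str, user_agent: str = "fedimapper") -> bool:
    """Return True if the site's robots.txt permits this user agent to fetch the path."""
    parser = robotparser.RobotFileParser()
    parser.set_url(f"https://{host}/robots.txt")
    parser.read()  # downloads and parses robots.txt; a missing file means allow-all
    return parser.can_fetch(user_agent, f"https://{host}{path}")


# Example usage (placeholder hostname and user agent):
# if allowed_to_crawl("example.social", "/api/v1/instance"):
#     ...  # proceed with the request
```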

bgcarlisle commented 1 year ago

No, that's opt out, literally the opposite, and the documentation doesn't give any other options than using robots.txt

w3bb commented 1 year ago

No, that's opt out, literally the opposite, and the documentation doesn't give any other options than using robots.txt

It's an implicit opt-in. I'd advise you to read my earlier comment.

Seirdy commented 1 year ago

On Mon, Jan 09, 2023 at 01:17:09PM -0800, webb wrote:

robots.txt allowing bots is opting in

@w3bb opting-out of opting-out is not the same as opting in, because consent isn't a freaking multiplication problem.

bgcarlisle commented 1 year ago

It's an implicit opt-in

The term for that is "opt out"

w3bb commented 1 year ago

@w3bb opting-out of opting-out is not the same as opting in, because consent isn't a freaking multiplication problem.

This is how the internet has worked for ages. If people had to get manual consent for spidering, search engines would have been impossible. People should be complaining to hosting providers who can't spend fifteen minutes adding a basic option like that, not to spider operators who have no reasonable way of knowing you can't use robots.txt.

Seirdy commented 1 year ago

On Mon, Jan 09, 2023 at 01:27:05PM -0800, webb wrote:

This is how the internet has worked for ages. If people had to get manual consent for spidering, search engines would have been impossible.

That's the point. There's a reason why just about every attempt at a Fediverse search engine has been network-filtered, Fediblocked, tarpitted, and fed bad data until it shut down. Tools like search engines which aren't opt-in aren't welcome on Fedi.

-- Seirdy (https://seirdy.one)

w3bb commented 1 year ago

That's the point. There's a reason why just about every attempt at a Fediverse search engine has been network-filtered, Fediblocked, tarpitted, and fed bad data until it shut down. Tools like search engines which aren't opt-in aren't welcome on Fedi.

This is a false equivalence. Like I mentioned earlier in the thread (I assume it hasn't come through to you over email), a better comparison would be to something like fediverse.observer. This, I believe, uses a Mastodon endpoint, but the same information is exposed via nodeinfo, which is explicitly designed for tools like these; fediverse.observer also uses nodeinfo.

w3bb commented 1 year ago

(For some reason it sent it while I was typing a draft, sorry about that.)

tedivm commented 1 year ago

Just to be clear, I'm not an admin on Hachyderm. The people spreading that rumor are wrong. Hachyderm has nothing to do with this project other than me making a post there.

bgcarlisle commented 1 year ago

Did you read anything here?

That's not what anyone is discussing

tedivm commented 1 year ago

I'm reading everything that comes through, but I wanted to clear that one piece of information up.

ledlamp commented 1 year ago

Bruh, if it were opt-in, it would be useless because nobody would bother to opt in! Imagine if you had to opt in to bots on the world wide web; search engines like Google wouldn't really work, because a lot of sites don't care and don't have a robots.txt! Well, the fediverse is based on the world wide web, so the same concept applies.

tedivm commented 1 year ago

The website is down. Thank you all for your feedback.