hypothesis / via

Proxies third-party PDF files and HTML pages with the Hypothesis client embedded, so you can annotate them
https://via.hypothes.is/
BSD 2-Clause "Simplified" License

Add an update-able blocklist file for Via HTML #239

Closed jon-betts closed 3 years ago

jon-betts commented 4 years ago

We probably need something approximating the Via 1 block list feature in Via HTML.

It's probably not going to be exactly the same, as video sites should no longer be a problem.

Some questions:

Notes

From a chat with Sean, it sounds like the large list of porn sites etc. may have been included as part of loading a generic "bad sites" list, which contained lots of nefarious sites along with actively hostile ones. This likely served more than one purpose, as porn sites are likely to have video content, which caused us problems in the past.

Blocking based on content type seems close to an editorial decision we don't really want to make, or at least is likely a separate feature.

The aims of this tech are more likely:

robertknight commented 4 years ago

What is this list for? (Performance, porn blocker?)

Historically, there have been two major concerns:

Of the above, (1) was a far bigger issue. In particular, users were apparently using Via to circumvent firewalls to stream videos, play HTML games or, yes, stream porn. The existing blocklist in S3 was compiled from two sources: sites we ended up blocking manually via Cloudflare when they caused operational issues, and a sample of web logs from Via, from which we blocked the most frequently trafficked sites which were a) "not legitimate uses of an annotation proxy" and b)

e.g. Is it our place to choose what content people access through Via?

Again, the current blocklist exists primarily for operational reasons; we're not worried about users annotating the naughty parts of Sons and Lovers.

Can we ever have any reasonable hope of keeping up with whatever we choose?

Since the blocker exists primarily to ensure uptime for legitimate users, rather than to prevent users accessing certain content for, say, moral/censorship reasons, it is OK if we don't block all sites deemed "not legitimate use", as long as we can manage the traffic we have from an operational/cost perspective.

jon-betts commented 4 years ago

Just had a chat with @robertknight about the block list to cover off the motivation and maintenance. A quick summary of that chat:

Why we block

There are lots of legitimate reasons you might want a block list, but we only have one for a few:

The content we block

We therefore block content for a few different reasons:

We generally block sites when they cause us a problem, not speculatively, and there aren't many publisher requests. This means the file is relatively static, though we still want to retain the ability to change it quickly.

The sites we've blocked for causing trouble in the past have often been pornography etc., but the critical part here is that we blocked them based on an operational issue, not an editorial one. If someone genuinely wants to annotate that material, we don't actually have an issue with that; we just can't always afford the speed and bandwidth to proxy it.

This has a few implications:

Our wording here is a bit vague and could do with some tightening up, I think. Basically I think the difference is between "You can't annotate this" and "Annotate all you like using the browser extension". We actually suggest the extension for both, so I guess I still have some outstanding questions about the publisher one.

Implementation ideas

The blocklist is served from here: https://hypothesis-via.s3-us-west-1.amazonaws.com/via-blocklist.txt

There are details about updating it here: https://stackoverflow.com/c/hypothesis/questions/102/250#250
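As a rough sketch of how a blocklist like this could be parsed and checked, the snippet below assumes a plain-text format with one "domain category" pair per line and "#" comments. That format, the category names used in the usage example, and the function names `parse_blocklist` and `block_category` are all hypothetical, not a description of the real `via-blocklist.txt`.

```python
from urllib.parse import urlparse


def parse_blocklist(text):
    """Parse blocklist text into a {domain: category} dict.

    Assumed (hypothetical) format: one "domain category" pair per line,
    with blank lines and lines starting with "#" ignored.
    """
    blocked = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        domain = parts[0].lower()
        # Entries without a category fall back to a generic label.
        category = parts[1] if len(parts) > 1 else "blocked"
        blocked[domain] = category
    return blocked


def block_category(url, blocked):
    """Return the category a URL is blocked under, or None if allowed.

    A URL matches if its host equals a listed domain or is a subdomain of one.
    """
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # Check "a.b.example.com", then "b.example.com", then "example.com", ...
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in blocked:
            return blocked[candidate]
    return None
```

With a list containing `example.com operational`, `block_category("https://video.example.com/watch", blocked)` would return `"operational"`. Returning the category rather than a bare yes/no is one way to show different messages for operational blocks and publisher requests.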

Pie in the sky dreaming

It would be amazing if we kept stats about what sites cost us the most bandwidth so we can be a bit more evidence based. This would fit really neatly in a Graphite/Grafana style world, but we don't currently have something like that.

It might be interesting to see if @indigobravo has any ideas about a cheap and cheesy way to collect this info without re-inventing the wheel and creating our own crap version of a monitoring and metrics system.

There are other stats that would probably be nice to know too.
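For illustration only, the per-site bandwidth idea could start as a simple tally over log records. The sketch below assumes each record is a `(proxied_url, bytes_sent)` pair; that shape and the function names are made up here and do not reflect any real Via log format or metrics system.

```python
from collections import Counter
from urllib.parse import urlparse


def bandwidth_by_site(records):
    """Sum bytes served per proxied host.

    `records` is an iterable of (proxied_url, bytes_sent) pairs; this shape
    is an assumption, not the format of any real Via log.
    """
    totals = Counter()
    for url, bytes_sent in records:
        host = (urlparse(url).hostname or "unknown").lower()
        totals[host] += bytes_sent
    return totals


def top_sites(records, n=10):
    """Return the n hosts with the most bytes served, largest first."""
    return bandwidth_by_site(records).most_common(n)
```

A periodic job could run `top_sites` over a sample of recent logs and surface the heaviest hosts as candidates for review, without needing a full metrics stack.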

jon-betts commented 3 years ago

Done! But the blocklist still needs to be edited to add the new categories.