Blocking Onion Services based on "Similarity Matching" on already blocked ones

fpietrosanti commented 8 years ago

Considering the issues described in #151, this ticket proposes a different approach: blocking "future onion services" based on an existing "blocking pattern", in an attempt to fight cryptolockers.

What if, for each blocked site, we took a dump of the page and passed it through a hashing scheme designed for "similarity matching", so that we could auto-block web pages that are more than 90% similar to a previously blocked page?

Without looking at the content itself, but only at how similar a page's fingerprint/pattern is to that of an already blocked web page, we would be able to block new web pages.

It does require an algorithm, implemented by some existing ready-made library, that outputs the similarity of one website compared to another.

If this exists, then once we block one cryptolocker we would be able to block all the cryptolocker landing pages of the same campaign, without entering an arms race of regexp'ing stuff: they change something, we regexp other stuff, etc.
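As a rough illustration of the "more than 90% similar" check, here is a minimal sketch using Jaccard similarity over word shingles. It is only one possible technique, not something tor2web implements; the function names, the shingle size `k` and the 0.9 threshold are assumptions made for the example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles found in a page's text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short pages: treat the whole text as a single shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets, in the range [0.0, 1.0]."""
    if not a or not b:
        # Never report an empty page as similar to anything.
        return 0.0
    return len(a & b) / len(a | b)


def is_similar(page_text, blocked_text, threshold=0.9):
    """True if the page is at least `threshold` similar to a blocked page."""
    return jaccard(shingles(page_text), shingles(blocked_text)) >= threshold
```

Note that this needs the text of the blocked page (or its shingle set) to be kept around, which is a different requirement from storing a plain hash of its URL.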

I don't know how complex it would be, but I think it's neat!

@virgil @evilaliv3 @moba @juhanurmi

evilaliv3 commented 8 years ago

I like the idea, but that would be some kind of magic! May @God code it while I leave for a while, riding my unicorn! Nice fantasy, @fpietrosanti!

From my closed mind, I do not understand how you expect to do it in tor2web, given that:

- tor2web does not know the URLs that are blocked, only their hashes; OK, you want to take action before we start hashing.
- tor2web still has a dirty blocking mechanism where one can put a domain, a full URL, or a path only; tor2web does not know which of these categories an entry belongs to and simply tests against three different hashes every time.
- your idea would be possible only by (1) spidering some of the website at time 1 and (2) fetching some of the content of the visited page before providing the content to the user. While I see the first option as unproblematic, doing the second part in tor2web is, for me, totally unfeasible.

besos!

lastknight commented 8 years ago

Not entirely in agreement with @evilaliv3: a similarity approach is easy to implement (and store).

See here for a couple of implementations: http://stackoverflow.com/questions/8897593/similarity-between-two-text-documents

During the first call on a blocked page, it is possible to calculate a text fingerprint and compare it with a pre-approved set.
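A minimal sketch of such a text fingerprint, assuming a SimHash-style 64-bit value compared by Hamming distance. The function names and the `max_distance` threshold are hypothetical choices for the example, and the threshold in particular would need tuning on real pages.

```python
import hashlib
import re


def simhash(text, bits=64):
    """Compute a SimHash-style fingerprint of the words in `text`."""
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def matches_stored(text, stored_fingerprints, max_distance=6):
    """True if the text's fingerprint is close to any stored fingerprint."""
    fp = simhash(text)
    return any(hamming(fp, other) <= max_distance for other in stored_fingerprints)
```

Each stored fingerprint is a single 64-bit integer per page, which is what makes this kind of approach cheap to store.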

My 2€c.

evilaliv3 commented 8 years ago

@lastknight: tor2web works in streaming mode with a buffer of only 1k and a sliding window of 0.5k; I would not find it feasible to keep a fingerprint for each 5k.

Storing a fingerprint for each 5k served would require more than 2 cents :)
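For reference, a fingerprint can in principle be updated chunk by chunk with constant memory. The sketch below is a hypothetical class, not tor2web code; it assumes the chunks arrive as already-decoded text and uses a SimHash-style 64-bit fingerprint. It also shows the catch: the result is only known once the whole page has gone through, that is, after the content has already been streamed to the user.

```python
import hashlib
import re


class StreamingFingerprint:
    """SimHash-style fingerprint updated one text chunk at a time."""

    def __init__(self, bits=64):
        self.bits = bits
        self.counts = [0] * bits
        self.tail = ""  # partial word carried over to the next chunk

    def _votes(self, token):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:self.bits // 8], "big")
        return [1 if (h >> i) & 1 else -1 for i in range(self.bits)]

    def update(self, chunk):
        data = self.tail + chunk.lower()
        # Hold back a trailing word: it may continue in the next chunk.
        match = re.search(r"\w+$", data)
        if match:
            self.tail = match.group(0)
            data = data[:match.start()]
        else:
            self.tail = ""
        for token in re.findall(r"\w+", data):
            for i, vote in enumerate(self._votes(token)):
                self.counts[i] += vote

    def digest(self):
        """Return the fingerprint for everything seen so far."""
        counts = list(self.counts)
        if self.tail:
            for i, vote in enumerate(self._votes(self.tail)):
                counts[i] += vote
        return sum(1 << i for i, count in enumerate(counts) if count > 0)
```

Memory use stays fixed (the counters plus one partial word) regardless of page size, but that does not remove the cost of computing, storing and comparing fingerprints for every response served.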

Blocking Onion Services based on "Similarity Matching" on already blocked ones #271