Support blacklisting based on "Regexp" to filter out bad-sites

fpietrosanti commented 10 years ago

It happened that some "cryptolocker" initiatives started using Tor2web, causing major issue in server stability due to takedown.

They often change Onion Address, making de-facto difficult to blacklist them

This ticket is to introduce supporto for blacklisting "Strings" or "Regexp" in order to filter out entire web page or entire web applications, regardless of the onion address.

Example of bad site to be filtered out based on strings rather than onion address "kpai7ycr7jxqkilp.onion"

--- Want to back this issue? **[Post a bounty on it!](https://www.bountysource.com/issues/3764136-support-blacklisting-based-on-regexp-to-filter-out-bad-sites?utm_campaign=plugin&utm_content=tracker%2F318575&utm_medium=issues&utm_source=github)** We accept bounties via [Bountysource](https://www.bountysource.com/?utm_campaign=plugin&utm_content=tracker%2F318575&utm_medium=issues&utm_source=github).

fpietrosanti commented 9 years ago

That could be part of #67

evilaliv3 commented 9 years ago

ok as i think this is strongly needed nowadays, I will readly implement this (eventually by using my joburg-italy flight to do it :)

fpietrosanti commented 9 years ago

It would be good to have support for a dedicated block-page given a specific regexp.

That way we could do a filter to block cryptolocker, with a dedicated cryptolocker help/support-page.

But we could also in future block .exe downloads by giving a specific page of instructions on why and how it's safe to download stuff from untrusted sources, etc, etc

hellais commented 9 years ago

I think that starting to do filtering based on content is a very dangerous rabbit hole. It's one this to block specific URLs when we receive a takedown notice, but doing pre-emptive blocking based on content is something I don't consider good. It's basically doing the job of a very advanced censor. I am strongly against this feature.

fpietrosanti commented 9 years ago

@hellais there is no way to avoid that, cryptolocker is causing takedown of all tor2web sites, they change the TorHS address, so the only way to prevent them from using Tor2web is to apply a filter that look at the specific pattern of their webpage. :(

IMHO it's not a matter of thinking whenever to do it or not (because we have no choice, other than shutting down tor2web), but it's a matter to think how to do it properly.

Doing it properly imho means:

doing it transparently (like today, the blocklist is public)
doing it explicitly (with a dedicated page, providing useful information to users victim of cryptolocker)

evilaliv3 commented 9 years ago

wait @fpietrosanti. i agree to eventually make this regexps lists public but your statement "foing it transparently (like today, the blocklist is public)" is not true. currently the blacklist is hashed so already we are filtering in a not transaparent way but as you know there is no solution for this.

the overhead of applying a regexp filtering would be a lot as currently we are applying regexp replacements only on HTML files while in the future we will have to check whatever content, but the problem i see is not this.

but i've reflected on the problem and it's not so easy to implement a filter based on detection: suppose a website like this: antani.onion/info1.html antani.onion/info2.html antani.onion/malware.exe how do you suggest to reach the goal to filter them all under the following conditions: 1) you don't know the name antani.onion 2) we want to apply a maximum of 1 regexp per website to make it feasible

the only solution i see is eventually to implement a detection on malware.exe and disable the entire hidden service is it serves a file with a specific sequence but also in this case it would be really really really simple for a website implement a malware that results always in a different payload.

lastknight commented 9 years ago

When I proposed this 3y ago, foreseeing need of this kind of pattern, I was almost stoned in a public place.

Nice to see I was right to begin with and that zealots seem to lose their Zealotism when the surname proposing censorship on regexp isn't mine.

But we know it: Coherence is an overstated virtue...

Matteo G.P. Flora - CEO & Founder The Fool srl - The Digital Reputation Company

Via Merano, 16 | 20127 Milano | http://thefool.it off. +39.02-00613665 | mob. +39.347.9676430

The Fool is part of Samadhi Holding - http://samadhi.io Il 08/nov/2014 18:24 "Giovanni Pellerano" notifications@github.com ha scritto:

wait @fpietrosanti https://github.com/fpietrosanti. i agree to eventually make this regexps lists public but your statement "foing it transparently (like today, the blocklist is public)" is not true. currently the blacklist is hashed so already we are filtering in a not transaparent way but as you know there is no solution for this.

the overhead of applying a regexp filtering would be a lot as currently we are applying regexp replacements only on HTML files while in the future we will have to check whatever content, but the problem i see is not this.

but i've reflected on the problem and it's not so easy to implement a filter based on detection: suppose a website like this: antani.onion/info1.html antani.onion/info2.html antani.onion/malware.exe how do you suggest to reach the goal to filter them all under the following conditions: 1) you don't know the name antani.onion 2) we want to apply a maximum of 1 regexp per website to make it feasible

the only solution i see is eventually to implement a detection on malware.exe and disable the entire hidden service is it serves a file with a specific sequence but also in this case it would be really really really simple for a website implement a malware that results always in a different payload.

— Reply to this email directly or view it on GitHub https://github.com/globaleaks/Tor2web-3.0/issues/151#issuecomment-62265749 .

fpietrosanti commented 9 years ago

In the meantime we got contact with few AV vendors that collaboratively shared lists of CryptoWall URLs and are willing to do so in the future to help T2W

hellais commented 9 years ago

Just to note my objections also here on the public ticket. I think overall this approach is not something that will really end up solving the issue at hand. If the sites in question are already rotating their hidden service addresses, what stops them from also dynamically changing the content of their page (or whatever other fingerprint we come up with) to avoid being blocked?

We have seen from the censor vs circumvention fight that whatever a censor does to block access to some content there will always be a way to circumvent that blocking. Blacklist based filtering simply does not work.

So to sum it up I think that this solution adds as negative consequences:

1) The fact that we will start doing blocking based on content that is, in my opinion, ethically wrong.

2) The fact that we add an important computational overhead to tor2web (we need to run regular expressions on every payload we serve)

3) It's more time consuming to update a regexp fingerprint than it is to add a URL for filtering, making the updating process more cumbersome.

... and it doesn't even solve the problem in a definitive way.

Overall I don't think this strategy is something that should be pursued, but the problem should be tackled with the arms that we already have at hand (legal, getting lists of URLs, etc.)

virgil commented 9 years ago

I'm not at liberty to discuss the details in an open-forum, but in short the URLs I wish to see blocked do not intentionally try to get through to the clear-web, and they do not rotate their directory structure---just the .onions.

fpietrosanti commented 9 years ago

I think that if such a functionality it's required, there are 2-3 botnet C&C that cannot be blocked in other way.

The alternative is to place a snort network IDS running on 127.0.0.1 with active TCP RST connection reset.

It's probably better to have something like that in tor2web.

tor2web / Tor2web

Support blacklisting based on "Regexp" to filter out bad-sites #151