tor2web / Tor2web

Tor2web is an HTTP proxy software that enables access to Tor Hidden Services by means of common web browsers
https://www.tor2web.org
GNU Affero General Public License v3.0

Blocklist - child URLs should also be blocked #280

Closed wtf closed 8 years ago

wtf commented 8 years ago

For a blocklist entry such as blahblahblahblah.onion/a/b/ all of its child URLs should also be blocked, e.g. blahblahblahblah.onion/a/b/file1.php or blahblahblahblah.onion/a/b/file2.php.

evilaliv3 commented 8 years ago

thank you for raising this feedback on your use case, @obtuse

given the current basic hashed implementation this is not possible: from the hash of blahblahblahblah.onion/a/b/file1.php it is not possible to tell that blahblahblahblah.onion/a/b/file2.php should also be blocked.

this ticket already tracks the issue; feel free to contribute there: https://github.com/globaleaks/Tor2web/issues/42

wtf commented 8 years ago

It is definitely possible. If blahblahblahblah.onion/a/ is in the blocklist, we can definitely also block blahblahblahblah.onion/a/b/ and all other child URLs. Will submit a patch for this tomorrow!

evilaliv3 commented 8 years ago

i imagine you are thinking of doing an md5 for each character of the url, but this would not be feasible given the load it would cause.

or you have something else in mind?

to solve the issue, as i discussed on some other ticket, we should have a different filtering approach where each entry carries the md5, the kind of filter, and the length of the string before it is hashed

anyway thank you so much for getting on this!

virgil commented 8 years ago

IMHO supporting these only for cleartext blocklist entries seems easiest.

evilaliv3 commented 8 years ago

i'm not open to having dorks for critical content hosted by tor2web nodes :)

anyway the issue would not be solved by a cleartext list: cleartext or not, a flat list does not specify the kind of filter applied.

with flat i mean like now:

block_entry1
block_entry2
block_entry3

an idea i had (without adding any supporting database, and having in mind a backport of existing blocks) is a format like block_hash|length_unhashed|type:

block_entry1|20|0 <- type 0 is the one existing now, which sadly checks all of:
full url / path / subdomain (3 checks)
block_entry2|15|1 <- type 1 is the new one suggested by @obtuse: simply
take the first 15 chars of the path, apply the filter, and see if it matches

but such an implementation would have a linearly growing overhead: every entry added to the list causes an additional check per url

let's wait for @obtuse to clarify what he has in mind
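A sketch of the block_hash|length_unhashed|type idea described above (function names and the hex-md5 encoding are assumptions for illustration, not actual Tor2web code):

```python
import hashlib

# Hypothetical parser for "block_hash|length_unhashed|type" entries.
def load_blocklist(lines):
    entries = []
    for line in lines:
        digest, length, ftype = line.strip().split("|")
        entries.append((digest, int(length), int(ftype)))
    return entries

def is_blocked(url, entries):
    for digest, length, ftype in entries:
        if ftype == 0:
            candidate = url           # type 0: hash of the full url
        else:
            candidate = url[:length]  # type 1: hash of the first N chars only
        if hashlib.md5(candidate.encode()).hexdigest() == digest:
            return True
    return False
```

Note that, as pointed out above, the cost of this scheme grows linearly with the number of blocklist entries, since each entry requires its own hash comparison.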

fpietrosanti commented 8 years ago

I think we could introduce purely cleartext regexp filters on the URI parameter to fix all current and future filtering requirements in one shot, so anyone with a "more sophisticated filtering requirement" simply forgoes the md5-based storage property
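A minimal sketch of such a cleartext regexp filter on the URI (hypothetical names and patterns, not the eventual implementation):

```python
import re

# Hypothetical cleartext regexp blocklist applied to the request URI;
# each non-empty line of the list is compiled as a regular expression.
def load_regexp_filters(lines):
    return [re.compile(p.strip()) for p in lines if p.strip()]

def uri_blocked(uri, patterns):
    # block the request when any pattern matches anywhere in the URI
    return any(p.search(uri) for p in patterns)
```

A prefix block like /a/b/ then becomes the anchored pattern ^/a/b/, while arbitrary requirements (query strings, file extensions) fit into the same list.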

evilaliv3 commented 8 years ago

the solution implemented by @obtuse in https://github.com/globaleaks/Tor2web/pull/281 is brilliant; it does exactly what i had in mind, but reduced to each / in the url rather than every single char. it will consume more resources on hashes (one hash for each / in the url), but i would like to give it a try and see if the impact is acceptable.
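The slash-based approach can be sketched like this (a simplified illustration of the idea, not the actual code from pull #281):

```python
import hashlib

# Hash every '/'-terminated prefix of the url plus the full url itself,
# and block if any digest appears in the hashed blocklist.
def prefix_hashes(url):
    digests = []
    for i, ch in enumerate(url):
        if ch == "/":
            digests.append(hashlib.md5(url[: i + 1].encode()).hexdigest())
    digests.append(hashlib.md5(url.encode()).hexdigest())
    return digests

def is_blocked(url, blocklist):
    # blocklist is a set of md5 hex digests, so each lookup is O(1):
    # the per-url cost grows with the number of slashes in the url,
    # not with the number of blocklist entries
    return any(d in blocklist for d in prefix_hashes(url))
```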

thanks @obtuse

in addition to that, i'm now going to add support for a cleartext regexp based filter, to be used if and only if the required filter is not doable using the hashed filters.

wtf commented 8 years ago

Glad to help!

I was worried about the performance impact too, so I did a rudimentary performance analysis on a non-idle $5 VPS:

Time required to compute 1M unique md5 hashes: 2.2s
=> Time required per URL (15 hashes): ~0.033ms
=> URLs hashed per second: ~30k

Which is a <3.3% performance penalty while serving 1,000 requests per second.
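For reference, a quick way to reproduce this micro-benchmark (numbers will vary with hardware; the 15-hashes-per-URL figure is taken from the analysis above):

```python
import hashlib
import time

# Time 1M unique md5 hashes, then derive the per-URL cost
# assuming ~15 hashes per URL.
N = 1_000_000
HASHES_PER_URL = 15

start = time.perf_counter()
for i in range(N):
    hashlib.md5(b"blahblah.onion/a/b/%d" % i).digest()
elapsed = time.perf_counter() - start

ms_per_url = elapsed / N * HASHES_PER_URL * 1000
urls_per_sec = N / (elapsed * HASHES_PER_URL)
print("%d hashes in %.2fs -> %.4f ms/url, ~%.0f urls/s"
      % (N, elapsed, ms_per_url, urls_per_sec))
```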

Overall, the CPU usage of our tor2web nodes seems largely unaffected.

evilaliv3 commented 8 years ago

Great! thanks for your time and effort in analyzing it @obtuse!

today i did the following:

I'm now going to update the wiki page describing an example for the regexps use

wtf commented 8 years ago

Awesome!