Add blacklisting based on hostname

robingustafsson commented 5 years ago

Similar to how k6 supports blacklisting of IP ranges it should allow blacklisting of hostnames, including wildcards.

I propose we add this functionality as follows:

If the hostname of a requested URL matches an entry in the blacklist no DNS, TCP or HTTP traffic is generated to that host whatsoever, and the necessary error_code and error tags should be emitted and a JS BlacklistedHotname error will be thrown, that unless caught will exit the current VU iteration.
A new CLI option --blacklist-hostname HOSTNAME that can be specified multiple times, and where HOSTNAME can contain * wildcards.
A new env var option K6_BLACKLIST_HOSTNAMES with a comma separated list of hostnames to block, again wildcards allowed.

A new JS option blacklistHostnames:

export let options = {
blacklistHostnames: ["example.com", "*.example.com"]
};

Pointers for implementation:

See how the blacklisting of IP blocks is implemented: https://github.com/loadimpact/k6/blob/master/lib/netext/dialer.go#L69

na-- commented 5 years ago

Shouldn't we just use regular expressions instead of wildcards? It will be both more flexible, and easier to implement... And if we want to expose this in a user-friendly UI, converting wildcards to regular expressions is also trivial, while the reverse is impossible...

bookmoons commented 5 years ago

I'm looking into this.

@na-- I can see you want to ensure this doesn't impede valuable features.

What would you both think about an expanded syntax?

Wrap a specifier in // to define with a regex, in golang syntax. /(ruby|diamond)\.example\.com/
An unwrapped specifier allows simple wildcards. `.example.com`
An unwrapped specifier without wildcards is a static host. example.com

This can be accepted through all option channels. They can all be translated to regex internally so there's a single match logic.

--blacklist-hostname "/(ruby|diamond)\.example\.com/"
--blacklist-hostname "*.example.com"
--blacklist-hostname example.com

K6_BLACKLIST_HOSTNAMES="/(ruby|diamond)\.example\.com/"
K6_BLACKLIST_HOSTNAMES=*.example.com
K6_BLACKLIST_HOSTNAMES=example.com

blacklistHostnames: [ "/(ruby|diamond)\\.example\\.com/" ]
blacklistHostnames: [ "*.example.com" ]
blacklistHostnames: [ "example.com" ]

na-- commented 5 years ago

@bookmoons, yeah, this approach seems very good to me! The simple things are kept simple, but you can make complicated blocks as well.

krashanoff commented 4 years ago

I noticed that this issue has been stagnant for a while. Mind if I try taking it on?

na-- commented 4 years ago

@krashanoff, sure, go ahead!

krashanoff commented 4 years ago

@na-- Getting on this right now, just a quick question: should we enforce valid hostname patterns, or should that be left to the user? Also, should blacklisted hosts throw the same error code as blacklisted IPs, or use a new one?

na-- commented 4 years ago

@krashanoff, I don't think you can enforce valid hostname patterns, if you go with the approach suggested in https://github.com/loadimpact/k6/issues/972#issuecomment-500879650. I'd be tricky when you have --blacklist-hostname "/[\w]+\.example\.com/"" to say for sure if it always matches a valid domain.

That said, I just realized that implementing this with regexes will probably incur some potentially serious performance issues for large lists, similar to the issues of the current naive implementation of IP blocking (https://github.com/loadimpact/k6/issues/1256). If we allow regexes, I can't think of an easy implementation approach that is not just iterating over the list of blacklisted hostnames and checking against each... At that point we might as well extend this and call it "blacklisted URLs" and allow blocking by any part of the URL...

Whereas if we allow only plain hostnames and leaf * wildcards, there's a somewhat easy way to efficiently implement it, by reversing the domains and constructing a map (flat or nested) or a trie, so that the lookup will be much faster... This only makes sense if we expect to have large blacklists often though. For a few items, iterating over them will probably be faster...

And we might also want to use a different option name than --blacklist-hostname... While I don't think the etymology of the word is problematic (see also this), in the context of k6, --block-hostname, --skip-hostname and --ignore-hostname are more than viable alternatives. Personally, I think I prefer --ignore-hostname, since it conveys the intention behind the option even more clearly to me...

In short, sorry for the delay @krashanoff, but I think there's some further evaluation of this issue needed, before you can start implementing it. cc @robingustafsson @MStoykov @imiric, what are your thoughts about :arrow_up: ?

imiric commented 4 years ago

Agree with most of what you said @na--.

We shouldn't bother with validating hostnames from this list, except for maybe the minimum of [0-9A-Za-z\.-] before starting the test. Otherwise we'd have to keep this validation up to date, which is an exercise in frustration to say the least.
Regexes will be expensive during the test, but a quick ASCII sanity check before building the trie would likely be fine.
--ignore-hostname sounds good, though --block-hostname will probably be more familiar to everyone. And yeah, let's avoid using "blacklist" given the current climate. :)

mstoykov commented 4 years ago

I like the idea of the trie and reversing it ( I remember seeing this done for exactly this, somewhere). So we propose that:

the only thing you can do as a "wildcard" will be to have * as a whole label within the hostname such as *.example.com
Or we want to support *something.example.com so it catches whateversomething.example.com.
How about something*.example.com catching somethingelse.example.com
Is something.*.example.com meaning that else.example.com is fine but something.else.example.com isn't
Also does *.example.com mean that something.else.example.com is fine or will it get blocked? or will that need to be **.example.com for example?

Given that k6 won't just ignore them but actually block them I would argue for --block-hostname
Not really certain if checking the input is really worth it ... but if it comes naturally I am fine with it.
I do think that the error should be a different one though :D

na-- commented 4 years ago

You've convinced me about the option name, so :+1: for --block-hostname / K6_BLOCK_HOSTNAMES / blockHostnames from me as well.

In regards to the wildcard, in the initial version of this, I think we should only allow wildcards in the beginning of the value, i.e. *something, so that when you reverse the string, the wildcard will be the leaf of the trie. And that wildcard should block parts of the subdomain and whole chains of subdomains afterwards, i.e. --block-hostname="*test.example.com" should block all of the following domains:

test.example.com
thetest.example.com
sub.test.example.com
sub.sub.test.example.com

This seems relatively easy to implement if you treat domains as reversed strings, while also fairly powerful and devoid of any surprise behavior to users, I think...

robingustafsson commented 4 years ago

Agree with what's been said. We definitely need wildcards because the block lists would otherwise potentially be even larger than they'll already be (eg. k6 Cloud). Only allowing the wildcard in the beginning is fine.

hynd commented 4 years ago

Just throwing something out there... In the majority of my tests, i usually have the opposite need - to only allow specific hosts (ie; my own domain) and deny all the others (which are invariably third parties who i don't want to annoy). Would a companion option (--allow-hostname?) using a similar mechanism to this, or even a single option that covers both (--filter-hostname?) make sense?

krashanoff commented 4 years ago

Thank you all for taking the time to write back! This is my first time contributing to open source so it's been a good experience. Just summing up what I gleaned from the above conversation, it seems like this is the consensus on implementation:

The option will be named --block-hostname, K6_BLOCK_HOSTNAMES, and blockHostnames.
Blocked hostname errors will have a new code thrown.
Don't bother validating blocked hostname strings past a short check for [0-9A-Za-z\.-\*].
Wildcards are only permitted at the start of a value. Something like foo.*.bar is not permitted, but something like *.foo.bar is.
Blocked hostname strings will be stored backwards in a trie.

The only thing that is really left up to debate at this point is whether regexes should be permitted. I personally think it would be nice to have, but as @na-- mentioned, there's big issues with performance and implementation. I was tooling around with using Go's regexp for blacklisting last night, and found that things get nasty fast if one mirrors the current implementation pattern of IP blacklisting with regexp. Lines like (*net.IPNet)(ipnet).Contains(ip) are not ideal in my opinion.

Pertaining to @hynd's idea of --allow-hostname, I think it would make sense to have iff it were mutually exclusive with --block-hostname.

na-- commented 4 years ago

@krashanoff, @hynd, sorry for the late response... :disappointed:

Regarding --allow-hostname or something similar, can you elaborate more on your use case? To me, --allow-hostname seems like an option that's more suitable for the various converters, especially the HAR to k6 converters. And indeed, we have the --only flag in the old k6 convert HAR converter and we plan to implement something like that in the new standalone HAR converter: https://github.com/loadimpact/har-to-k6/issues/39

The biggest use case I can think for this is, when you've recorded a browser session, to exclude requests to any external services like analytics, ads, script CDNs, etc. So if you record an example.com browser session, you can just skip all requests to external websites recorded in the HAR file by just allowing --allow-hostname="*.example.com". It's better for these requests to never be included in the k6 script file at all, both to keep it cleaner, and to prevent you from having a bunch of errors in the metrics.

Whereas, to me, the biggest use case for an --block-hostname flag would be abuse prevention for things like k6 cloud. We can't control what scripts people run, but if a someone complains that our service is being used to DDoS their website/API/etc. without their consent, we'd be easily able to add their domain to the list of blocked ones.

There are other benefits and non-k6-cloud uses of the --block-hostname flag, of course. It can also be used to block external requests, albeit more clumsily than something like --allow-hostname, and it can allow people to selectively exclude some parts of a test run temporarily... But for a lot of these use cases there are usually other, somewhat cleaner, ways to achieve them, like environment variables and if statements, or the upcoming #1007 scenarios option...

@krashanoff, regarding your other points, :+1: for the option name, new error code, simple validation (though I'm not sure how these should be handled), wildcards only in the beginning, trie-based implementation.

:-1: for regex support, for the initial version of this feature at least... We can always change that decision in the future, and https://github.com/loadimpact/k6/issues/972#issuecomment-500879650 illustrates how we can add regex support in a backwards compatible way, if we decide to do so...

hynd commented 4 years ago

Ah fair point,--allow does seem more appropriate on the converters - i think i was probably conflating the two steps and hadn't spotted the --only flag. I guess unlike some other tools which can slurp up any referenced media etc at run time, it's not so easy (in a good way!) with k6 to go chasing down arbitrary urls (and if my endpoint redirects requests to some random domain by mistake, that's on me 😊). Thanks for your detailed reply!

krashanoff commented 4 years ago

No worries about the late reply, thank you for taking the time. I think a preliminary implementation will just validate wildcard placement. Then, to account for internationalized domain names, we can use a combined character class of (\pL|[0-9]). In any case, I'll just implement the check as a regex so it can be changed later. For handling intl' names in the trie, we can just use Runes for each node value.

Thank you for poring over the details and the detailed responses. Will get to work on this very soon :+1:

grafana / k6

Add blacklisting based on hostname #972