Closed robingustafsson closed 4 years ago
Shouldn't we just use regular expressions instead of wildcards? It will be both more flexible, and easier to implement... And if we want to expose this in a user-friendly UI, converting wildcards to regular expressions is also trivial, while the reverse is impossible...
I'm looking into this.
@na-- I can see you want to ensure this doesn't impede valuable features.
What would you both think about an expanded syntax?
/(ruby|diamond)\.example\.com/
example.com
This can be accepted through all option channels. They can all be translated to regex internally so there's a single match logic.
--blacklist-hostname "/(ruby|diamond)\.example\.com/"
--blacklist-hostname "*.example.com"
--blacklist-hostname example.com
K6_BLACKLIST_HOSTNAMES="/(ruby|diamond)\.example\.com/"
K6_BLACKLIST_HOSTNAMES=*.example.com
K6_BLACKLIST_HOSTNAMES=example.com
blacklistHostnames: [ "/(ruby|diamond)\\.example\\.com/" ]
blacklistHostnames: [ "*.example.com" ]
blacklistHostnames: [ "example.com" ]
@bookmoons, yeah, this approach seems very good to me! The simple things are kept simple, but you can make complicated blocks as well.
I noticed that this issue has been stagnant for a while. Mind if I try taking it on?
@krashanoff, sure, go ahead!
@na-- Getting on this right now, just a quick question: should we enforce valid hostname patterns, or should that be left to the user? Also, should blacklisted hosts throw the same error code as blacklisted IPs, or use a new one?
@krashanoff, I don't think you can enforce valid hostname patterns, if you go with the approach suggested in https://github.com/loadimpact/k6/issues/972#issuecomment-500879650. I'd be tricky when you have --blacklist-hostname "/[\w]+\.example\.com/""
to say for sure if it always matches a valid domain.
That said, I just realized that implementing this with regexes will probably incur some potentially serious performance issues for large lists, similar to the issues of the current naive implementation of IP blocking (https://github.com/loadimpact/k6/issues/1256). If we allow regexes, I can't think of an easy implementation approach that is not just iterating over the list of blacklisted hostnames and checking against each... At that point we might as well extend this and call it "blacklisted URLs" and allow blocking by any part of the URL...
Whereas if we allow only plain hostnames and leaf *
wildcards, there's a somewhat easy way to efficiently implement it, by reversing the domains and constructing a map (flat or nested) or a trie, so that the lookup will be much faster... This only makes sense if we expect to have large blacklists often though. For a few items, iterating over them will probably be faster...
And we might also want to use a different option name than --blacklist-hostname
... While I don't think the etymology of the word is problematic (see also this), in the context of k6, --block-hostname
, --skip-hostname
and --ignore-hostname
are more than viable alternatives. Personally, I think I prefer --ignore-hostname
, since it conveys the intention behind the option even more clearly to me...
In short, sorry for the delay @krashanoff, but I think there's some further evaluation of this issue needed, before you can start implementing it. cc @robingustafsson @MStoykov @imiric, what are your thoughts about :arrow_up: ?
Agree with most of what you said @na--.
We shouldn't bother with validating hostnames from this list, except for maybe the minimum of [0-9A-Za-z\.-]
before starting the test. Otherwise we'd have to keep this validation up to date, which is an exercise in frustration to say the least.
Regexes will be expensive during the test, but a quick ASCII sanity check before building the trie would likely be fine.
--ignore-hostname
sounds good, though --block-hostname
will probably be more familiar to everyone. And yeah, let's avoid using "blacklist" given the current climate. :)
I like the idea of the trie and reversing it ( I remember seeing this done for exactly this, somewhere). So we propose that:
*
as a whole label within the hostname such as *.example.com
*something.example.com
so it catches whateversomething.example.com
. something*.example.com
catching somethingelse.example.com
something.*.example.com
meaning that else.example.com
is fine but something.else.example.com
isn't *.example.com
mean that something.else.example.com
is fine or will it get blocked? or will that need to be **.example.com
for example?Given that k6 won't just ignore them but actually block them I would argue for --block-hostname
Not really certain if checking the input is really worth it ... but if it comes naturally I am fine with it.
I do think that the error should be a different one though :D
You've convinced me about the option name, so :+1: for --block-hostname
/ K6_BLOCK_HOSTNAMES
/ blockHostnames
from me as well.
In regards to the wildcard, in the initial version of this, I think we should only allow wildcards in the beginning of the value, i.e. *something
, so that when you reverse the string, the wildcard will be the leaf of the trie. And that wildcard should block parts of the subdomain and whole chains of subdomains afterwards, i.e. --block-hostname="*test.example.com"
should block all of the following domains:
test.example.com
thetest.example.com
sub.test.example.com
sub.sub.test.example.com
This seems relatively easy to implement if you treat domains as reversed strings, while also fairly powerful and devoid of any surprise behavior to users, I think...
Agree with what's been said. We definitely need wildcards because the block lists would otherwise potentially be even larger than they'll already be (eg. k6 Cloud). Only allowing the wildcard in the beginning is fine.
Just throwing something out there... In the majority of my tests, i usually have the opposite need - to only allow specific hosts (ie; my own domain) and deny all the others (which are invariably third parties who i don't want to annoy).
Would a companion option (--allow-hostname
?) using a similar mechanism to this, or even a single option that covers both (--filter-hostname
?) make sense?
Thank you all for taking the time to write back! This is my first time contributing to open source so it's been a good experience. Just summing up what I gleaned from the above conversation, it seems like this is the consensus on implementation:
--block-hostname
, K6_BLOCK_HOSTNAMES
, and blockHostnames
.[0-9A-Za-z\.-\*]
.foo.*.bar
is not permitted, but something like *.foo.bar
is.The only thing that is really left up to debate at this point is whether regexes should be permitted. I personally think it would be nice to have, but as @na-- mentioned, there's big issues with performance and implementation. I was tooling around with using Go's regexp
for blacklisting last night, and found that things get nasty fast if one mirrors the current implementation pattern of IP blacklisting with regexp. Lines like (*net.IPNet)(ipnet).Contains(ip)
are not ideal in my opinion.
Pertaining to @hynd's idea of --allow-hostname
, I think it would make sense to have iff it were mutually exclusive with --block-hostname
.
@krashanoff, @hynd, sorry for the late response... :disappointed:
Regarding --allow-hostname
or something similar, can you elaborate more on your use case? To me, --allow-hostname
seems like an option that's more suitable for the various converters, especially the HAR to k6 converters. And indeed, we have the --only
flag in the old k6 convert
HAR converter and we plan to implement something like that in the new standalone HAR converter: https://github.com/loadimpact/har-to-k6/issues/39
The biggest use case I can think for this is, when you've recorded a browser session, to exclude requests to any external services like analytics, ads, script CDNs, etc. So if you record an example.com
browser session, you can just skip all requests to external websites recorded in the HAR file by just allowing --allow-hostname="*.example.com"
. It's better for these requests to never be included in the k6 script file at all, both to keep it cleaner, and to prevent you from having a bunch of errors in the metrics.
Whereas, to me, the biggest use case for an --block-hostname
flag would be abuse prevention for things like k6 cloud. We can't control what scripts people run, but if a someone complains that our service is being used to DDoS their website/API/etc. without their consent, we'd be easily able to add their domain to the list of blocked ones.
There are other benefits and non-k6-cloud uses of the --block-hostname
flag, of course. It can also be used to block external requests, albeit more clumsily than something like --allow-hostname
, and it can allow people to selectively exclude some parts of a test run temporarily... But for a lot of these use cases there are usually other, somewhat cleaner, ways to achieve them, like environment variables and if
statements, or the upcoming #1007 scenarios
option...
@krashanoff, regarding your other points, :+1: for the option name, new error code, simple validation (though I'm not sure how these should be handled), wildcards only in the beginning, trie-based implementation.
:-1: for regex support, for the initial version of this feature at least... We can always change that decision in the future, and https://github.com/loadimpact/k6/issues/972#issuecomment-500879650 illustrates how we can add regex support in a backwards compatible way, if we decide to do so...
Ah fair point,--allow
does seem more appropriate on the converters - i think i was probably conflating the two steps and hadn't spotted the --only
flag. I guess unlike some other tools which can slurp up any referenced media etc at run time, it's not so easy (in a good way!) with k6 to go chasing down arbitrary urls (and if my endpoint redirects requests to some random domain by mistake, that's on me 😊). Thanks for your detailed reply!
No worries about the late reply, thank you for taking the time. I think a preliminary implementation will just validate wildcard placement. Then, to account for internationalized domain names, we can use a combined character class of (\pL|[0-9])
. In any case, I'll just implement the check as a regex so it can be changed later. For handling intl' names in the trie, we can just use Runes for each node value.
Thank you for poring over the details and the detailed responses. Will get to work on this very soon :+1:
Similar to how k6 supports blacklisting of IP ranges it should allow blacklisting of hostnames, including wildcards.
I propose we add this functionality as follows:
error_code
anderror
tags should be emitted and a JSBlacklistedHotname
error will be thrown, that unless caught will exit the current VU iteration.--blacklist-hostname HOSTNAME
that can be specified multiple times, and whereHOSTNAME
can contain*
wildcards.K6_BLACKLIST_HOSTNAMES
with a comma separated list of hostnames to block, again wildcards allowed.blacklistHostnames
:Pointers for implementation: