@kwilczynski thanks for digging in deep into this. You can see all the code we use for responding to the curl command here - https://github.com/kubernetes/registry.k8s.io/tree/main/cmd/archeio
it's a cloud run application running in google infra. While we do get the client IP, we do not try to parse User-Agent, you can see some of the code here:
https://github.com/kubernetes/registry.k8s.io/blob/5443169a57b2d5e8a583930b247e94ba36ca5772/pkg/net/clientip/clientip.go#L27-L37
please feel free to clone the repo and peek if you spot something!
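For reference, a rough paraphrase of what that client-IP handling amounts to (a sketch only, not the exact code in clientip.go; which X-Forwarded-For entry is trusted depends on the load balancer setup), the point being that User-Agent is never inspected here:

// Paraphrased sketch, not the exact code from clientip.go linked above.
package clientip

import (
	"fmt"
	"net/http"
	"net/netip"
	"strings"
)

// Get returns the client IP for a request that arrived via the load balancer.
func Get(r *http.Request) (netip.Addr, error) {
	xff := r.Header.Get("X-Forwarded-For")
	if xff == "" {
		return netip.Addr{}, fmt.Errorf("no X-Forwarded-For header")
	}
	// Take the left-most entry as the original client (illustrative choice;
	// the real code has to account for how many proxies sit in front).
	first := strings.TrimSpace(strings.Split(xff, ",")[0])
	return netip.ParseAddr(first)
}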
Scanning the github-verse quickly, the 403 may be an attempt by some application firewall (Cloud Armor) to reject traffic from some tools they consider hostile?
https://github.com/mazen160/bfac/blob/18fb0b5dc05005d4f39c242609bbf2347ca0d421/bfac#L257-L259
(No, i have no clue what other strings may be considered in the same fashion!)
[...]
it's a cloud run application running in google infra. While we do get the client IP, we do not try to parse User-Agent, you can see some of the code here: [...] Scanning the github-verse quickly, the 403 may be an attempt by some application firewall (Cloud Armor) to reject traffic from some tools they consider hostile? mazen160/bfac@18fb0b5/bfac#L257-L259
@dims, since the registry service itself is very simple, and we didn't expect it to be anything but, what blocks these requests is probably set up somewhere as part of the infrastructure that Google donates that runs and supports the registry itself.
You mentioned Cloud Armor. We were thinking that there is perhaps some sort of transparent proxy or WAF (Web Application Firewall) deployed somewhere, or even that the registry is fronted by Cloudflare or a similar service (which is also a popular setup).
The IP address 34.96.108.209 we get back for registry.k8s.io, which resolves to the same IP from different networks/locations, is within Google's 34.64.0.0/10 network. As such, I bet it's a WAF/Cloud Armor rule of sorts (and Cloud Armor is quite sophisticated) that is looking for the string "bfac" anywhere within the User-Agent value it gets as part of the request.
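If that's right, the check behind the 403 amounts to little more than a case-insensitive substring match over the whole header, something along these lines (illustrative only, not the actual Cloud Armor/CRS implementation; the token list and the sample User-Agent are made up):

package main

import (
	"fmt"
	"strings"
)

// Hypothetical excerpt of a scanner User-Agent word list; the real ruleset's
// data file is much longer.
var scannerTokens = []string{"bfac", "nikto", "sqlmap"}

// looksLikeScanner reports whether any token appears anywhere in the header.
func looksLikeScanner(userAgent string) bool {
	ua := strings.ToLower(userAgent)
	for _, token := range scannerTokens {
		if strings.Contains(ua, token) {
			return true
		}
	}
	return false
}

func main() {
	// Made-up User-Agent: a perfectly legitimate client gets flagged purely
	// because a commit hash happens to contain the letters "bfac".
	fmt.Println(looksLikeScanner("cri-o/1.27.1 commit/0bfac3d0deadbeef")) // true
}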
Would you be able to verify Cloud Armor configuration, just out of curiosity and to make sure it is indeed it?
Re: https://github.com/mazen160/bfac: the project has an option to randomly pick a different User-Agent so that it appears to be a popular browser, etc. As such, I am not sure how much "bad traffic" simply blocking "bfac" sheds; perhaps not a lot.
Would you be able to verify Cloud Armor configuration, just out of curiosity and to make sure it is indeed it?
We're using standard rules, the full configuration is open source:
https://registry.k8s.io => https://github.com/kubernetes/registry.k8s.io
The community deployment configs are documented in the k8s.io repo with the rest of the community infra deployments, but primarily here:
https://github.com/kubernetes/k8s.io/tree/main/infra/gcp/terraform/k8s-infra-oci-proxy-prod is the main deployment
The armor rules are here: https://github.com/kubernetes/k8s.io/blob/main/infra/gcp/terraform/modules/oci-proxy/cloud-armor.tf
I'm not sure which ruleset contains this, but we can drop most of these.
We shouldn't disable armor entirely because we're using a custom policy for rate limiting but most of these rule sets are probably irrelevant.
We can iterate on the staging instance (DO NOT depend on this endpoint, but for testing purposes we can iterate at registry-sandbox.k8s.io).
The other complication: The main reason we've kept these WAF rules is actually to deny spammy vuln scanner noise at the edge.
We get a TON of noisy requests from automated scanning (... and from pull-through caches attempting to pull anything and everything), and any request we can deny at the load balancer saves the project funds versus letting it through to the application we use to split valid requests between the different cloud storage endpoints ... funds we can use for CI etc. instead.
So, on balance, we'll still want to block known "attack" requests with the WAF, and it's much easier to use a pre-supplied ruleset than to develop and maintain our own.
@BenTheElder, thank you for some of this. Appreciated.
The rule that is catching this specific User-Agent is most definitely this one:
rule {
  action   = "deny(403)"
  priority = "920"

  match {
    expr {
      expression = "evaluatePreconfiguredWaf('scannerdetection-v33-stable', {'sensitivity': 1})"
    }
  }

  description = "Scanner detection"
  preview     = false
}
Per Google's own documentation:
This will lead us to the following:
This is where the OWASP folks decided to block some of the tools, including the "BFAC" project. Oddly, some of the project names from that list do pass... so it's a bit puzzling which version of the ruleset Google is using exactly.
That said, I don't think there is anything that we could sensibly do here....
Some of the projects already use different User-Agent strings, often mimicking curl or popular browsers, so there is no helping it there either, sadly.
As such, we on the CRI-O side will strip the extra build and release information from the User-Agent, which should limit the possibility of running into some other combination of letters, like "bfac", that would match the WAF rules.
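Roughly the shape of the change on our end (a sketch only; the function name and the exact User-Agent format are illustrative, not CRI-O's actual code):

package useragent

import "fmt"

// For builds the User-Agent from the component name and version only,
// deliberately dropping the git commit and other build metadata so that a
// stray letter combination in a hash can never trip a substring-based WAF
// rule. For("cri-o", "1.27.1", "<commit-sha>") returns "cri-o/1.27.1".
func For(name, version, gitCommit string) string {
	_ = gitCommit // intentionally unused
	return fmt.Sprintf("%s/%s", name, version)
}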
On the note of the WAF rules... I wish these were a bit tighter, such that they would match User-Agents more precisely, rather than just a specific word or letter combination anywhere within the entire header. However, it's faster this way and requires less maintenance over time, so it is what it is.
So, this is it, I suppose. Unless you have some more thoughts?
we on the CRI-O's side will strip the extra build and release information from the User-Agent
Yes please. That's it. (Rules are going to be constantly updated no matter what to keep up with new spam/bot crap)
@dims I think containerd also includes git commit for pre-release builds, but that's maybe less concerning since tagged releases don't (I think??) ... we should probably take a look at how likely we are to run into this again with other common tools.
I don't love any of the options here. We could invest in custom rules but I think it would take a lot of time and effort to maintain, at the moment this is pretty hands-off and we're spending a lot of time on other sustainability areas.
/cc @AkihiroSuda
So Suda-san can take a look at User-Agent in containerd.
[...]
I don't love any of the options here. We could invest in custom rules but I think it would take a lot of time and effort to maintain, at the moment this is pretty hands-off and we're spending a lot of time on other sustainability areas.
@BenTheElder, yeah. Like I said, it would be a headache, indeed.
Protecting the registry, whichever way we can, takes precedence here. This goes without saying.
/cc @AkihiroSuda
So Suda-san can take a look at User-Agent in containerd.
This was once discussed and rejected
At least this appears to be limited to non-release-tagged versions? But that's still going to impact someone at some point.
I was thinking about this some more: I think we could actually write some pretty simple rules that reject most garbage requests at the edge purely based on path, hope that's sufficient, and drop the standard WAF rules.
WIP at https://github.com/kubernetes/k8s.io/pull/6969
It will be a little bit more annoying to support additional endpoints in the future, but that seems OK
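The logic, spelled out as plain code rather than as a Cloud Armor expression (illustrative only; the actual allowed path set is whatever the PR above settles on), is roughly:

package main

import (
	"fmt"
	"regexp"
)

// Registry clients only talk to the OCI distribution API under /v2/ (plus
// the root path), so anything else can be denied at the edge before it ever
// reaches the application. The allowlist here is illustrative.
var validPath = regexp.MustCompile(`^/$|^/v2(/|$)`)

func allowedAtEdge(path string) bool {
	return validPath.MatchString(path)
}

func main() {
	fmt.Println(allowedAtEdge("/v2/pause/manifests/3.9")) // true
	fmt.Println(allowedAtEdge("/wp-login.php"))           // false
}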
This is now deployed, though I can't make promises about the behavior of any leaky backend hosts we redirect to.
We're considering handling that differently but it would be more of a long term project.
I don't think anything we currently use would block requests purely based on header substrings anymore, only invalid request paths, or excessive usage.
Is there an existing issue for this?
What did you expect to happen?
Following a maintenance release of CRI-O version 1.27, a number of CI jobs that Red Hat runs to test and verify our OpenShift releases reported issues while running the usual set of tests.
After looking closer at the actual error, we noticed that pulling images from registry.k8s.io was no longer possible, as the registry denied us access. An example of an image pull as reported by CRI-O:
Related:
Since no other CRI-O release was reporting issues while we ran the same tests against it, everyone involved in troubleshooting assumed that perhaps some combination of the following was causing the problem:
However, while troubleshooting further, we ruled out in due course the possibility of a regression or a code change introducing a new bug that would prevent CRI-O from pulling images. The code and the code paths responsible for fetching container images haven't changed in a while, and they are also used in other CRI-O releases that do not appear to have had any issues accessing the remote registry.
Thus, what was left was to investigate the requests and responses to and from the registry: look at how the requests were made, and ensure that no proxy (direct or transparent) was configured, that no other software was attempting to intercept network traffic, and that both round-trip times and network latency were acceptable. We also confirmed that there were no issues with DNS resolution and that we were consistently getting the same IP addresses back.
Having done all the due diligence, we decided to look into reproducing the same issue locally to see whether perhaps the source IP address our test servers were originating from was denied.
At this point, we were able to reliably reproduce the problem accessing the registry at registry.k8s.io with a simple curl invocation:
We now had a reliable way to reproduce the problem, and it wasn't specific to any network or location.
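For completeness, the same reproduction in a few lines of Go (the User-Agent below is a stand-in that merely contains the offending substring, not the exact header CRI-O sends):

package main

import (
	"fmt"
	"net/http"
)

func main() {
	req, err := http.NewRequest(http.MethodGet, "https://registry.k8s.io/v2/", nil)
	if err != nil {
		panic(err)
	}
	// Stand-in User-Agent containing the offending substring.
	req.Header.Set("User-Agent", "test-client/1.0 bfac")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// At the time of reporting this printed "403 Forbidden"; with an
	// unremarkable User-Agent it does not.
	fmt.Println(resp.Status)
}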
Then... with some trial and error...
We realised that there was something wrong with the User-Agent header we sent to the remote registry. After some more testing, we were able to determine that things worked fine once the following Git commit information (part of the build information our version string carries) was removed...
Removed from the User-Agent header:
With a little more testing, we were able to identify the string "bfac", which, when included anywhere within the User-Agent header, causes the registry to deny the request and return a 403.
... and only the string "bfac":
Clearly, the registry is blocking this specific User-Agent. Why? We aren't sure. We suspect it might have something to do with the following project:
Also, the odds of hitting this specific problem with a given Git build hash are so minuscule that it left us completely dumbfounded, in slight disbelief. :smile:
Debugging Information
The CRI-O version:
(using the crio --version command)
The User-Agent header CRI-O sets:
Other headers that CRI-O will typically set as part of the request:
Anything else?
We were thinking about a few options that could potentially address this issue, even though this appears to have been done on purpose rather than being accidental or a bug of sorts.
Some ideas include:
If none of the above options are feasible, we will consider a fix on our side, that is, on CRI-O's side.
Thoughts?
Code of Conduct