kubernetes / registry.k8s.io

This project is the repo for registry.k8s.io, the production OCI registry service for Kubernetes' container image artifacts
https://registry.k8s.io
Apache License 2.0

Unable to access the registry when a specific User-Agent header is set #286

Closed: kwilczynski closed this issue 2 months ago

kwilczynski commented 2 months ago


What did you expect to happen?

Following a maintenance release of CRI-O 1.27, a number of CI jobs that Red Hat runs to test and verify our OpenShift releases started reporting failures while running the usual set of tests.

After looking closer at the actual error, we noticed that pulling images from registry.k8s.io was no longer possible, as the registry denied us access. An example of a failed image pull as reported by CRI-O:

[2024-07-02T11:02:50.795Z]             cluster.go:162: E0702 11:02:44.839263    2072 remote_runtime.go:176] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = creating pod sandbox with name \"k8s_rhcos-crio-pod-restart-test_redhat.test.crio__0\": initializing source docker://registry.k8s.io/pause:3.9: pinging container registry registry.k8s.io: StatusCode: 403, <!doctype html><meta charset=\"utf-8\"><meta name=vi..."
[2024-07-02T11:02:50.795Z]             cluster.go:162: time="2024-07-02T11:02:44Z" level=fatal msg="run pod sandbox: rpc error: code = Unknown desc = creating pod sandbox with name \"k8s_rhcos-crio-pod-restart-test_redhat.test.crio__0\": initializing source docker://fg/pause:3.9: pinging container registry registry.k8s.io: StatusCode: 403, <!doctype html><meta charset=\"utf-8\"><meta name=vi..."

Related:

Since no other CRI-O release was reporting issues while we ran the same tests against it, everyone involved in troubleshooting this issue assumed that perhaps some combination of the following was causing the problem:

However, while troubleshooting the issue further, the possibility of a regression, or of some code change introducing a new bug that leaves CRI-O unable to pull images, was ruled out in due course. The code, and the code paths leading to the code responsible for fetching container images, haven't changed in a while, and they are also used in other CRI-O releases that do not appear to have had any issues accessing the remote registry.

Thus, what was left was to investigate the requests and responses to and from the registry: look at how the requests were made, ensure that there wasn't any proxy (direct or transparent) in the path or other software attempting to intercept network traffic, and confirm that both the round-trip times and network latency were acceptable. We also confirmed that there were no issues with DNS resolution and that we were consistently getting the same IP addresses back.

Having done all the due diligence, we decided to try to reproduce the same issue locally, to see whether perhaps the source IP address our test servers were originating from had been denied.

At this point, we were able to reproduce the problem accessing the registry at registry.k8s.io reliably with a simple curl invocation:

$ curl -vL -H 'User-Agent: cri-o/1.27.8-2.rhaos4.14.gitbfac241.el9 go/go1.20.12 os/linux arch/amd64' https://registry.k8s.io/ ; echo
* Host registry.k8s.io:443 was resolved.
* IPv6: (none)
* IPv4: 34.96.108.209
*   Trying 34.96.108.209:443...
* Connected to registry.k8s.io (34.96.108.209) port 443
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384 / x25519 / RSASSA-PSS
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=registry.k8s.io
*  start date: Jul  1 14:56:31 2024 GMT
*  expire date: Sep 29 15:52:26 2024 GMT
*  subjectAltName: host "registry.k8s.io" matched cert's "registry.k8s.io"
*  issuer: C=US; O=Google Trust Services; CN=WR3
*  SSL certificate verify ok.
*   Certificate level 0: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
*   Certificate level 1: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
*   Certificate level 2: Public key type RSA (4096/152 Bits/secBits), signed using sha384WithRSAEncryption
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://registry.k8s.io/
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: registry.k8s.io]
* [HTTP/2] [1] [:path: /]
* [HTTP/2] [1] [accept: */*]
* [HTTP/2] [1] [user-agent: cri-o/1.27.8-2.rhaos4.14.gitbfac241.el9 go/go1.20.12 os/linux arch/amd64]
> GET / HTTP/2
> Host: registry.k8s.io
> Accept: */*
> User-Agent: cri-o/1.27.8-2.rhaos4.14.gitbfac241.el9 go/go1.20.12 os/linux arch/amd64
> 
* Request completely sent off
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
< HTTP/2 403 
< content-type: text/html; charset=UTF-8
< content-length: 134
< via: 1.1 google
< date: Fri, 05 Jul 2024 05:11:09 GMT
< alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
< 
* Connection #0 to host registry.k8s.io left intact
<!doctype html><meta charset="utf-8"><meta name=viewport content="width=device-width, initial-scale=1"><title>403</title>403 Forbidden

We now had a reliable way to reproduce the problem, and it wasn't specific to any network or location.

Then... with some trial and error...

$ curl -L -H 'User-Agent: banana' https://registry.k8s.io/ 1>/dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    81  100    81    0     0   1290      0 --:--:-- --:--:-- --:--:--  1327
100  342k    0  342k    0     0  3196k      0 --:--:-- --:--:-- --:--:-- 3196k

We realised that there was something wrong with the User-Agent header we were sending to the remote registry. With a little more testing, we were able to determine that once the following Git commit information (part of the build information carried in the version string) was removed, things were working fine...

Removed from the User-Agent header:

gitbfac241

A little more testing narrowed it down to the string "bfac", which, when included anywhere within the User-Agent header, would cause the registry to deny the request and return a 403.

... and only the string "bfac":

$ for c in {a..z} ; do echo -ne $c ; curl -L -H "User-Agent: bfa${c}" https://registry.k8s.io/v2/ ; echo ; done
a
b
c<!doctype html><meta charset="utf-8"><meta name=viewport content="width=device-width, initial-scale=1"><title>403</title>403 Forbidden
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
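
For reference, the same check is easy to express in Go. This is only a minimal sketch mirroring the curl loop above; the endpoint and the probed strings come from the output above, everything else is illustrative:

```go
// Probe registry.k8s.io/v2/ with candidate User-Agent values and report
// which ones come back as 403 Forbidden.
package main

import (
	"fmt"
	"net/http"
)

func main() {
	for c := 'a'; c <= 'z'; c++ {
		ua := fmt.Sprintf("bfa%c", c)

		req, err := http.NewRequest(http.MethodGet, "https://registry.k8s.io/v2/", nil)
		if err != nil {
			panic(err)
		}
		req.Header.Set("User-Agent", ua)

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			panic(err)
		}
		resp.Body.Close()

		fmt.Printf("%s -> %d\n", ua, resp.StatusCode)
	}
}
```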

Clearly, the registry is blocking this specific User-Agent. Why? We aren't sure. We suspect it might have something to do with the following project: https://github.com/mazen160/bfac (BFAC)

Also, the odds of us hitting this specific problem with a given Git build hash are so minuscule that it left us dumbfounded and in slight disbelief. :smile:

Debugging Information

The CRI-O version:

(using the crio --version command)

crio version 1.27.8-2.rhaos4.14.gitbfac241.el9
Version:        1.27.8-2.rhaos4.14.gitbfac241.el9
GitCommit:      unknown
GitCommitDate:  unknown
GitTreeState:   clean
GoVersion:      go1.20.12
Compiler:       gc
Platform:       linux/amd64
Linkmode:       dynamic
BuildTags:
  rpm_crashtraceback
  libtrust_openssl
  selinux
  seccomp
  exclude_graphdriver_devicemapper
  exclude_graphdriver_btrfs
  containers_image_ostree_stub
LDFlags:           -compressdwarf=false -B 0x2be32a9e8db771d3a24a6d85221072f20f944ef7 -extldflags '-Wl,-z,relro -Wl,--as-needed  -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 '
SeccompEnabled:   true
AppArmorEnabled:  false

The User-Agent header CRI-O sets:

cri-o/1.27.8-2.rhaos4.14.gitbfac241.el9 go/go1.20.12 os/linux arch/amd64

Other headers that CRI-O will typically set as part of the request:

Host: registry.k8s.io
User-Agent: cri-o/1.27.8-2.rhaos4.14.gitbfac241.el9 go/go1.20.12 os/linux arch/amd64
Accept: application/vnd.oci.image.manifest.v1+json
Accept: application/vnd.docker.distribution.manifest.v2+json
Accept: application/vnd.docker.distribution.manifest.v1+prettyjws
Accept: application/vnd.docker.distribution.manifest.v1+json
Accept: application/vnd.docker.distribution.manifest.list.v2+json
Accept: application/vnd.oci.image.index.v1+json
Docker-Distribution-Api-Version: registry/2.0
Accept-Encoding:     gzip

Anything else?

We were thinking about a few options that could potentially address this issue, even though the blocking appears to have been done on purpose rather than being an accident or a bug of sorts.

Some ideas include:

If none of the above options are feasible, we will consider a fix on our side, that is, on CRI-O's side.

Thoughts?


dims commented 2 months ago

@kwilczynski thanks for digging deep into this. You can see all the code we use for responding to the curl command here - https://github.com/kubernetes/registry.k8s.io/tree/main/cmd/archeio

it's a cloud run application running in google infra. While we do get the client IP, we do not try to parse User-Agent, you can see some of the code here: https://github.com/kubernetes/registry.k8s.io/blob/5443169a57b2d5e8a583930b247e94ba36ca5772/pkg/net/clientip/clientip.go#L27-L37
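
(Roughly speaking, a client-IP helper of that shape looks something like the sketch below. This is simplified and not the exact code linked above; the header handling here is an assumption. The point stands: there is no User-Agent parsing anywhere.)

```go
// Simplified sketch of a client-IP helper (not the actual clientip.go):
// prefer X-Forwarded-For when the load balancer sets it, otherwise fall
// back to the TCP peer address.
package clientip

import (
	"net"
	"net/http"
	"net/netip"
	"strings"
)

// Get returns a best-effort client address for an incoming request.
func Get(r *http.Request) (netip.Addr, error) {
	if xff := r.Header.Get("X-Forwarded-For"); xff != "" {
		// The first entry is the original client as seen by the first proxy.
		first := strings.TrimSpace(strings.Split(xff, ",")[0])
		if addr, err := netip.ParseAddr(first); err == nil {
			return addr, nil
		}
	}
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		return netip.Addr{}, err
	}
	return netip.ParseAddr(host)
}
```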

please feel free to clone the repo and peek if you spot something!

Scanning the github-verse quickly, the 403 may be an attempt by some application firewall (Cloud Armor) to reject traffic from some tools they consider hostile?
https://github.com/mazen160/bfac/blob/18fb0b5dc05005d4f39c242609bbf2347ca0d421/bfac#L257-L259

(No, i have no clue what other strings may be considered in the same fashion!)

kwilczynski commented 2 months ago

[...]

it's a cloud run application running in google infra. While we do get the client IP, we do not try to parse User-Agent, you can see some of the code here: [...] Scanning the github-verse quickly, the 403 may be an attempt by some application firewall (Cloud Armor) to reject traffic from some tools they consider hostile? mazen160/bfac@18fb0b5/bfac#L257-L259

@dims, since the registry service itself is very simple, and we didn't expect it to be anything but, whatever blocks these requests is probably set up somewhere in the infrastructure that Google donates to run and support the registry itself.

You mentioned Cloud Armor. We were thinking that perhaps there is some sort of transparent proxy or WAF (Web Application Firewall) deployed somewhere, or even that the registry is fronted by Cloudflare or the like (which is also popular).

The IP address 34.96.108.209 we get back for registry.k8s.io, which resolves to the same address from different networks/locations, is within Google's 34.64.0.0/10 network. As such, I bet it's a WAF/Cloud Armor rule of sorts (and Cloud Armor is quite sophisticated) that is looking for the string "bfac" anywhere within the User-Agent value it gets as part of the request.
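
For reference, a quick check with Go's net/netip confirms the range membership; the values are simply the ones quoted above and in the whois output below.

```go
package main

import (
	"fmt"
	"net/netip"
)

func main() {
	// Google's 34.64.0.0/10 allocation and the address registry.k8s.io
	// resolved to in the curl output above.
	googleRange := netip.MustParsePrefix("34.64.0.0/10")
	registryAddr := netip.MustParseAddr("34.96.108.209")

	fmt.Println(googleRange.Contains(registryAddr)) // true
}
```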

whois for 34.96.108.209:

```console
NetRange:       34.64.0.0 - 34.127.255.255
CIDR:           34.64.0.0/10
NetName:        GOOGL-2
NetHandle:      NET-34-64-0-0-1
Parent:         NET34 (NET-34-0-0-0-0)
NetType:        Direct Allocation
OriginAS:
Organization:   Google LLC (GOOGL-2)
RegDate:        2018-09-28
Updated:        2018-09-28
Ref:            https://rdap.arin.net/registry/ip/34.64.0.0

OrgName:        Google LLC
OrgId:          GOOGL-2
Address:        1600 Amphitheatre Parkway
City:           Mountain View
StateProv:      CA
PostalCode:     94043
Country:        US
RegDate:        2006-09-29
Updated:        2019-11-01
Comment:        *** The IP addresses under this Org-ID are in use by Google Cloud customers ***
Comment:
Comment:        Direct all copyright and legal complaints to
Comment:        https://support.google.com/legal/go/report
Comment:
Comment:        Direct all spam and abuse complaints to
Comment:        https://support.google.com/code/go/gce_abuse_report
Comment:
Comment:        For fastest response, use the relevant forms above.
Comment:
Comment:        Complaints can also be sent to the GC Abuse desk
Comment:        (google-cloud-compliance@google.com)
Comment:        but may have longer turnaround times.
Comment:
Comment:        Complaints sent to any other POC will be ignored.
Ref:            https://rdap.arin.net/registry/entity/GOOGL-2

OrgAbuseHandle: GCABU-ARIN
OrgAbuseName:   GC Abuse
OrgAbusePhone:  +1-650-253-0000
OrgAbuseEmail:  google-cloud-compliance@google.com
OrgAbuseRef:    https://rdap.arin.net/registry/entity/GCABU-ARIN

OrgNOCHandle:   GCABU-ARIN
OrgNOCName:     GC Abuse
OrgNOCPhone:    +1-650-253-0000
OrgNOCEmail:    google-cloud-compliance@google.com
OrgNOCRef:      https://rdap.arin.net/registry/entity/GCABU-ARIN

OrgTechHandle:  ZG39-ARIN
OrgTechName:    Google LLC
OrgTechPhone:   +1-650-253-0000
OrgTechEmail:   arin-contact@google.com
OrgTechRef:     https://rdap.arin.net/registry/entity/ZG39-ARIN
```

Would you be able to verify the Cloud Armor configuration, just out of curiosity and to make sure that it is indeed the cause?

Regarding https://github.com/mazen160/bfac, the project has an option to randomly pick another User-Agent so it appears to be a popular browser, etc. As such, I am not sure how much "bad traffic" simply blocking "bfac" actually stops; perhaps not a lot.

BenTheElder commented 2 months ago

Would you be able to verify Cloud Armor configuration, just out of curiosity and to make sure it is indeed it?

We're using standard rules; the full configuration is open source:

https://registry.k8s.io => https://github.com/kubernetes/registry.k8s.io

The community deployment configs are documented in the k8s.io repo with the rest of the community infra deployments, but primarily here:

https://github.com/kubernetes/k8s.io/tree/main/infra/gcp/terraform/k8s-infra-oci-proxy-prod is the main deployment

The armor rules are here: https://github.com/kubernetes/k8s.io/blob/main/infra/gcp/terraform/modules/oci-proxy/cloud-armor.tf

BenTheElder commented 2 months ago

I'm not sure which ruleset contains this, but we can drop most of these.

We shouldn't disable Armor entirely, because we're using a custom policy for rate limiting, but most of these rule sets are probably irrelevant.

We can iterate on the staging instance (DO NOT depend on this endpoint, but for testing purposes we can iterate at registry-sandbox.k8s.io).

BenTheElder commented 2 months ago

The other complication: The main reason we've kept these WAF rules is actually to deny spammy vuln scanner noise at the edge.

We get a TON of noisy requests from automated scanning (... and from pull-through caches attempting to pull anything and everything), and any request we can deny at the load balancer saves the project funds versus letting it get through to the application we use to split valid requests between the different cloud storage endpoints ... funds we can use for CI etc. instead.

So, on balance, we'll still want to block known "attack" requests with the WAF, and it's much easier to use a pre-supplied ruleset than to develop and maintain our own.

kwilczynski commented 2 months ago

@BenTheElder, thank you for this. Appreciated.

The rule that is catching this specific User-Agent is most definitely this one:

  rule {
    action   = "deny(403)"
    priority = "920"
    match {
      expr {
        expression = "evaluatePreconfiguredWaf('scannerdetection-v33-stable', {'sensitivity': 1})"
      }
    }
    description = "Scanner detection"

    preview = false
  }

Per Google's own documentation:

This will lead us to the following:

This is where the OWASP folks decided to block some of the tools, including the "BFAC" project. However, some of the project names from that list do pass... so it's a bit puzzling which version of the ruleset Google is using exactly.

That said, I don't think there is anything that we could sensibly do here....

Some of the projects already use different User-Agent strings, often to mimic curl or popular browsers, so there is no helping it there either, sadly.

As such, we on the CRI-O side will strip the extra build and release information from the User-Agent, which should limit the possibility of running into some other combination of letters, like "bfac", that matches the WAF rules.
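
For illustration only (the actual CRI-O change may look different, and the regular expression here is an assumption), stripping the build metadata could amount to something like:

```go
package main

import (
	"fmt"
	"regexp"
)

// gitSuffix matches embedded build metadata such as ".gitbfac241" in a
// version like "1.27.8-2.rhaos4.14.gitbfac241.el9". The pattern is an
// assumption for this sketch, not the exact rule CRI-O adopted.
var gitSuffix = regexp.MustCompile(`\.git[0-9a-f]+`)

// userAgent builds a CRI-O-style User-Agent with the Git hash removed
// from the version string.
func userAgent(version, goVersion, osName, arch string) string {
	cleaned := gitSuffix.ReplaceAllString(version, "")
	return fmt.Sprintf("cri-o/%s go/%s os/%s arch/%s", cleaned, goVersion, osName, arch)
}

func main() {
	fmt.Println(userAgent("1.27.8-2.rhaos4.14.gitbfac241.el9", "go1.20.12", "linux", "amd64"))
	// cri-o/1.27.8-2.rhaos4.14.el9 go/go1.20.12 os/linux arch/amd64
}
```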

On the note of the WAF rules... I wish these were a bit tighter, such that they would match User-Agents more precisely, rather than just a specific word or letter combination anywhere within the entire header. However, it's faster this way and requires less maintenance over time, so it is what it is.

So, this is it, I suppose. Unless you have some more thoughts?

dims commented 2 months ago

we on the CRI-O side will strip the extra build and release information from the User-Agent

Yes please. That's it. (Rules are going to be constantly updated no matter what to keep up with new spam/bot crap)

BenTheElder commented 2 months ago

@dims I think containerd also includes git commit for pre-release builds, but that's maybe less concerning since tagged releases don't (I think??) ... we should probably take a look at how likely we are to run into this again with other common tools.

I don't love any of the options here. We could invest in custom rules, but I think it would take a lot of time and effort to maintain; at the moment this is pretty hands-off, and we're spending a lot of time on other sustainability areas.

kwilczynski commented 2 months ago

/cc @AkihiroSuda

So Suda-san can take a look at User-Agent in containerd.

kwilczynski commented 2 months ago

[...]

I don't love any of the options here. We could invest in custom rules, but I think it would take a lot of time and effort to maintain; at the moment this is pretty hands-off, and we're spending a lot of time on other sustainability areas.

@BenTheElder, yeah. Like I said, it would be a headache, indeed.

Protecting the registry, whichever way we can, takes precedence here. This goes without saying.

AkihiroSuda commented 2 months ago

/cc @AkihiroSuda

So Suda-san can take a look at User-Agent in containerd.

This was once discussed and rejected

BenTheElder commented 1 month ago

At least this appears to be limited to non-release-tagged versions? But that's still going to impact someone at some point.

I was thinking about this some more: I think we could actually write some pretty simple rules that just reject most garbage requests at the edge purely based on path, hope that's sufficient, and drop the standard WAF rules.
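
Purely to illustrate the idea (the actual change is a set of Cloud Armor rules in the WIP PR linked below, not application code, and the allowed prefixes here are assumptions), the filter amounts to something like:

```go
// Sketch of path-based filtering: registry clients only ever need the
// ping endpoint and /v2/ paths, so anything else can be rejected at the
// edge before it reaches the application.
package main

import (
	"net/http"
	"strings"
)

func allowPath(path string) bool {
	return path == "/" || path == "/v2" || strings.HasPrefix(path, "/v2/")
}

func filter(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !allowPath(r.URL.Path) {
			http.Error(w, "403 Forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Example: wrap a trivial handler with the path filter.
	handler := filter(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	_ = http.ListenAndServe(":8080", handler)
}
```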

WIP at https://github.com/kubernetes/k8s.io/pull/6969

It will be a little bit more annoying to support additional endpoints in the future, but that seems OK

BenTheElder commented 1 month ago

This is now deployed, though I can't make promises about the behavior of any leaky backend hosts we redirect to.

We're considering handling that differently but it would be more of a long term project.

I don't think anything we currently use would block requests purely based on header substrings anymore; only invalid request paths or excessive usage.