rancher / fleet

Deploy workloads from Git to large fleets of Kubernetes clusters
https://fleet.rancher.io/
Apache License 2.0

Helm repo stored in OCI is not able to be downloaded by fleet #2246

Open kloudwrangler opened 7 months ago

kloudwrangler commented 7 months ago

Is there an existing issue for this?

Current Behavior

I want to install the upstream logging operator using the following fleet.yaml:

defaultNamespace: cattle-logging-system
helm:
  releaseName: logging-operator
  chart: "oci://ghcr.io/kube-logging/helm-charts/logging-operator"
  version: 4.5.6

I then create a GitRepo with only this definition:

apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: logging-operator
  namespace: fleet-default
spec:
  repo: https://github.com/foo/bar.git
  branch: logging-operator
  clientSecretName: auth-xxxx
  insecureSkipTLSVerify: false
  paths:
    - logging-operator/chart

Fleet fails to create a bundle from this; the gitjob logs show the following error:

time="2024-03-20T13:58:18Z" level=info msg="Deleting failed job to trigger retry fleet-default/logging-operator-a494d due to: time=\"2024-03-20T13:58:15Z\" level=fatal msg=\"Helm chart download: failed to do request: Head \\\"https://ghcr.io/v2/kube-logging/helm-charts/logging-operator/manifests/4.5.6\\\": dial tcp 4.208.26.196:443: i/o timeout\"\n"

As you can see from the log, it's ignoring the oci:// scheme and treating the address as a regular Helm repo.

Expected Behavior

I would expect the oci:// address to be preserved in the log and for Fleet to actually try to download the OCI chart.

Steps To Reproduce

Install the upstream logging operator using the following fleet.yaml:

defaultNamespace: cattle-logging-system
helm:
  releaseName: logging-operator
  chart: "oci://ghcr.io/kube-logging/helm-charts/logging-operator"
  version: 4.5.6

Then create a GitRepo with only this definition:

apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: logging-operator
  namespace: fleet-default
spec:
  repo: https://github.com/foo/bar.git
  branch: logging-operator
  clientSecretName: auth-xxxx
  insecureSkipTLSVerify: false
  paths:
    - logging-operator/chart

Environment

- Architecture: x86-64
- Fleet Version: 0.9.0
- Cluster:
  - Provider: RKE2
  - Options: 3 management nodes, 4 worker nodes
  - Kubernetes Version: Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.9+rke2r1", GitCommit:"d1483fdf7a0578c83523bc1e2212a606a44fd71d", GitTreeState:"clean", BuildDate:"2023-09-13T20:34:35Z", GoVersion:"go1.20.8 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}

Logs

No response

Anything else?

No response

bigkevmcd commented 7 months ago

@kloudwrangler

Looking at the log message:

time="2024-03-20T13:58:18Z" level=info msg="Deleting failed job to trigger retry fleet-default/logging-operator-a494d due to: time=\"2024-03-20T13:58:15Z\" level=fatal msg=\"Helm chart download: failed to do request: Head \\\"https://ghcr.io/v2/kube-logging/helm-charts/logging-operator/manifests/4.5.6\\\": dial tcp 4.208.26.196:443: i/o timeout\"\n"

While the chart URL is oci://, the API requests are still made via HTTP, and the URL looks correct.

I note the dial tcp 4.208.26.196:443: i/o timeout part of the log. Is there any reason your Fleet installation wouldn't be able to reach ghcr.io?

kloudwrangler commented 7 months ago

While the chart URL is oci://, the API requests are still made via HTTP, and the URL looks correct.

Thanks for this info @bigkevmcd, as I can now concentrate on the real issue. I do have an HTTP proxy that Fleet might be ignoring. I would like to keep this ticket open for a few days to confirm whether this is in fact operator error.

kloudwrangler commented 7 months ago

Okay, I got more information. As I mentioned before, I have an HTTP proxy, and I also have a Harbor instance from which I am able to pull the OCI Helm chart manually. I can even install this chart normally through helm install using the Harbor OCI address, which is "oci://<harbor-site>/ghcr-proxy/kube-logging/helm-charts/logging-operator"

However, when I try to do this with Fleet by changing the chart to "oci://<harbor-site>/ghcr-proxy/kube-logging/helm-charts/logging-operator", it gives me the following:

time="2024-03-21T14:00:07Z" level=fatal msg="Helm chart download: failed to copy: httpReadSeeker: failed open: failed to do request: Get \"https://rancher-harbor.s3.eu-west-1.amazonaws.com/docker/registry/v2/blobs/sha256/00/008a6a5af64de4d20014eeb03e613109246cb8fb60943365469ebc96ef4934d6/data?X-Amz-Algorithm=AWS4xxxxxxx": dial tcp 52.218.30.128:443: i/o timeout"

I guess what is happening is that it is changing the address to some kind of Rancher default address. I am wondering if this is normal, and whether I can add my Harbor instance as a repo or something.
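
For reference, this is roughly what the fleet.yaml looks like after the change, i.e. the same file as in the original report with only the chart field swapped for the Harbor address (hostname redacted as above):

# Sketch only: original fleet.yaml with the chart pointing at the Harbor
# pull-through proxy project instead of ghcr.io.
defaultNamespace: cattle-logging-system
helm:
  releaseName: logging-operator
  chart: "oci://<harbor-site>/ghcr-proxy/kube-logging/helm-charts/logging-operator"
  version: 4.5.6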

bigkevmcd commented 7 months ago

It's not clear to me how this works; I'm guessing that your proxy intercepts the /ghcr-proxy/ request?

Have you configured fleet for use behind a proxy? https://ranchermanager.docs.rancher.com/v2.8/integrations-in-rancher/fleet/use-fleet-behind-a-proxy
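
If not, the linked page boils down to passing proxy settings when installing the fleet chart. A minimal sketch, assuming the usual proxy/noProxy chart values (the proxy address is a placeholder, and the exact value keys may differ between Fleet chart versions):

# Sketch only: proxy values supplied to the fleet chart install, per the linked docs.
# proxy.example.com is a placeholder, not a real endpoint.
proxy: "http://proxy.example.com:3128"
noProxy: "127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local"

If the values propagate as expected, they end up as HTTP_PROXY/HTTPS_PROXY/NO_PROXY environment variables on the Fleet controller and gitjob pods.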

bigkevmcd commented 7 months ago

The URL that it's requesting looks like the standard OCI API endpoint for getting an artifact.

https://github.com/opencontainers/distribution-spec/blob/main/spec.md#pulling-blobs

Fleet uses the Helm packages to fetch the artifacts, so it should work just as Helm does, which is why I suspect a proxy misconfiguration somewhere.

pstefka commented 6 months ago

@bigkevmcd

Hey, we have a much simpler setup, but probably hit the same problem:

All the other Helm charts (from non-OCI repositories) are working fine; only two, the grafana operator and the kafka operator, which use OCI, don't work:

fleet time="2024-04-08T12:45:07Z" level=fatal msg="Helm chart download: failed to do request: Head \"https://ghcr.io/v2/grafana/helm-charts/grafana-operator/manifests/v5.7.0\": dial tcp 140.82.121.33:443: i/o timeout"

As can be seen, the gitjob pod is trying a direct connection instead of going through the proxy, even though the HTTP_PROXY, HTTPS_PROXY, and NO_PROXY environment variables are set.

Strangely, when using Rancher 2.7.9 / fleet 0.8.x, the OCI charts were installed correctly behind the proxy.
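
To illustrate (the address below is a placeholder, not our real proxy), the env block on the gitjob container looks roughly like this, yet the chart download still bypasses the proxy:

# Illustrative sketch of the gitjob container environment; values are placeholders.
env:
  - name: HTTP_PROXY
    value: http://proxy.example.com:3128
  - name: HTTPS_PROXY
    value: http://proxy.example.com:3128
  - name: NO_PROXY
    value: 127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local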

manno commented 6 months ago

Proxy support was fixed for 2.8.3. https://github.com/rancher/fleet/issues/2000

It seems the OCI downloader is separate. I think the following upstream issue applies: https://github.com/helm/helm/issues/12770

You probably can't change the proxy env vars to lowercase, since Fleet sets them.

This could be fixed by bumping the Helm SDK to 3.14.4, which is due out tomorrow.

pstefka commented 6 months ago

Although not mentioned in the original comment, we created a copy of the gitjob pod with the lowercase proxy environment variables present, but that didn't help.

It seems like a regression, as with fleet 0.8.x the OCI charts work flawlessly through the proxy (with no direct connectivity).