pulumi / pulumi-kubernetes

A Pulumi resource provider for Kubernetes to manage API resources and workloads in running clusters
https://www.pulumi.com/docs/reference/clouds/kubernetes/
Apache License 2.0

helm Release tries to call non-existent resources #2481

Closed: cbley-da closed this issue 9 months ago

cbley-da commented 1 year ago

What happened?

We have some helm charts that we use with pulumi.

When trying to install the releases to our GKE cluster, we sometimes (not reliably) receive a "the server could not find the requested resource" error.

Looking at the messages in the Google Cloud Logs Explorer, we can see that it tried to call a method named io.k8s.core.v1.virtualservices.create:

{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "authenticationInfo": {
    },
    "authorizationInfo": [
      {
        "granted": true,
        "permission": "io.k8s.core.v1.virtualservices.create",
        "resource": "core/v1/namespaces/sv-2/virtualservices"
      }
    ],
    "methodName": "io.k8s.core.v1.virtualservices.create",
    "requestMetadata": {
      "callerIp": "",
      "callerSuppliedUserAgent": "Go-http-client/2.0"
    },
    "resourceName": "core/v1/namespaces/sv-2/virtualservices",
    "response": {
      "@type": "core.k8s.io/v1.Status",
      "apiVersion": "v1",
      "code": 404,
      "details": {},
      "kind": "Status",
      "message": "the server could not find the requested resource",
      "metadata": {},
      "reason": "NotFound",
      "status": "Failure"
    },
    "serviceName": "k8s.io",
    "status": {
      "code": 5,
      "message": "the server could not find the requested resource"
    }
  },
  "insertId": "...",
  "resource": {
    "type": "k8s_cluster",
    "labels": {
      "cluster_name": "...",
      "location": "us-central1",
      "project_id": "..."
    }
  },
  "timestamp": "2023-06-27T09:26:24.949679Z",
  "labels": {
    "authorization.k8s.io/reason": "access granted by IAM permissions.",
    "authorization.k8s.io/decision": "allow"
  },
  "logName": "...",
  "operation": {
    "id": "...",
    "producer": "k8s.io",
    "first": true,
    "last": true
  },
  "receiveTimestamp": "2023-06-27T09:26:29.744157Z"
}

We have several such entries for that method in our logs over the past 14 days, but all of them fail with error code 5, "the server could not find the requested resource".

But we do have calls to the method io.istio.networking.v1alpha3.virtualservices.create which return successfully. This looks to me like some kind of race condition that combines the wrong API group prefix with the resource name and operation.

Calling pulumi up again usually gets rid of the problem (at least after some retries).

We also have more such method calls that fail in the same way, such as io.k8s.core.v1.deployments.create (which maybe should have been calls to io.k8s.apps.v1.deployments.create instead).

Expected Behavior

There should be no errors and pulumi up should work reliably.

Steps to reproduce

This is a bit hard to reproduce. I'll add more details or a minimal reproducer when possible.

Output of pulumi about

CLI
Version      3.72.1
Go Version   go1.20.5
Go Compiler  gc

Plugins
NAME    VERSION
nodejs  unknown

Host
OS       nixos
Version  23.05 (Stoat)
Arch     x86_64

This project is written in nodejs: executable='/nix/store/p0f8i04zwf1dd66n2qkazk5x0fbsy7mp-nodejs-18.16.1/bin/node' version='v18.16.1'

Backend
Name           x1
URL            gs://...
User           claudio
Organizations

Additional context

No response

Contributing

Vote on this issue by adding a šŸ‘ reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

rquitales commented 1 year ago

Hi @cbley-da,

Thank you for the detailed exploration with Cloud Explorer and for providing this bug report. I haven't come across this issue before, so it's concerning that the GKNN for resources is incorrect. I've conducted a quick scan through our codebase, but I couldn't identify any obvious code path that might trigger this behavior.

While I continue to investigate further, it would be incredibly helpful if you could provide a repro for this issue. It seems like you are attempting to use Helm to install either Istio or Anthos Service Mesh. Could you also re-run pulumi about within the Pulumi project directory? This will provide us with more information about the plugins and their versions being used in your Pulumi program. It would be valuable to know which version of the Kubernetes provider you are using.

Thank you once again for bringing this to our attention, and we appreciate your cooperation in helping us resolve this matter.

cbley-da commented 1 year ago

Hi @rquitales,

thank you for the quick response!

> While I continue to investigate further, it would be incredibly helpful if you could provide a repro for this issue.

Trying this now, but maybe it already helps if I describe what we do:

We have multiple Helm charts that we want to install. One of them indeed includes Istio. We install multiple instances of the same chart via a Helm Release by passing different values, e.g.:

function installHelm(
  chartName: string,
  name: string,
  nsName: pulumi.Output<string>,
  values: ChartValues = {},
  dependsOn: (pulumi.Resource | pulumi.Output<pulumi.Resource>)[] = []
): k8s.helm.v3.Release {
  return new k8s.helm.v3.Release(
    `helm-${prefix}-${name}`,
    {
      name,
      namespace: nsName,
      chart: process.env.REPO_ROOT + '/cluster/helm/' + chartName + '/',
      values: cnChartValues(chartName, values),
      timeout: HELM_CHART_TIMEOUT_SEC,
    },
    {
      dependsOn,
    }
  );
}

installHelm(chart, name1, ns1, { /* ... */ });
installHelm(chart, name2, ns2, { /* ... */ });
installHelm(chart, name3, ns3, { /* ... */ });
installHelm(chart, name4, ns4, { /* ... */ });

These are installed in different namespaces and with unique names of course.

> Could you also re-run pulumi about within the Pulumi project directory? This will provide us with more information about the plugins and their versions being used in your Pulumi program.

I already did (see output above). We are using pulumi from nixpkgs, and somehow pulumi about does not report the plugin versions in this case.

Also, we are using npm workspaces, i.e. we have a directory layout like this:

pulumi/
  project1/
     package.json
     src/
        index.ts
  project2/
     package.json
     src/
        index.ts
  ...
  package.json
  package-lock.json

So the package-lock.json file is located in the workspace directory, not beside the package.json for each of the projects.

Here's the output of npm ls --include-workspace-root instead:

npm ls --depth=0 --include-workspace-root
project-pulumi-deployment@1.0.0 /home/claudio/project/cluster/pulumi
ā”œā”€ā”€ @pulumi/gcp@v6.50.0
ā”œā”€ā”€ @pulumi/kubernetes-cert-manager@v0.0.5
ā”œā”€ā”€ @pulumi/pulumi@3.72.0
ā”œā”€ā”€ @trivago/prettier-plugin-sort-imports@3.4.0
ā”œā”€ā”€ @types/js-yaml@4.0.5
ā”œā”€ā”€ @types/lodash@4.14.191
ā”œā”€ā”€ @types/node@14.18.36
ā”œā”€ā”€ @typescript-eslint/eslint-plugin@5.54.0
ā”œā”€ā”€ @typescript-eslint/parser@5.54.0
ā”œā”€ā”¬ subproject1-pulumi-deployment@1.0.0 -> ./subproject1
ā”‚ ā”œā”€ā”€ @kubernetes/client-node@0.18.1
ā”‚ ā”œā”€ā”€ @pulumi/random@v4.13.2
ā”‚ ā”œā”€ā”€ @types/auth0@3.3.2
ā”‚ ā”œā”€ā”€ @types/sinon@10.0.15
ā”‚ ā”œā”€ā”€ auth0@3.4.0
ā”‚ ā”œā”€ā”¬ project-pulumi-common@1.0.0 -> ./common
ā”‚ ā”‚ ā”œā”€ā”€ @kubernetes/client-node@0.18.1 deduped
ā”‚ ā”‚ ā”œā”€ā”€ @pulumi/kubernetes@v3.29.1
ā”‚ ā”‚ ā”œā”€ā”€ @types/auth0@3.3.2 deduped
ā”‚ ā”‚ ā”œā”€ā”€ @types/sinon@10.0.15 deduped
ā”‚ ā”‚ ā”œā”€ā”€ auth0@3.4.0 deduped
ā”‚ ā”‚ ā””ā”€ā”€ sinon@15.0.4 deduped
ā”‚ ā””ā”€ā”€ sinon@15.0.4
ā”œā”€ā”€ eslint-config-prettier@8.6.0
ā”œā”€ā”€ eslint-plugin-import@2.27.5
ā”œā”€ā”€ eslint@8.35.0
ā”œā”€ā”€ js-yaml@4.1.0
ā”œā”€ā”€ lodash@4.17.21
ā”œā”€ā”€ node-fetch@2.6.9
ā”œā”€ā”€ prettier@2.8.4
ā””ā”€ā”€ typescript@4.9.5

Also, we are using the following language and resource plugins:

https://get.pulumi.com/releases/sdk/pulumi-v3.72.1-linux-x64.tar.gz
https://api.pulumi.com/releases/plugins/pulumi-resource-auth0-v2.21.0-linux-amd64.tar.gz
https://api.pulumi.com/releases/plugins/pulumi-resource-gcp-v6.58.0-linux-amd64.tar.gz
https://api.pulumi.com/releases/plugins/pulumi-resource-google-native-v0.31.0-linux-amd64.tar.gz
https://api.pulumi.com/releases/plugins/pulumi-resource-kubernetes-v3.29.1-linux-amd64.tar.gz
https://api.pulumi.com/releases/plugins/pulumi-resource-postgresql-v3.8.0-linux-amd64.tar.gz
https://api.pulumi.com/releases/plugins/pulumi-resource-random-v4.13.2-linux-amd64.tar.gz
https://api.pulumi.com/releases/plugins/pulumi-resource-tls-v4.10.0-linux-amd64.tar.gz
https://api.pulumi.com/releases/plugins/pulumi-resource-vault-v5.11.0-linux-amd64.tar.gz

> It would be valuable to know which version of the Kubernetes provider you are using.

That should be the version v3.29.1 above, right? Or were you asking for something different?

cbley-da commented 1 year ago

I have also tried updating all resource plugins to the latest version (especially pulumi-kubernetes to version 3.30.1), but that made no difference.

I also recompiled pulumi-kubernetes, adding:

        _ = r.host.Log(ctx, diag.Warning, urn, fmt.Sprintf("manifest: %v", rel.Manifest))

right before the warning is printed:

        _ = r.host.Log(ctx, diag.Warning, urn, fmt.Sprintf("Helm release %q was created but has a failed status. Use the `helm` command to investigate the error, correct it, then retry. Reason: %v", client.ReleaseName, err))

Running pulumi up again printed the manifest, but it looks completely valid.

This seems to indicate that the error happens further down the code path, in whatever tries to apply the manifest... WDYT?

cbley-da commented 1 year ago

Here is a simple manifest that failed with such an error:

---
# Source: postgres/templates/secrets.yaml
apiVersion: v1
kind: Secret
metadata:
  name: "postgres-secrets"
  namespace: ns-1
type: Opaque
data:
  postgresPassword: ***
---
# Source: postgres/templates/postgres.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: "postgres-configuration"
  namespace: ns-1
data:
  PGDATA: "/var/lib/postgresql/data/pgdata"
  POSTGRES_DB: "testdb"
  POSTGRES_USER: "***"
---
# Source: postgres/templates/postgres.yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: ns-1
spec:
  ports:
  - name: postgresdb
    port: 5432
    protocol: TCP
  selector:
    app: postgres
---
# Source: postgres/templates/postgres.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: ns-1
  labels:
    app: postgres
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
        namespace: ns-1
    spec:
      containers:
      - name: postgres
        image: postgres:14
        imagePullPolicy: IfNotPresent
        args: ["-c", "max_connections=300"]
        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: "postgres-secrets"
              key: postgresPassword
        envFrom:
        - configMapRef:
            name: "postgres-configuration"
        livenessProbe:
          exec:
            command:
            - psql
            - -U
            - cnadmin
            - -d
            - template1
            - -c
            - SELECT 1
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        ports:
          - containerPort: 5432
            name: postgresdb
            protocol: TCP
        resources:
          limits:
            cpu: "2"
            memory: 8Gi
          requests:
            cpu: "2"
            memory: 8Gi
        volumeMounts:
          - mountPath: /var/lib/postgresql/data
            name: pg-data
      restartPolicy: Always
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: pg-data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 60Gi
      storageClassName: standard-rwo
      volumeMode: Filesystem

Note that this manifest does not even include any Istio resources, yet we see this error:

error: 1 error occurred:
        * Helm release "ns-1/postgres" was created, but failed to initialize completely. Use Helm CLI to investigate.: failed to become available within allocated timeout. Error: Helm Release ns-1/postgres: 1 error occurred:
        * the server could not find the requested resource (post secrets.networking.istio.io)

The log has:

{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "authenticationInfo": {
      "principalEmail": "***"
    },
    "authorizationInfo": [
      {
        "granted": true,
        "permission": "io.istio.networking.v1alpha3.secrets.create",
        "resource": "networking.istio.io/v1alpha3/namespaces/sv-1/secrets"
      }
    ],
    "methodName": "io.istio.networking.v1alpha3.secrets.create",
    "requestMetadata": {
      "callerIp": "***",
      "callerSuppliedUserAgent": "Go-http-client/2.0"
    },
    "resourceName": "networking.istio.io/v1alpha3/namespaces/sv-1/secrets",
    "serviceName": "k8s.io",
    "status": {
      "code": 5,
      "message": "Not Found"
    }
  },
  "insertId": "***",
  "resource": {
    "type": "k8s_cluster",
    "labels": {
      "project_id": "***",
      "cluster_name": "***",
      "location": "us-central1"
    }
  },
  "timestamp": "2023-07-05T09:28:19.357551Z",
  "labels": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "access granted by IAM permissions."
  },
  "logName": "projects/da-cn-scratchnet/logs/cloudaudit.googleapis.com%2Factivity",
  "operation": {
    "id": "***",
    "producer": "k8s.io",
    "first": true,
    "last": true
  },
  "receiveTimestamp": "2023-07-05T09:28:39.732455336Z"
}

cbley-da commented 1 year ago

After some more debugging 😓 (I could hardly run pulumi up successfully on my local machine at all; retrying just failed with another error), I think I have pinpointed the problem...

It's the reuse of the config in https://github.com/kubernetes/cli-runtime/blob/807b4689df02de0db3d6191ee6bca07d6a685b54/pkg/resource/client.go#L34 and https://github.com/kubernetes/cli-runtime/blob/807b4689df02de0db3d6191ee6bca07d6a685b54/pkg/resource/client.go#L50: since clientConfigFn() returns a pointer to a rest.Config, the same config can be returned for concurrent actions (probably related to the usePersistentConfig setting, which is enabled by default, AFAICS).

I have added cfg = rest.CopyConfig(cfg) statements to both of these functions and haven't seen the problem again.
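
To make the failure mode concrete, here is a minimal self-contained Go sketch (not the actual cli-runtime code; clientConfigFn and configForGroupVersion are simplified stand-ins for the functions in the linked client.go). Two goroutines fetch a config for different group/versions; because they share one *rest.Config pointer, one goroutine can overwrite GroupVersion/APIPath before the other builds its request, which is exactly how a core/v1 call for virtualservices could be produced:

package main

import (
	"fmt"
	"sync"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/rest"
)

// shared mimics the persistent config that clientConfigFn() hands out.
var shared = &rest.Config{Host: "https://example-cluster"}

func clientConfigFn() (*rest.Config, error) { return shared, nil }

// configForGroupVersion is a stand-in for what cli-runtime's client.go does:
// it mutates the config it was given, per target group/version.
func configForGroupVersion(gv schema.GroupVersion) *rest.Config {
	cfg, _ := clientConfigFn()
	// The workaround described above: work on a private copy instead of
	// the shared pointer. Uncomment to remove the race.
	// cfg = rest.CopyConfig(cfg)
	cfg.GroupVersion = &gv
	if gv.Group == "" {
		cfg.APIPath = "/api" // core group (Secrets, Services, ...)
	} else {
		cfg.APIPath = "/apis" // named groups (networking.istio.io, ...)
	}
	return cfg
}

func main() {
	gvs := []schema.GroupVersion{
		{Group: "", Version: "v1"},
		{Group: "networking.istio.io", Version: "v1alpha3"},
	}
	var wg sync.WaitGroup
	for _, gv := range gvs {
		gv := gv
		wg.Add(1)
		go func() {
			defer wg.Done()
			cfg := configForGroupVersion(gv)
			// With the shared pointer, the other goroutine may have already
			// overwritten GroupVersion/APIPath by the time we read them here,
			// e.g. mixing the core /api path with an Istio resource.
			fmt.Printf("building request against %s %s\n", cfg.APIPath, cfg.GroupVersion)
		}()
	}
	wg.Wait()
}

Running this with go run -race flags the data race on the shared config; with the rest.CopyConfig line uncommented, each caller gets an independent copy and the race disappears.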

I'll report this to the kubernetes folks.

pdf commented 11 months ago

Thank you for tracking down the cause, @cbley-da.

However, isn't this largely a result of this repo returning the same config pointer for every call? Couldn't we solve this here by doing the following?

diff --git a/provider/pkg/provider/kubeconfig.go b/provider/pkg/provider/kubeconfig.go
index 2dda299f6..06ac09f9b 100644
--- a/provider/pkg/provider/kubeconfig.go
+++ b/provider/pkg/provider/kubeconfig.go
@@ -29,7 +29,7 @@ func (k *KubeConfig) ToDiscoveryClient() (discovery.CachedDiscoveryInterface, er

 // ToRESTConfig implemented interface method
 func (k *KubeConfig) ToRESTConfig() (*rest.Config, error) {
-       return k.restConfig, nil
+       return rest.CopyConfig(k.restConfig), nil
 }

 // ToRESTMapper implemented interface method

That ought to solve the problem for all current and future methods that might be added on the ClientConfigFunc type, without deferring to any requirements that upstream might have.
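
For what it's worth, here is a hypothetical regression test sketch for that diff (assuming the KubeConfig type and its unexported restConfig field exactly as shown above, placed in the same provider package). It fails with the current pointer-returning implementation and passes with the copy:

package provider

import (
	"testing"

	"k8s.io/client-go/rest"
)

// Every call to ToRESTConfig must return an independent *rest.Config, so
// per-request mutations (as done by cli-runtime) cannot leak between callers.
func TestToRESTConfigReturnsCopies(t *testing.T) {
	k := &KubeConfig{restConfig: &rest.Config{Host: "https://example"}}

	a, err := k.ToRESTConfig()
	if err != nil {
		t.Fatal(err)
	}
	b, err := k.ToRESTConfig()
	if err != nil {
		t.Fatal(err)
	}
	if a == b {
		t.Fatal("ToRESTConfig returned the same *rest.Config pointer twice")
	}

	a.APIPath = "/apis" // simulate cli-runtime mutating the config per group/version
	if b.APIPath == "/apis" {
		t.Fatal("mutating one returned config leaked into the other")
	}
}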

Thoughts, @guineveresaenger?

ghost commented 9 months ago

Does this mean that this is purely a pulumi-kubernetes issue, and that the upstream Kubernetes client PR https://github.com/kubernetes/kubernetes/pull/119199 is somehow no longer needed?

pdf commented 9 months ago

@davidd-da this solves the same problem locally, without requiring a change upstream.

I'm not entirely certain that the upstream change is necessary, since implementors of ClientConfigFunc like pulumi-kubernetes can decide whether or not they re-use the same config pointer across requests. I suppose it might remove a potential footgun, though some documentation that mentions the pitfall might achieve a similar result.
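
To illustrate, the pitfall could be documented on the upstream type itself, something like this (hypothetical doc-comment wording; the ClientConfigFunc signature is the existing one from cli-runtime's pkg/resource):

// ClientConfigFunc returns a rest.Config used to build clients.
//
// Pitfall: callers in this package mutate the returned config (for example,
// GroupVersion and APIPath are set per resource mapping). Implementations
// that hand out a shared *rest.Config across concurrent callers should
// return a copy, e.g. via rest.CopyConfig, or concurrent requests may be
// built against the wrong group/version.
type ClientConfigFunc func() (*rest.Config, error)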

I don't know enough about the space to know whether some use-case might find it desirable for the config to mutate - I doubt it, but if such a case does exist the upstream change would be breaking.

ghost commented 9 months ago

Is it possible to release 4.5.6 with this fix? I am not sure what the usual cycle is.