Closed cbley-da closed 9 months ago
Hi @cbley-da,
Thank you for the detailed exploration with Cloud Explorer and for providing this bug report. I haven't come across this issue before, so it's concerning that the GVK for resources is incorrect. I've done a quick scan through our codebase, but I couldn't identify any obvious code path that might trigger this behavior.
While I continue to investigate, it would be incredibly helpful if you could provide a repro for this issue. It seems you are attempting to use Helm to install either Istio or Anthos Service Mesh. Could you also re-run `pulumi about` within the Pulumi project directory? This will give us more information about the plugins and their versions being used in your Pulumi program. It would be valuable to know which version of the Kubernetes provider you are using.
Thank you once again for bringing this to our attention, and we appreciate your cooperation in helping us resolve this matter.
Hi @rquitales,
thank you for the quick response!
> While I continue to investigate further, it would be incredibly helpful if you could provide a repro for this issue.
Trying this now, but maybe it already helps if I describe what we do:
We have multiple Helm charts that we want to install. One of them indeed includes Istio. We install multiple instances of the same chart via a Helm Release by passing different values, e.g.:
```typescript
function installHelm(
  chartName: string,
  name: string,
  nsName: pulumi.Output<string>,
  values: ChartValues = {},
  dependsOn: (pulumi.Resource | pulumi.Output<pulumi.Resource>)[] = []
): pulumi.ProviderResource {
  return new k8s.helm.v3.Release(
    `helm-${prefix}-${name}`,
    {
      name,
      namespace: nsName,
      chart: process.env.REPO_ROOT + '/cluster/helm/' + chartName + '/',
      values: cnChartValues(chartName, values),
      timeout: HELM_CHART_TIMEOUT_SEC,
    },
    {
      dependsOn,
    }
  );
}
```
```typescript
installHelm(chart, name1, ns1, { /* ... */ });
installHelm(chart, name2, ns2, { /* ... */ });
installHelm(chart, name3, ns3, { /* ... */ });
installHelm(chart, name4, ns4, { /* ... */ });
```
These are installed in different namespaces and with unique names of course.
> Could you also re-run `pulumi about` within the Pulumi project directory? This will provide us with more information about the plugins and their versions being used in your Pulumi program.
I already did (see output above). We are using pulumi from nixpkgs, and somehow `pulumi about` does not work in this case.
Also, we are using npm workspaces, i.e. we have a directory layout like this:
```
pulumi/
  project1/
    package.json
    src/
      index.ts
  project2/
    package.json
    src/
      index.ts
  ...
  package.json
  package-lock.json
```
So the package-lock.json file is located in the workspace directory, not beside the package.json for each of the projects.
Here's the output of `npm ls --depth=0 --include-workspace-root` instead:

```
npm ls --depth=0 --include-workspace-root
project-pulumi-deployment@1.0.0 /home/claudio/project/cluster/pulumi
├── @pulumi/gcp@v6.50.0
├── @pulumi/kubernetes-cert-manager@v0.0.5
├── @pulumi/pulumi@3.72.0
├── @trivago/prettier-plugin-sort-imports@3.4.0
├── @types/js-yaml@4.0.5
├── @types/lodash@4.14.191
├── @types/node@14.18.36
├── @typescript-eslint/eslint-plugin@5.54.0
├── @typescript-eslint/parser@5.54.0
├─┬ subproject1-pulumi-deployment@1.0.0 -> ./subproject1
│ ├── @kubernetes/client-node@0.18.1
│ ├── @pulumi/random@v4.13.2
│ ├── @types/auth0@3.3.2
│ ├── @types/sinon@10.0.15
│ ├── auth0@3.4.0
│ ├─┬ project-pulumi-common@1.0.0 -> ./common
│ │ ├── @kubernetes/client-node@0.18.1 deduped
│ │ ├── @pulumi/kubernetes@v3.29.1
│ │ ├── @types/auth0@3.3.2 deduped
│ │ ├── @types/sinon@10.0.15 deduped
│ │ ├── auth0@3.4.0 deduped
│ │ └── sinon@15.0.4 deduped
│ └── sinon@15.0.4
├── eslint-config-prettier@8.6.0
├── eslint-plugin-import@2.27.5
├── eslint@8.35.0
├── js-yaml@4.1.0
├── lodash@4.17.21
├── node-fetch@2.6.9
├── prettier@2.8.4
└── typescript@4.9.5
```
Also, we are using the following language and resource plugins:
https://get.pulumi.com/releases/sdk/pulumi-v3.72.1-linux-x64.tar.gz
https://api.pulumi.com/releases/plugins/pulumi-resource-auth0-v2.21.0-linux-amd64.tar.gz
https://api.pulumi.com/releases/plugins/pulumi-resource-gcp-v6.58.0-linux-amd64.tar.gz
https://api.pulumi.com/releases/plugins/pulumi-resource-google-native-v0.31.0-linux-amd64.tar.gz
https://api.pulumi.com/releases/plugins/pulumi-resource-kubernetes-v3.29.1-linux-amd64.tar.gz
https://api.pulumi.com/releases/plugins/pulumi-resource-postgresql-v3.8.0-linux-amd64.tar.gz
https://api.pulumi.com/releases/plugins/pulumi-resource-random-v4.13.2-linux-amd64.tar.gz
https://api.pulumi.com/releases/plugins/pulumi-resource-tls-v4.10.0-linux-amd64.tar.gz
https://api.pulumi.com/releases/plugins/pulumi-resource-vault-v5.11.0-linux-amd64.tar.gz
> It would be valuable to know which version of the Kubernetes provider you are using.
That should be the version v3.29.1 above, right? Or were you asking for something different?
I have also tried to update all resource plugins to the latest version (especially pulumi-kubernetes to version 3.30.1) but that made no difference.
I also recompiled pulumi-kubernetes, adding:

```go
_ = r.host.Log(ctx, diag.Warning, urn, fmt.Sprintf("manifest: %v", rel.Manifest))
```
right before the point where this warning is printed:

```go
_ = r.host.Log(ctx, diag.Warning, urn, fmt.Sprintf("Helm release %q was created but has a failed status. Use the `helm` command to investigate the error, correct it, then retry. Reason: %v", client.ReleaseName, err))
```
Running `pulumi up` again, it printed the manifest, which looks completely valid. This seems to indicate that the error happens further down the code path, in whatever tries to realize the manifest... WDYT?
Here is a simple manifest that failed with such an error:
```yaml
---
# Source: postgres/templates/secrets.yaml
apiVersion: v1
kind: Secret
metadata:
  name: "postgres-secrets"
  namespace: ns-1
type: Opaque
data:
  postgresPassword: ***
---
# Source: postgres/templates/postgres.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: "postgres-configuration"
  namespace: ns-1
data:
  PGDATA: "/var/lib/postgresql/data/pgdata"
  POSTGRES_DB: "testdb"
  POSTGRES_USER: "***"
---
# Source: postgres/templates/postgres.yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: ns-1
spec:
  ports:
    - name: postgresdb
      port: 5432
      protocol: TCP
  selector:
    app: postgres
---
# Source: postgres/templates/postgres.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: ns-1
  labels:
    app: postgres
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
      namespace: ns-1
    spec:
      containers:
        - name: postgres
          image: postgres:14
          imagePullPolicy: IfNotPresent
          args: ["-c", "max_connections=300"]
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: "postgres-secrets"
                  key: postgresPassword
          envFrom:
            - configMapRef:
                name: "postgres-configuration"
          livenessProbe:
            exec:
              command:
                - psql
                - -U
                - cnadmin
                - -d
                - template1
                - -c
                - SELECT 1
            failureThreshold: 3
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          ports:
            - containerPort: 5432
              name: postgresdb
              protocol: TCP
          resources:
            limits:
              cpu: "2"
              memory: 8Gi
            requests:
              cpu: "2"
              memory: 8Gi
          volumeMounts:
            - mountPath: /var/lib/postgresql/data
              name: pg-data
      restartPolicy: Always
  volumeClaimTemplates:
    - apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: pg-data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 60Gi
        storageClassName: standard-rwo
        volumeMode: Filesystem
```
Note: this manifest does not even include any Istio resources, yet we see this error:
```
error: 1 error occurred:
	* Helm release "ns-1/postgres" was created, but failed to initialize completely. Use Helm CLI to investigate.: failed to become available within allocated timeout. Error: Helm Release ns-1/postgres: 1 error occurred:
	* the server could not find the requested resource (post secrets.networking.istio.io)
```
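For illustration only (this is a hypothetical stand-in, not the provider's or client-go's actual code), the bogus endpoint can be understood by sketching how a REST path is assembled from a group/version plus a resource name: if a concurrently mutated config supplies the wrong group, a core-v1 Secret ends up being posted under the Istio API group.

```go
package main

import "fmt"

// buildPostPath loosely mirrors how a Kubernetes request URL is derived
// from a group, version, namespace, and resource. Illustrative only.
func buildPostPath(group, version, namespace, resource string) string {
	if group == "" { // the core ("legacy") API group lives under /api
		return fmt.Sprintf("/api/%s/namespaces/%s/%s", version, namespace, resource)
	}
	return fmt.Sprintf("/apis/%s/%s/namespaces/%s/%s", group, version, namespace, resource)
}

func main() {
	// Correct: a core-v1 Secret.
	fmt.Println(buildPostPath("", "v1", "ns-1", "secrets"))
	// → /api/v1/namespaces/ns-1/secrets

	// What the race effectively produces: the Secret's resource name paired
	// with the group/version of a concurrent Istio request.
	fmt.Println(buildPostPath("networking.istio.io", "v1alpha3", "ns-1", "secrets"))
	// → /apis/networking.istio.io/v1alpha3/namespaces/ns-1/secrets
}
```

The second path matches the `post secrets.networking.istio.io` seen in the error, which the API server rightly rejects as not found.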
The log has:

```json
{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "authenticationInfo": {
      "principalEmail": "***"
    },
    "authorizationInfo": [
      {
        "granted": true,
        "permission": "io.istio.networking.v1alpha3.secrets.create",
        "resource": "networking.istio.io/v1alpha3/namespaces/sv-1/secrets"
      }
    ],
    "methodName": "io.istio.networking.v1alpha3.secrets.create",
    "requestMetadata": {
      "callerIp": "***",
      "callerSuppliedUserAgent": "Go-http-client/2.0"
    },
    "resourceName": "networking.istio.io/v1alpha3/namespaces/sv-1/secrets",
    "serviceName": "k8s.io",
    "status": {
      "code": 5,
      "message": "Not Found"
    }
  },
  "insertId": "***",
  "resource": {
    "type": "k8s_cluster",
    "labels": {
      "project_id": "***",
      "cluster_name": "***",
      "location": "us-central1"
    }
  },
  "timestamp": "2023-07-05T09:28:19.357551Z",
  "labels": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "access granted by IAM permissions."
  },
  "logName": "projects/da-cn-scratchnet/logs/cloudaudit.googleapis.com%2Factivity",
  "operation": {
    "id": "***",
    "producer": "k8s.io",
    "first": true,
    "last": true
  },
  "receiveTimestamp": "2023-07-05T09:28:39.732455336Z"
}
```
After some more debugging :sweat: (I could hardly run `pulumi up` successfully on my local machine at all -- retrying just failed with another error), I think I have pinpointed the problem...

It's the reuse of the config in https://github.com/kubernetes/cli-runtime/blob/807b4689df02de0db3d6191ee6bca07d6a685b54/pkg/resource/client.go#L34 and https://github.com/kubernetes/cli-runtime/blob/807b4689df02de0db3d6191ee6bca07d6a685b54/pkg/resource/client.go#L50 -- since `clientConfigFn()` returns a pointer to a `rest.Config`, the same config can be returned for concurrent actions (probably related to the `usePersistentConfig` setting, which is enabled by default, AFAICS).
I have added `cfg = rest.CopyConfig(cfg)` statements to both of these functions and haven't seen the problem again.
I'll report this to the kubernetes folks.
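The aliasing hazard described above can be sketched in plain Go with a stand-in struct (the `config` type and its fields are illustrative, not client-go's actual `rest.Config`):

```go
package main

import "fmt"

// config is a hypothetical stand-in for rest.Config; what matters is the
// pointer aliasing, not the concrete fields.
type config struct {
	GroupVersion string
	APIPath      string
}

var shared = &config{GroupVersion: "v1", APIPath: "/api"}

// brokenClientConfig returns the same pointer on every call, like a
// cached persistent config handed out without copying.
func brokenClientConfig() *config { return shared }

// fixedClientConfig returns a fresh copy per call, analogous to
// wrapping the result in rest.CopyConfig.
func fixedClientConfig() *config {
	c := *shared
	return &c
}

func main() {
	// Two "concurrent" requests: the Istio request mutates its config
	// while the core-v1 request is still in flight.
	core := brokenClientConfig()
	istio := brokenClientConfig()
	istio.GroupVersion = "networking.istio.io/v1alpha3"
	istio.APIPath = "/apis"
	// The core-v1 request now unexpectedly targets the Istio group:
	fmt.Println(core.GroupVersion) // networking.istio.io/v1alpha3

	// With a copy per call, the mutation stays local to one request:
	shared = &config{GroupVersion: "v1", APIPath: "/api"}
	core2 := fixedClientConfig()
	istio2 := fixedClientConfig()
	istio2.GroupVersion = "networking.istio.io/v1alpha3"
	fmt.Println(core2.GroupVersion) // v1
}
```

This matches the observed symptom: the failure is intermittent because it only occurs when two requests happen to overlap while sharing the pointer.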
Thank you for tracking down the cause, @cbley-da.
However, isn't this largely a result of this repo returning the same config pointer for every call? Couldn't we solve this here by doing the following?
```diff
diff --git a/provider/pkg/provider/kubeconfig.go b/provider/pkg/provider/kubeconfig.go
index 2dda299f6..06ac09f9b 100644
--- a/provider/pkg/provider/kubeconfig.go
+++ b/provider/pkg/provider/kubeconfig.go
@@ -29,7 +29,7 @@ func (k *KubeConfig) ToDiscoveryClient() (discovery.CachedDiscoveryInterface, er
 
 // ToRESTConfig implemented interface method
 func (k *KubeConfig) ToRESTConfig() (*rest.Config, error) {
-	return k.restConfig, nil
+	return rest.CopyConfig(k.restConfig), nil
 }
 
 // ToRESTMapper implemented interface method
```
That ought to solve the problem for all current and future methods that might be added to the `ClientConfigFunc` type, without deferring to any requirements that upstream might have.
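The proposed diff is an instance of the standard defensive-copy pattern for getters that hand out internal state. A minimal sketch (using a hypothetical stand-in config type, not the provider's actual `KubeConfig`):

```go
package main

import "fmt"

// restConfig is an illustrative stand-in for rest.Config.
type restConfig struct{ Host string }

// kubeConfig is an illustrative stand-in for the provider's KubeConfig.
type kubeConfig struct{ restConfig *restConfig }

// ToRESTConfig returns a copy of the cached config so that callers can
// mutate the result without corrupting shared state, mirroring what
// rest.CopyConfig achieves in the proposed fix.
func (k *kubeConfig) ToRESTConfig() *restConfig {
	c := *k.restConfig
	return &c
}

func main() {
	k := &kubeConfig{restConfig: &restConfig{Host: "https://cluster-a"}}
	c := k.ToRESTConfig()
	c.Host = "https://cluster-b"   // caller-local mutation
	fmt.Println(k.restConfig.Host) // cached config is untouched
}
```

The cost is one struct copy per call, which is negligible next to the network round-trips the config is used for.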
Thoughts @guineveresaenger ?
Does this mean this is purely a `pulumi-kubernetes` issue, and the upstream Kubernetes client PR (https://github.com/kubernetes/kubernetes/pull/119199) is also somehow resolved?
@davidd-da this solves the same problem locally, without requiring a change upstream.
I'm not entirely certain that the upstream change is necessary, since implementors of `ClientConfigFunc` like pulumi-kubernetes can decide whether or not to reuse the same config pointer across requests. I suppose it might remove a potential footgun, though documentation that mentions the pitfall might achieve a similar result.
I don't know enough about the space to know whether some use-case might find it desirable for the config to mutate - I doubt it, but if such a case does exist the upstream change would be breaking.
Is it possible to release 4.5.6 with this fix? I am not sure what the usual release cycle is.
What happened?
We have some helm charts that we use with pulumi.
When trying to install the releases to our GKE cluster, we sometimes (non-deterministically) receive the error:

```
the server could not find the requested resource
```

Looking at the messages in the Google Cloud Explorer, we can see that it tried to call a method named `io.k8s.core.v1.virtualservices.create`. We have several such entries for that method in our logs over the past 14 days, and all of them fail with error code 5, "the server could not find the requested resource".
But we do have calls to the method `io.istio.networking.v1alpha3.virtualservices.create` which return successfully. This looks to me like some kind of race condition that puts together the wrong API prefix plus resource name and operation. Calling `pulumi up` again usually gets rid of the problem (at least after some retries). We also have more such method calls that fail in the same way, such as `io.k8s.core.v1.deployments.create` (which maybe should have been calls to `io.k8s.apps.v1.deployments.create` instead).

Expected Behavior
There should be no errors, and `pulumi up` should work reliably.

Steps to reproduce
This is a bit hard to reproduce. I'll add more details or a minimal reproducer when possible.
Output of `pulumi about`:

```
CLI
Version      3.72.1
Go Version   go1.20.5
Go Compiler  gc

Plugins
NAME    VERSION
nodejs  unknown

Host
OS       nixos
Version  23.05 (Stoat)
Arch     x86_64

This project is written in nodejs: executable='/nix/store/p0f8i04zwf1dd66n2qkazk5x0fbsy7mp-nodejs-18.16.1/bin/node' version='v18.16.1'

Backend
Name           x1
URL            gs://...
User           claudio
Organizations
```
Additional context
No response
Contributing
Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).