pulumi / pulumi-kubernetes

A Pulumi resource provider for Kubernetes to manage API resources and workloads in running clusters
https://www.pulumi.com/docs/reference/clouds/kubernetes/
Apache License 2.0

Refresh fails to forget missing resources, which may exist after failed deploy #3089

Open jan-hudec opened 1 week ago

jan-hudec commented 1 week ago

What happened?

After a failed deployment that involved re-creating some deployments, I'm getting errors like:

  kubernetes:apps/v1:Deployment (deployment):
    error: Preview failed: update of resource "urn:pulumi:stack::project::kubernetes:apps/v1:Deployment::deployment" failed: deployments.apps "deployment" not found
    unable to get cluster state: deployments.apps "deployment" not found

This error occurs with both pulumi up and pulumi up --refresh. Running pulumi refresh alone does not produce the error, but it does not fix the state either.

This is a combination of two issues:

Example

Output of pulumi about

CLI          
Version      3.121.0
Go Version   go1.22.4
Go Compiler  gc

Plugins
KIND      NAME          VERSION
resource  azure-native  2.43.1
resource  azuread       5.50.0
resource  docker        4.5.4
resource  kubernetes    4.12.0
language  python        unknown
resource  random        4.16.2

Host     
OS       debian
Version  12.5
Arch     x86_64

This project is written in python: executable='/workspaces/tolion-portal/deployment/venv/bin/python' version='3.11.2'

Current Stack: organization/tolion-deployment/dev-ne-jahu1

TYPE                                             URN
pulumi:pulumi:Stack                              urn:pulumi:dev-ne-jahu1::tolion-deployment::pulumi:pulumi:Stack::tolion-deployment-dev-ne-jahu1
pulumi:providers:pulumi                          urn:pulumi:dev-ne-jahu1::tolion-deployment::pulumi:providers:pulumi::default
pulumi:pulumi:StackReference                     urn:pulumi:dev-ne-jahu1::tolion-deployment::pulumi:pulumi:StackReference::organization/tolion-infra/dev-ne
pulumi:providers:azure-native                    urn:pulumi:dev-ne-jahu1::tolion-deployment::pulumi:providers:azure-native::default_2_43_1
pulumi:providers:kubernetes                      urn:pulumi:dev-ne-jahu1::tolion-deployment::pulumi:providers:kubernetes::aks01
kubernetes:core/v1:Namespace                     urn:pulumi:dev-ne-jahu1::tolion-deployment::kubernetes:core/v1:Namespace::env-jahu1
pulumi:providers:random                          urn:pulumi:dev-ne-jahu1::tolion-deployment::pulumi:providers:random::default_4_16_2
kubernetes:apps/v1:Deployment                    urn:pulumi:dev-ne-jahu1::tolion-deployment::kubernetes:apps/v1:Deployment::redis
pulumi:providers:docker                          urn:pulumi:dev-ne-jahu1::tolion-deployment::pulumi:providers:docker::default_4_5_4
random:index/randomPassword:RandomPassword       urn:pulumi:dev-ne-jahu1::tolion-deployment::random:index/randomPassword:RandomPassword::rag-shared-secret
kubernetes:core/v1:Service                       urn:pulumi:dev-ne-jahu1::tolion-deployment::kubernetes:core/v1:Service::redis
azure-native:storage:StorageAccount              urn:pulumi:dev-ne-jahu1::tolion-deployment::azure-native:storage:StorageAccount::tolionst
azure-native:documentdb:SqlResourceSqlContainer  urn:pulumi:dev-ne-jahu1::tolion-deployment::azure-native:documentdb:SqlResourceSqlContainer::user-data
azure-native:storage:BlobContainer               urn:pulumi:dev-ne-jahu1::tolion-deployment::azure-native:storage:BlobContainer::risk-factors
azure-native:storage:BlobContainer               urn:pulumi:dev-ne-jahu1::tolion-deployment::azure-native:storage:BlobContainer::med-resources
azure-native:storage:BlobContainer               urn:pulumi:dev-ne-jahu1::tolion-deployment::azure-native:storage:BlobContainer::knowledge-engine-outputs
azure-native:storage:BlobContainer               urn:pulumi:dev-ne-jahu1::tolion-deployment::azure-native:storage:BlobContainer::med-cards
pulumi:providers:pulumi-python                   urn:pulumi:dev-ne-jahu1::tolion-deployment::pulumi:providers:pulumi-python::default
kubernetes:core/v1:Secret                        urn:pulumi:dev-ne-jahu1::tolion-deployment::kubernetes:core/v1:Secret::api-services-vars
kubernetes:core/v1:Secret                        urn:pulumi:dev-ne-jahu1::tolion-deployment::kubernetes:core/v1:Secret::rag-env-vars
docker:index/image:Image                         urn:pulumi:dev-ne-jahu1::tolion-deployment::docker:index/image:Image::api-services
docker:index/image:Image                         urn:pulumi:dev-ne-jahu1::tolion-deployment::docker:index/image:Image::mobile-app-web
docker:index/image:Image                         urn:pulumi:dev-ne-jahu1::tolion-deployment::docker:index/image:Image::rag-engine
pulumi:providers:azure-native                    urn:pulumi:dev-ne-jahu1::tolion-deployment::pulumi:providers:azure-native::default_2_47_1
kubernetes:apps/v1:Deployment                    urn:pulumi:dev-ne-jahu1::tolion-deployment::kubernetes:apps/v1:Deployment::mobile-app-web
kubernetes:core/v1:Service                       urn:pulumi:dev-ne-jahu1::tolion-deployment::kubernetes:core/v1:Service::mobile-app-web
kubernetes:networking.k8s.io/v1:Ingress          urn:pulumi:dev-ne-jahu1::tolion-deployment::kubernetes:networking.k8s.io/v1:Ingress::mobile-app-web
kubernetes:apps/v1:Deployment                    urn:pulumi:dev-ne-jahu1::tolion-deployment::kubernetes:apps/v1:Deployment::api-services
kubernetes:core/v1:Service                       urn:pulumi:dev-ne-jahu1::tolion-deployment::kubernetes:core/v1:Service::api-services
kubernetes:networking.k8s.io/v1:Ingress          urn:pulumi:dev-ne-jahu1::tolion-deployment::kubernetes:networking.k8s.io/v1:Ingress::api-services
kubernetes:apps/v1:Deployment                    urn:pulumi:dev-ne-jahu1::tolion-deployment::kubernetes:apps/v1:Deployment::rag-web
kubernetes:apps/v1:Deployment                    urn:pulumi:dev-ne-jahu1::tolion-deployment::kubernetes:apps/v1:Deployment::rag-celery
kubernetes:core/v1:Service                       urn:pulumi:dev-ne-jahu1::tolion-deployment::kubernetes:core/v1:Service::rag-web
kubernetes:networking.k8s.io/v1:Ingress          urn:pulumi:dev-ne-jahu1::tolion-deployment::kubernetes:networking.k8s.io/v1:Ingress::rag-web

Found no pending operations associated with dev-ne-jahu1

Backend        
Name           ce9359666086
URL            azblob://pulumi?storage_account=tolionfoundationstatesa
User           vscode
Organizations  
Token type     personal

Dependencies:
NAME                 VERSION
azure-cosmos         4.7.0
azure-identity       1.17.0
azure-storage-blob   12.20.0
pip                  24.0
pulumi_azure_native  2.43.1
pulumi_azuread       5.50.0
pulumi_docker        4.5.4
pulumi_kubernetes    4.12.0
pulumi_random        4.16.2
python-dotenv        1.0.1
setuptools           70.0.0
wheel                0.43.0

Pulumi locates its logs in /tmp by default

Additional context

The state mishandling that caused this (it has now happened to me twice) is probably an issue in Pulumi itself, but the Kubernetes provider should still be able to record that a resource has disappeared from the server rather than fail.

This is different from the provider's delete_unreachable and skip_update_unreachable options. Those are documented as applying when the cluster itself is unreachable, whereas here the cluster is reachable just fine; something else (either a Pulumi error or an unrelated process) deleted the resources from it.
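The distinction between "object not found" and "cluster unreachable" can be sketched as a small decision function. This is a toy model in plain Python, not the provider's actual code: NotFound, Unreachable, and refresh_action are hypothetical names, the "fail" outcome for an unreachable cluster is an assumption, and only the delete_unreachable flag mirrors the documented provider option.

```python
class NotFound(Exception):
    """Hypothetical: the API server answered, but the object does not exist."""


class Unreachable(Exception):
    """Hypothetical: the API server could not be contacted at all."""


def refresh_action(error, delete_unreachable=False):
    """Return what a refresh would do with one resource in this toy model.

    `error` is None when reading the resource from the cluster succeeded.
    """
    if error is None:
        return "update-state"
    if isinstance(error, NotFound):
        # The cluster is reachable and reports the object as gone, so the
        # resource is forgotten regardless of any unreachable-cluster option.
        return "drop-from-state"
    if isinstance(error, Unreachable):
        # Only in this branch does the unreachable-cluster option matter.
        # Assumption: without the flag, the refresh reports an error.
        return "drop-from-state" if delete_unreachable else "fail"
    # Anything else is unexpected in this model.
    raise error
```

In this model the case from this issue always takes the NotFound branch, which is why the unreachable-cluster options do not apply to it.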

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

rquitales commented 1 week ago

Sorry to hear you're experiencing this issue. I am unable to reproduce this issue with Pulumi CLI v3.122.0 and pulumi-kubernetes v4.12.0.

After a refresh event, my local state updates accurately to show that the deployment has been deleted on the cluster. Running pulumi up --refresh and pulumi refresh both correctly update the local state without errors, allowing the deployment to be recreated.

If the local state isn't refreshed to reflect the out-of-band deletion of the on-cluster deployment, it's expected that the preview would fail with the error you mentioned. This is because, by default, we perform a server-side dry-run during previews, which would fail in this case.
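The interaction between a stale state and the preview dry-run can be illustrated with a small simulation. This is a simplified sketch in plain Python, not the engine's or provider's real logic: the state and the cluster are modelled as dictionaries, and refresh, dry_run_update, preview, and NotFoundError are hypothetical names chosen for illustration.

```python
class NotFoundError(Exception):
    """Hypothetical stand-in for the API server's "not found" response."""


def refresh(state, cluster):
    """Return a new state that keeps only resources still present on the cluster."""
    return {name: spec for name, spec in state.items() if name in cluster}


def dry_run_update(name, cluster):
    """Simulate a server-side dry-run update; it needs the live object to exist."""
    if name not in cluster:
        raise NotFoundError(f'deployments.apps "{name}" not found')


def preview(program, state, cluster):
    """Plan an update for resources already in state and a create for the rest."""
    plan = {}
    for name in program:
        if name in state:
            dry_run_update(name, cluster)  # fails if the state is stale
            plan[name] = "update"
        else:
            plan[name] = "create"
    return plan


program = {"deployment": {"replicas": 1}}
stale_state = {"deployment": {"replicas": 1}}
cluster = {}  # the deployment was deleted out of band

try:
    preview(program, stale_state, cluster)
except NotFoundError as err:
    print(err)  # deployments.apps "deployment" not found

# Once refresh has dropped the missing resource, the preview plans a create.
print(preview(program, refresh(stale_state, cluster), cluster))  # {'deployment': 'create'}
```

The first call corresponds to the reported symptom, where the state still lists the deployment. The second corresponds to the behaviour described above, where refresh removes it and the deployment can be recreated.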

It seems the root issue here is that your state isn't being updated correctly. Before transferring this issue to pulumi/pulumi, could you please provide a minimal code reproduction to ensure I fully understand the problem you're facing?

Thanks!