juju / terraform-provider-juju

A Terraform provider for Juju
Apache License 2.0

storage_directives: Cannot process filesystems #537

Closed hemanthnakkina closed 1 month ago

hemanthnakkina commented 2 months ago

Description

The sunbeam-terraform plan, when run with the Juju provider plugin 0.13.0, fails with "cannot process filesystems". The failure is not consistent but occurred in 2 out of 3 builds.

Urgency

Blocker for our release

Terraform Juju Provider version

0.13.0

Terraform version

1.9.1

Juju version

3.5.2

Terraform Configuration(s)

git clone https://github.com/canonical/sunbeam-terraform.git
cd sunbeam-terraform
git fetch origin pull/83/head:BRANCH_NAME

Reproduce / Test

juju 3.5/stable + microk8s 

sudo snap install --classic terraform

git clone https://github.com/canonical/sunbeam-terraform.git
cd sunbeam-terraform
git fetch origin pull/83/head:BRANCH_NAME
git checkout BRANCH_NAME

echo '{"credential": "microk8s", "enable-vault": true, "enable-barbican": true, "enable-designate": true, "nameservers": "testing.github.", "mysql-storage": { "database": "12G" }, , "keystone-storage": { "fernet-keys": "6M", "credential-keys": "8M" } }' > terraform.tfvars.json
terraform init
terraform apply -auto-approve -var-file=channels/edge.tfvars
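
For reference, a minimal sketch (assumed wiring, not the actual sunbeam-terraform module code) of how a storage size from the tfvars above would typically be passed to the provider's storage_directives attribute on a juju_application resource; the variable name, model, and charm details are taken from the plan and state output in this issue:

variable "mysql-storage" {
  description = "Storage directives for mysql, e.g. { database = \"12G\" }"
  type        = map(string)
  default     = {}
}

resource "juju_application" "mysql" {
  name  = "mysql"
  model = juju_model.sunbeam.name

  charm {
    name    = "mysql-k8s"
    channel = "8.0/stable"
    base    = "ubuntu@22.04"
  }

  units = 1

  # Map of storage label to directive, e.g. { database = "12G" }
  storage_directives = var.mysql-storage
}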

Debug/Panic Output

╷
│ Error: Client Error
│ 
│   with module.cinder-ceph.juju_application.service,
│   on modules/openstack-api/main.tf line 27, in resource "juju_application" "service":
│   27: resource "juju_application" "service" {
│ 
│ Unable to read application, got error: cannot process volumes: cannot
│ convert volume details for volume-13: unit unit-glance-0 not found (not
│ found)
╵
Error: Process completed with exit code 1.

Notes & References

https://github.com/canonical/sunbeam-terraform/actions/runs/10140842506/job/28040230530?pr=83

I will try to come up with a smaller plan to reproduce. I have added these details to provide an early report.

hmlanigan commented 2 months ago

A smaller reproducer would be helpful here. This may be a juju bug, or something else.

anvial commented 2 months ago

Potentially a duplicate of #534.

hemanthnakkina commented 2 months ago

@hmlanigan @anvial Unfortunately I am not able to reproduce this with a smaller plan.

I tried the above setup with the following terraform vars combinations:

echo '{"credential": "microk8s" }' > terraform.tfvars.json
echo '{"credential": "microk8s", "enable-vault": true, "enable-barbican": true, "enable-designate": true, "nameservers": "testing.github." }' > terraform.tfvars.json

The issue is still reproducible sometimes. Please note that in today's new tests I do not have any storage_directives defined via terraform vars; it is set to {}.

I also observed that the plugin does not create all applications in parallel in 0.13.0. The plugin creates 7 applications, waits on some logic (the apps did not become active from Juju's perspective), and only then creates the rest of the applications. Can you clarify whether some kind of dependency check for deploying apps was introduced in the plugin in 0.12.0 or 0.13.0?

I also observed that the terraform state for each application contains the storage details of all the applications created in this plan. The bind application does not have any storage requirements.

$ terraform state show juju_application.bind[0]
# juju_application.bind[0]:
resource "juju_application" "bind" {
    config      = {}
    constraints = "arch=amd64"
    id          = "openstack:bind"
    model       = "openstack"
    name        = "bind"
    placement   = ""
    storage     = [
        {
            count = 1
            label = "certs-10"
            pool  = "kubernetes"
            size  = "5M"
        },
        {
            count = 1
            label = "config-11"
            pool  = "kubernetes"
            size  = "5M"
        },
        {
            count = 1
            label = "configurations-1"
            pool  = "kubernetes"
            size  = "1G"
        },
        {
            count = 1
            label = "configurations-2"
            pool  = "kubernetes"
            size  = "1G"
        },
        {
            count = 1
            label = "database-9"
            pool  = "kubernetes"
            size  = "1G"
        },
        {
            count = 1
            label = "databases-4"
            pool  = "kubernetes"
            size  = "100M"
        },
        {
            count = 1
            label = "rabbitmq-data"
            pool  = "kubernetes"
            size  = "1G"
        },
        {
            count = 1
            label = "tmp-12"
            pool  = "kubernetes"
            size  = "5.1G"
        },
        {
            count = 1
            label = "vault-raft-13"
            pool  = "kubernetes"
            size  = "10G"
        },
    ]
    trust       = false
    units       = 1

    charm {
        base     = "ubuntu@22.04"
        channel  = "9/edge"
        name     = "designate-bind-k8s"
        revision = 35
        series   = "jammy"
    }
}
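
For comparison, since designate-bind-k8s declares no storage, my expectation (an assumed shape, mirroring the state fields above) is that the state for this application would show an empty storage list, roughly:

resource "juju_application" "bind" {
    config      = {}
    constraints = "arch=amd64"
    id          = "openstack:bind"
    model       = "openstack"
    name        = "bind"
    placement   = ""
    storage     = []    # expected: no entries for a charm without storage requirements
    trust       = false
    units       = 1

    charm {
        base     = "ubuntu@22.04"
        channel  = "9/edge"
        name     = "designate-bind-k8s"
        revision = 35
        series   = "jammy"
    }
}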
hemanthnakkina commented 2 months ago

One more observation: in the above example, out of 120 resources, 10 are created in the first batch as expected (because of the default parallelism of 10):

1 resource - creation of the model
5 resources - creation of applications with a storage requirement (defaults or set via terraform plan)
4 resources - creation of applications with no storage requirement for the charm

The 9 application-creation resources do not complete until storage has been created for all 5 of the applications with storage requirements. Cross-checking with the k8s hostpath-provisioner pod logs shows that the PVs are created much earlier for most of the applications.

The expectation is that the 9 applications should not all have to wait until storage for every application has been created.

From the terraform apply console logs, the table below shows, per resource: resource name, creation_time, created_successfully_time.

                                                             juju_model.sunbeam: [2024-08-02 14:26:05.492143] [2024-08-02 14:26:05.632077]
                                  module.single-mysql[0].juju_application.mysql: [2024-08-02 14:26:05.660417] [2024-08-02 14:34:41.761684]
                                       module.ovn.juju_application.ovn-relay[0]: [2024-08-02 14:26:05.663018] [2024-08-02 14:34:41.509735]
                                         juju_application.certificate-authority: [2024-08-02 14:26:05.686249] [2024-08-02 14:34:41.415593]
                                                juju_application.traefik-public: [2024-08-02 14:26:05.688072] [2024-08-02 14:34:42.175205]
                                                       juju_application.bind[0]: [2024-08-02 14:26:05.688924] [2024-08-02 14:34:42.073748]
                                                      juju_application.vault[0]: [2024-08-02 14:26:05.707729] [2024-08-02 14:34:41.368112]
                                        module.ovn.juju_application.ovn-central: [2024-08-02 14:26:05.713097] [2024-08-02 14:34:42.194889]
                                      module.rabbitmq.juju_application.rabbitmq: [2024-08-02 14:26:05.713190] [2024-08-02 14:34:42.065555]
                                                       juju_application.traefik: [2024-08-02 14:26:05.735891] [2024-08-02 14:34:42.200672]

                                                            juju_offer.ca-offer: [2024-08-02 14:34:41.426497] [2024-08-02 14:34:46.585169]
                                       module.ovn.juju_offer.ovn-relay-offer[0]: [2024-08-02 14:34:41.523196] [2024-08-02 14:34:46.694237]
                                 module.ovn.juju_integration.ovn-relay-to-ca[0]: [2024-08-02 14:34:41.536230] [2024-08-02 14:34:46.878869]
                                   module.neutron.juju_application.mysql-router: [2024-08-02 14:34:41.795447] [2024-08-02 14:34:43.714353]
                                  module.keystone.juju_application.mysql-router: [2024-08-02 14:34:41.802773] [2024-08-02 14:34:43.805514]
                                       module.keystone.juju_application.service: [2024-08-02 14:34:42.084648] [2024-08-02 14:36:53.757169]
.......
.......

The corresponding hostpath-provisioner log is attached: hostpath-provisioner.log

hemanthnakkina commented 2 months ago

I have raised https://github.com/juju/terraform-provider-juju/issues/539 to discuss the above comment.

hemanthnakkina commented 2 months ago

I am able to reproduce the Client Error with the smaller terraform plan mentioned in https://github.com/juju/terraform-provider-juju/issues/539. However, the reproducer is not consistent; I got the error on the 4th run (each time I destroyed the terraform plan, removed the juju model manually, and started afresh).

Here are the attached logs: tfapplyconsoleout.log tfapplydebug.log hostpath-provisioner.log

As described in https://github.com/juju/terraform-provider-juju/issues/539, I ran with a patch that adds some printouts: prints.patch.txt

Looking at the logs, it seems the error occurs when an application (that has no storage requirement) is created before storage provisioning of any application is started.

hemanthnakkina commented 2 months ago

I retested the terraform plan mentioned in the above comment with the terraform-provider-juju main branch and got a similar error in 1 out of 7 runs.

Logs attached: tfapplydebug_main.log tfapplyconsoleout_main.log

hemanthnakkina commented 2 months ago

Retested again on the main branch with a nightly build, 20 times in a loop, and the reported issue was not observed. Including the runs from the above comment, I am able to reproduce it roughly once in 40 runs.

Let me know how we should proceed, given that the bug is very rare to reproduce on the main branch.

However, one thing I would like to see resolved is the DEBUG statements below: why does waiting for application A1 hit retry errors for the storage of application A2? That seems fundamentally wrong.

2024-08-07T23:10:56.901Z [DEBUG] provider.terraform-provider-juju_v0.14.0: waiting for application "vault": @module=juju.client err="retrying: storage label \"certs\" missing detail" @caller=/home/ubuntu/terraform-provider-juju/internal/juju/client.go:235 timestamp=2024-08-07T23:10:56.900Z
2024-08-07T23:10:57.080Z [DEBUG] provider.terraform-provider-juju_v0.14.0: waiting for application "keystone-mysql-router": err="retrying: storage label \"tmp-2\" missing detail" @caller=/home/ubuntu/terraform-provider-juju/internal/juju/client.go:235 @module=juju.client timestamp=2024-08-07T23:10:57.080Z
2024-08-07T23:10:57.206Z [DEBUG] provider.terraform-provider-juju_v0.14.0: waiting for application "keystone": @caller=/home/ubuntu/terraform-provider-juju/internal/juju/client.go:235 @module=juju.client err="retrying: storage label \"database-6\" missing detail" timestamp=2024-08-07T23:10:57.206Z
2024-08-07T23:10:57.513Z [DEBUG] provider.terraform-provider-juju_v0.14.0: waiting for application "mysql": err="retrying: storage label \"credential-keys-4\" missing detail" @caller=/home/ubuntu/terraform-provider-juju/internal/juju/client.go:235 @module=juju.client timestamp=2024-08-07T23:10:57.512Z
anvial commented 1 month ago

Hey, @hemanthnakkina, I am trying to reproduce this bug but cannot observe it.

You mentioned that you used a small reproducer from #539; is it still current?

I'm asking because I deployed (this plan) several times, with these results:

└─▪ juju status --storage
Model      Controller  Cloud/Region        Version  SLA          Timestamp
openstack  microk8s    microk8s/localhost  3.5.3    unsupported  12:17:09+03:00

App                    Version                  Status   Scale  Charm               Channel      Rev  Address         Exposed  Message
bind                                            active       1  designate-bind-k8s  9/stable       4  10.152.183.172  no
keystone                                        blocked      1  keystone-k8s        2024.1/edge  201  10.152.183.90   no       (ingress-public) integration missing
keystone-mysql-router  8.0.35-0ubuntu0.22.04.1  active       1  mysql-router-k8s    8.0/stable    96  10.152.183.105  no
mysql                  8.0.36-0ubuntu0.22.04.1  active       1  mysql-k8s           8.0/stable   153  10.152.183.41   no
vault                                           active       1  vault-k8s           1.15/edge     61  10.152.183.162  no

Unit                      Workload  Agent  Address      Ports  Message
bind/0*                   active    idle   10.1.52.250
keystone-mysql-router/0*  active    idle   10.1.52.243
keystone/0*               blocked   idle   10.1.52.221         (ingress-public) integration missing
mysql/0*                  active    idle   10.1.52.219         Primary
vault/0*                  active    idle   10.1.52.210

Storage Unit  Storage ID         Type        Pool        Mountpoint                               Size     Status    Message
keystone/0    credential-keys/5  filesystem  kubernetes  /var/lib/juju/storage/credential-keys/0  5.0 MiB  attached  Successfully provisioned volume pvc-b09a1664-1a18-4d5c-aea1-6f9bc201bf6f
keystone/0    fernet-keys/6      filesystem  kubernetes  /var/lib/juju/storage/fernet-keys/0      5.0 MiB  attached  Successfully provisioned volume pvc-d019e957-1333-47f4-9daf-14b7738fe96e
mysql/0       database/4         filesystem  kubernetes  /var/lib/juju/storage/database/0         1.0 GiB  attached  Successfully provisioned volume pvc-eb79f498-3481-4d37-beed-e45088e474b6
vault/0       certs/0            filesystem  kubernetes  /var/lib/juju/storage/certs/0            5.0 MiB  attached  Successfully provisioned volume pvc-0e49b998-ed8a-423c-8c9c-7acdf2a3299c
vault/0       config/1           filesystem  kubernetes  /var/lib/juju/storage/config/0           5.0 MiB  attached  Successfully provisioned volume pvc-8438101c-dd1a-4793-b39e-cb93502c8031
vault/0       tmp/2              filesystem  kubernetes  /var/lib/juju/storage/tmp/0              5.0 GiB  attached  Successfully provisioned volume pvc-b5f06a1f-a1c0-4d18-8f7b-58a8ba28acf9
vault/0       vault-raft/3       filesystem  kubernetes  /var/lib/juju/storage/vault-raft/0       10 GiB   attached  Successfully provisioned volume pvc-35309c9c-e93d-407f-b9bc-9ab96a8cd996

So I cannot catch the bug.

hemanthnakkina commented 1 month ago

@anvial As I mentioned in my previous comments [1] and [2], I am able to reproduce the problem in only 1 out of a total of 40 runs with the PR included in the tests, so I am happy to close the issue.

I would appreciate it if you could look at the logs mentioned in [1], form a hypothetical scenario that might cause the issue if possible, and add any further logging to the code if required, so that it will be helpful if the problem is reproduced again.

Either way, please feel free to close the issue. I would appreciate it if a minor release of terraform-provider-juju that includes the PR could be planned sooner.

[1] https://github.com/juju/terraform-provider-juju/issues/537#issuecomment-2274503369
[2] https://github.com/juju/terraform-provider-juju/issues/537#issuecomment-2277150424

anvial commented 1 month ago

@hemanthnakkina Got it, I will try to do some scripting to catch the bug.

If you find a more reliable way to catch this annoying thing, please let me know.

hemanthnakkina commented 1 month ago

@anvial I suggest not wasting your time on scripting to catch the bug, as I have already invested time running it continuously overnight. Just take a look at the logs to see whether they provide any hints on what could have happened in the error case. Let's close the issue if you are not able to find anything from the logs.

anvial commented 1 month ago

Ok, let's close this bug. We will reopen it if we find a way to reproduce it more reliably.