Closed by hemanthnakkina 1 month ago.
A smaller reproducer would be helpful here. This may be a juju bug, or something else.
Potentially a duplicate of #534.
@hmlanigan @anvial Unfortunately I am not able to reproduce this with a smaller plan.
I tried the above setup with the following terraform vars combinations:
```shell
echo '{"credential": "microk8s" }' > terraform.tfvars.json
echo '{"credential": "microk8s", "enable-vault": true, "enable-barbican": true, "enable-designate": true, "nameservers": "testing.github." }' > terraform.tfvars.json
```
The issue is reproducible sometimes. Please note that in today's new tests I do not have any storage_directives defined via terraform vars; the value is set to {}.
I also observed that the plugin does not create all applications in parallel in 0.13.0. The plugin creates 7 applications, waits on some logic (the apps did not become active from Juju's perspective), and only then creates the rest of the applications. Can you elaborate on whether some kind of dependency check for deploying apps was introduced in the plugin in 0.12.0 or 0.13.0?
I also observed that the terraform state for each application contains the storage details of all the applications created in this plan. The bind application does not have any storage requirements.
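To illustrate the expected behaviour (this is a hypothetical Python sketch, not the provider's actual Go code; the entry shape and helper name are made up for illustration), a model-wide storage listing would need to be scoped to the owning application before being written into that application's state:

```python
# Hypothetical sketch: scope a model-wide storage listing to one
# application before recording it in that application's state.
# The entry shape ("owner_unit", "label", "size") is illustrative only.

def storage_for_app(app_name, storage_entries):
    """Keep only storage owned by units of app_name (unit names are "app/N")."""
    prefix = app_name + "/"
    return [e for e in storage_entries if e["owner_unit"].startswith(prefix)]

model_storage = [
    {"owner_unit": "vault/0", "label": "certs", "size": "5M"},
    {"owner_unit": "mysql/0", "label": "database", "size": "1G"},
    {"owner_unit": "keystone/0", "label": "fernet-keys", "size": "5M"},
]

# bind declares no storage, so its state should record an empty list
# rather than every storage entry in the model:
print(storage_for_app("bind", model_storage))        # []
print(len(storage_for_app("mysql", model_storage)))  # 1
```

Under that scoping, the `terraform state show` output below would show `storage = []` for bind instead of every application's volumes.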
```
$ terraform state show juju_application.bind[0]
# juju_application.bind[0]:
resource "juju_application" "bind" {
    config      = {}
    constraints = "arch=amd64"
    id          = "openstack:bind"
    model       = "openstack"
    name        = "bind"
    placement   = ""
    storage     = [
        {
            count = 1
            label = "certs-10"
            pool  = "kubernetes"
            size  = "5M"
        },
        {
            count = 1
            label = "config-11"
            pool  = "kubernetes"
            size  = "5M"
        },
        {
            count = 1
            label = "configurations-1"
            pool  = "kubernetes"
            size  = "1G"
        },
        {
            count = 1
            label = "configurations-2"
            pool  = "kubernetes"
            size  = "1G"
        },
        {
            count = 1
            label = "database-9"
            pool  = "kubernetes"
            size  = "1G"
        },
        {
            count = 1
            label = "databases-4"
            pool  = "kubernetes"
            size  = "100M"
        },
        {
            count = 1
            label = "rabbitmq-data"
            pool  = "kubernetes"
            size  = "1G"
        },
        {
            count = 1
            label = "tmp-12"
            pool  = "kubernetes"
            size  = "5.1G"
        },
        {
            count = 1
            label = "vault-raft-13"
            pool  = "kubernetes"
            size  = "10G"
        },
    ]
    trust       = false
    units       = 1

    charm {
        base     = "ubuntu@22.04"
        channel  = "9/edge"
        name     = "designate-bind-k8s"
        revision = 35
        series   = "jammy"
    }
}
```
One more observation: in the above example, out of 120 resources, 10 resources are created in the first run as expected (because of the default parallelism of 10):

- 1 resource - creation of the model
- 5 resources - creation of applications with a storage requirement (defaults or set via the terraform plan)
- 4 resources - creation of applications with no storage requirement for the charm
The 9 application-creation resources do not complete until storage is created for all 5 applications with storage requirements. Cross-checking with the k8s hostpath-provisioner pod logs shows that PVs are created much earlier for most applications.
The expectation is that the 9 applications should not all have to wait until storage is created for every application.
From the terraform apply console logs, the table below shows each resource's name, creation time, and successful-creation time:
```
juju_model.sunbeam:                              [2024-08-02 14:26:05.492143] [2024-08-02 14:26:05.632077]
module.single-mysql[0].juju_application.mysql:   [2024-08-02 14:26:05.660417] [2024-08-02 14:34:41.761684]
module.ovn.juju_application.ovn-relay[0]:        [2024-08-02 14:26:05.663018] [2024-08-02 14:34:41.509735]
juju_application.certificate-authority:          [2024-08-02 14:26:05.686249] [2024-08-02 14:34:41.415593]
juju_application.traefik-public:                 [2024-08-02 14:26:05.688072] [2024-08-02 14:34:42.175205]
juju_application.bind[0]:                        [2024-08-02 14:26:05.688924] [2024-08-02 14:34:42.073748]
juju_application.vault[0]:                       [2024-08-02 14:26:05.707729] [2024-08-02 14:34:41.368112]
module.ovn.juju_application.ovn-central:         [2024-08-02 14:26:05.713097] [2024-08-02 14:34:42.194889]
module.rabbitmq.juju_application.rabbitmq:       [2024-08-02 14:26:05.713190] [2024-08-02 14:34:42.065555]
juju_application.traefik:                        [2024-08-02 14:26:05.735891] [2024-08-02 14:34:42.200672]
juju_offer.ca-offer:                             [2024-08-02 14:34:41.426497] [2024-08-02 14:34:46.585169]
module.ovn.juju_offer.ovn-relay-offer[0]:        [2024-08-02 14:34:41.523196] [2024-08-02 14:34:46.694237]
module.ovn.juju_integration.ovn-relay-to-ca[0]:  [2024-08-02 14:34:41.536230] [2024-08-02 14:34:46.878869]
module.neutron.juju_application.mysql-router:    [2024-08-02 14:34:41.795447] [2024-08-02 14:34:43.714353]
module.keystone.juju_application.mysql-router:   [2024-08-02 14:34:41.802773] [2024-08-02 14:34:43.805514]
module.keystone.juju_application.service:        [2024-08-02 14:34:42.084648] [2024-08-02 14:36:53.757169]
.......
.......
```
The corresponding hostpath-provisioner log is attached: hostpath-provisioner.log
I have raised https://github.com/juju/terraform-provider-juju/issues/539 to discuss the above comment.
I am able to reproduce the Client Error with the smaller terraform plan mentioned in https://github.com/juju/terraform-provider-juju/issues/539. However the reproducer is not consistent; I got the error on the 4th run (each time I destroyed the terraform plan, removed the juju model manually, and started afresh).
Here are the logs attached. tfapplyconsoleout.log tfapplydebug.log hostpath-provisioner.log
As described in https://github.com/juju/terraform-provider-juju/issues/539, I ran with a patch that adds some printouts: prints.patch.txt
Looking at the logs, it seems the error occurs when an application (one with no storage requirement) is created before storage provisioning has started for any application.
The bind application's creation started at 2024-08-05 04:25:29.541888 but did not complete, due to the error:
```
│ Error: Client Error
│
│ with juju_application.bind,
│ on main.tf line 116, in resource "juju_application" "bind":
│ 116: resource "juju_application" "bind" {
│
│ Unable to read application, got error: cannot process volumes: cannot convert volume details for volume-4: unit unit-keystone-0 not found (not found)

2024-08-05T04:25:31.348Z [ERROR] provider.terraform-provider-juju_v0.14.0: failed to get status: @caller=/home/ubuntu/terraform-provider-juju/internal/juju/client.go:239 @module=juju.client error="cannot process volumes: cannot convert volume details for volume-4: unit unit-keystone-0 not found (not found)" timestamp=2024-08-05T04:25:31.348Z
2024-08-05T04:25:31.349Z [ERROR] provider.terraform-provider-juju_v0.14.0: Response contains error diagnostic: diagnostic_summary="Client Error" tf_provider_addr=registry.terraform.io/juju/juju tf_resource_type=juju_application tf_rpc=ApplyResourceChange @caller=/home/ubuntu/go/pkg/mod/github.com/hashicorp/terraform-plugin-go@v0.23.0/tfprotov6/internal/diag/diagnostics.go:58 diagnostic_severity=ERROR diagnostic_detail="Unable to read application, got error: cannot process volumes: cannot convert volume details for volume-4: unit unit-keystone-0 not found (not found)" tf_proto_version=6.6 tf_req_id=cbad6cb8-6ddb-3650-8d5b-f54463d11489 @module=sdk.proto timestamp=2024-08-05T04:25:31.349Z
2024-08-05T04:25:31.354Z [DEBUG] State storage *statemgr.Filesystem declined to persist a state snapshot
2024-08-05T04:25:31.354Z [ERROR] vertex "juju_application.bind" error: Client Error
```
Storage provisioning started at 04:25:42.152508, whereas the bind application's Client Error is at 2024-08-05T04:25:31.354Z.
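As a sketch of what a more tolerant read path could look like (this is illustrative Python, not the provider's Go code, and all the names are hypothetical): a volume whose owning unit does not exist yet could be treated as pending and retried, instead of failing the whole application read with "unit ... not found".

```python
# Hypothetical sketch: instead of failing the whole application read when a
# volume's owning unit is missing, treat that volume as pending and leave it
# for the next retry.

def convert_volumes(volume_ids, unit_by_volume):
    """Split volumes into ones we can convert now and ones to retry later."""
    converted, pending = [], []
    for vid in volume_ids:
        unit = unit_by_volume.get(vid)
        if unit is None:
            pending.append(vid)  # owning unit not created yet; retry later
        else:
            converted.append({"volume": vid, "unit": unit})
    return converted, pending

# volume-4 belongs to keystone/0, which has not come up yet:
done, pending = convert_volumes(
    ["volume-3", "volume-4"], {"volume-3": "mysql/0"}
)
print(done)     # [{'volume': 'volume-3', 'unit': 'mysql/0'}]
print(pending)  # ['volume-4']
```

With such handling the race between application creation and storage provisioning would surface as a retry rather than a hard Client Error.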
I retested the terraform plan mentioned in the above comment with the terraform-provider-juju main branch and got a similar error in 1 out of 7 runs.
Logs attached: tfapplydebug_main.log tfapplyconsoleout_main.log
I retested again on the main branch with the nightly build, 20 times in a loop, and the reported issue was not observed. If I include the scenario from the above comment, I am able to reproduce it roughly once in 40 runs.
Let me know how we should proceed, given that the bug is very rare to reproduce on the main branch.
However, one thing I would like to get resolved is the DEBUG statements below: why does waiting for application A1 produce retry errors for the storage of application A2? That seems fundamentally wrong.
```
2024-08-07T23:10:56.901Z [DEBUG] provider.terraform-provider-juju_v0.14.0: waiting for application "vault": @module=juju.client err="retrying: storage label \"certs\" missing detail" @caller=/home/ubuntu/terraform-provider-juju/internal/juju/client.go:235 timestamp=2024-08-07T23:10:56.900Z
2024-08-07T23:10:57.080Z [DEBUG] provider.terraform-provider-juju_v0.14.0: waiting for application "keystone-mysql-router": err="retrying: storage label \"tmp-2\" missing detail" @caller=/home/ubuntu/terraform-provider-juju/internal/juju/client.go:235 @module=juju.client timestamp=2024-08-07T23:10:57.080Z
2024-08-07T23:10:57.206Z [DEBUG] provider.terraform-provider-juju_v0.14.0: waiting for application "keystone": @caller=/home/ubuntu/terraform-provider-juju/internal/juju/client.go:235 @module=juju.client err="retrying: storage label \"database-6\" missing detail" timestamp=2024-08-07T23:10:57.206Z
2024-08-07T23:10:57.513Z [DEBUG] provider.terraform-provider-juju_v0.14.0: waiting for application "mysql": err="retrying: storage label \"credential-keys-4\" missing detail" @caller=/home/ubuntu/terraform-provider-juju/internal/juju/client.go:235 @module=juju.client timestamp=2024-08-07T23:10:57.512Z
```
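The symptom in those DEBUG lines can be illustrated with a small sketch (hypothetical Python, not the provider's code): if the wait loop checks every storage label visible in the model instead of only the labels the waited-on application declares, application A1 will keep retrying on A2's storage.

```python
# Hypothetical sketch of the symptom: an unscoped wait predicate makes one
# application "retry" on another application's storage labels.

def missing_labels(app, declared, provisioned, scope_to_app=True):
    """Return storage labels that still lack detail for the wait on `app`."""
    if scope_to_app:
        labels = declared[app]
    else:  # check everything in the model, as the DEBUG lines suggest
        labels = [l for ls in declared.values() for l in ls]
    return [l for l in labels if l not in provisioned]

declared = {"vault": ["certs", "config"], "mysql": ["database"]}
provisioned = {"config", "database"}  # vault's "certs" not provisioned yet

# Scoped: mysql is ready even though vault's "certs" is still pending.
print(missing_labels("mysql", declared, provisioned))                      # []
# Unscoped: waiting for mysql appears to retry on vault's "certs".
print(missing_labels("mysql", declared, provisioned, scope_to_app=False))  # ['certs']
```

This is only a model of the observed behaviour, but it matches lines like `waiting for application "mysql": ... storage label "credential-keys-4" missing detail`, where the label belongs to a different application.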
Hey, @hemanthnakkina, I am trying to reproduce this bug but cannot observe it.
You mentioned that you used a small reproducer from #539; is that still current?
I am asking because I have deployed (this plan) several times with the following results:
```
$ juju status --storage
Model      Controller  Cloud/Region        Version  SLA          Timestamp
openstack  microk8s    microk8s/localhost  3.5.3    unsupported  12:17:09+03:00

App                    Version                  Status   Scale  Charm               Channel      Rev  Address         Exposed  Message
bind                                            active       1  designate-bind-k8s  9/stable       4  10.152.183.172  no
keystone                                        blocked      1  keystone-k8s        2024.1/edge  201  10.152.183.90   no       (ingress-public) integration missing
keystone-mysql-router  8.0.35-0ubuntu0.22.04.1  active       1  mysql-router-k8s    8.0/stable    96  10.152.183.105  no
mysql                  8.0.36-0ubuntu0.22.04.1  active       1  mysql-k8s           8.0/stable   153  10.152.183.41   no
vault                                           active       1  vault-k8s           1.15/edge     61  10.152.183.162  no

Unit                      Workload  Agent  Address      Ports  Message
bind/0*                   active    idle   10.1.52.250
keystone-mysql-router/0*  active    idle   10.1.52.243
keystone/0*               blocked   idle   10.1.52.221         (ingress-public) integration missing
mysql/0*                  active    idle   10.1.52.219         Primary
vault/0*                  active    idle   10.1.52.210

Storage Unit  Storage ID         Type        Pool        Mountpoint                               Size     Status    Message
keystone/0    credential-keys/5  filesystem  kubernetes  /var/lib/juju/storage/credential-keys/0  5.0 MiB  attached  Successfully provisioned volume pvc-b09a1664-1a18-4d5c-aea1-6f9bc201bf6f
keystone/0    fernet-keys/6      filesystem  kubernetes  /var/lib/juju/storage/fernet-keys/0      5.0 MiB  attached  Successfully provisioned volume pvc-d019e957-1333-47f4-9daf-14b7738fe96e
mysql/0       database/4         filesystem  kubernetes  /var/lib/juju/storage/database/0         1.0 GiB  attached  Successfully provisioned volume pvc-eb79f498-3481-4d37-beed-e45088e474b6
vault/0       certs/0            filesystem  kubernetes  /var/lib/juju/storage/certs/0            5.0 MiB  attached  Successfully provisioned volume pvc-0e49b998-ed8a-423c-8c9c-7acdf2a3299c
vault/0       config/1           filesystem  kubernetes  /var/lib/juju/storage/config/0           5.0 MiB  attached  Successfully provisioned volume pvc-8438101c-dd1a-4793-b39e-cb93502c8031
vault/0       tmp/2              filesystem  kubernetes  /var/lib/juju/storage/tmp/0              5.0 GiB  attached  Successfully provisioned volume pvc-b5f06a1f-a1c0-4d18-8f7b-58a8ba28acf9
vault/0       vault-raft/3       filesystem  kubernetes  /var/lib/juju/storage/vault-raft/0       10 GiB   attached  Successfully provisioned volume pvc-35309c9c-e93d-407f-b9bc-9ab96a8cd996
```
So I cannot catch the bug.
@anvial As I mentioned in my previous comments [1] and [2], I am able to reproduce the problem in only 1 out of 40 runs in total with the PR included in the tests, so I am happy to close the issue.
I would appreciate it if you could look at the logs mentioned in [1] and, if possible, form a hypothetical scenario that might cause the issue, and add any further logging in the code that would help if the problem is reproduced again.
Either way, please feel free to close the issue. I would appreciate it if a minor release of terraform-provider-juju that includes the PR could be planned sooner.
[1] https://github.com/juju/terraform-provider-juju/issues/537#issuecomment-2274503369 [2] https://github.com/juju/terraform-provider-juju/issues/537#issuecomment-2277150424
@hemanthnakkina Got you; I will try to do some scripting to try to catch the bug.
If you find a more reliable way to trigger this annoying thing, please let me know.
@anvial I suggest not wasting your time on scripting to catch the bug, as I have already invested time running it continuously overnight. Just take a look at the logs to see whether they provide any hints about what could have happened in the error case. Let's close the case if you are not able to find anything in the logs.
Ok, let's close this bug. We will reopen it if we find a way to reproduce it more reliably.
Description
The sunbeam-terraform plan, when run with the juju plugin 0.13.0, fails with `cannot process filesystems`. The failure is not consistent but happened in 2 out of 3 builds.
Urgency
Blocker for our release
Terraform Juju Provider version
0.13.0
Terraform version
1.9.1
Juju version
3.5.2
Terraform Configuration(s)
Reproduce / Test
Debug/Panic Output
Notes & References
https://github.com/canonical/sunbeam-terraform/actions/runs/10140842506/job/28040230530?pr=83
I will try to come up with a smaller plan to reproduce. I have added these details to provide an early report.