ocp-power-automation / ocp4-upi-powervm

OpenShift on IBM PowerVM servers managed using PowerVC
Apache License 2.0

Removal of bootstrap node with NFS for shared storage hangs or has other issues #270

Open · robgjertsen1 opened this issue 1 year ago

robgjertsen1 commented 1 year ago

I've tried this three times and it fails each time. Initially I installed a cluster, ran workload on it, and it was OK; I only saw problems once I tried to remove the bootstrap node, at which point PVC access started failing. The bootstrap removal got hung up (the node was removed from PowerVC, but terraform got stuck later on), an odd problem where NFS I/Os were hung with no obvious issue in the physical storage. I then tried removing the bootstrap node immediately after recreating the cluster. That also failed: once because the NFS filesystem wasn't mounted, and once because the terraform execution again got stuck after removing the bootstrap node from PowerVC (even though the NFS mount was OK there).

Here are some details from the last attempt. We are stuck in the Gathering Facts task of the ansible ocp4-helpernode playbook.

Output from terraform:

```
$ terraform apply -var-file var.tfvars
module.workernodes.data.ignition_file.w_hostname[0]: Reading...
module.bootstrapnode.data.ignition_file.b_hostname: Reading...
module.masternodes.data.ignition_file.m_hostname[0]: Reading...
module.bootstrapnode.data.ignition_file.b_hostname: Read complete after 0s [id=1ec8928da9e89f9b35deb26dd484665fda91d99d73e31330dce71edf3a4e19cc]
module.masternodes.data.ignition_file.m_hostname[0]: Read complete after 0s [id=7551bfa9e87523c711bf18607b8af5ccfee1657ea6c4817bbc3dd2186602f590]
module.workernodes.data.ignition_file.w_hostname[0]: Read complete after 0s [id=28b9dcc333049039879c9c1e94f95816f0341945047e8ae59674e1233f72be83]
module.workernodes.data.openstack_compute_flavor_v2.worker: Reading...
module.masternodes.data.openstack_compute_flavor_v2.master: Reading...
module.bootstrapnode.data.openstack_compute_flavor_v2.bootstrap: Reading...
module.bastion.openstack_compute_keypair_v2.key-pair[0]: Refreshing state... [id=merlin2-keypair]
module.bastion.openstack_blockstorage_volume_v3.storage_volume[0]: Refreshing state... [id=317b2360-639d-4d8a-8b34-a58f1bb19ee9]
module.bastion.data.openstack_compute_flavor_v2.bastion: Reading...
module.network.data.openstack_networking_network_v2.network: Reading...
module.network.data.openstack_networking_network_v2.network: Read complete after 2s [id=f5e55ae3-c790-4a29-91e1-ce04a1acfc69]
module.network.data.openstack_networking_subnet_v2.subnet: Reading...
module.workernodes.data.openstack_compute_flavor_v2.worker: Read complete after 2s [id=1e5b0eed-6681-4305-8bc9-e20afb9f7cca]
module.bootstrapnode.data.openstack_compute_flavor_v2.bootstrap: Read complete after 2s [id=874b188b-074a-4042-b0c8-3a22f04f8302]
module.masternodes.data.openstack_compute_flavor_v2.master: Read complete after 2s [id=d364331a-9f24-4784-bced-3765e0c097ed]
module.bastion.data.openstack_compute_flavor_v2.bastion: Read complete after 2s [id=874b188b-074a-4042-b0c8-3a22f04f8302]
module.network.data.openstack_networking_subnet_v2.subnet: Read complete after 0s [id=63011d28-987a-4ae1-a094-595f2e513a23]
module.network.openstack_networking_port_v2.bastion_port[0]: Refreshing state... [id=91a5c711-f109-4ec0-91e7-86cd821233cc]
module.network.openstack_networking_port_v2.bootstrap_port[0]: Refreshing state... [id=7072aba5-ac95-4b36-994a-1855f2624b55]
module.bastion.openstack_compute_instance_v2.bastion[0]: Refreshing state... [id=f874bcaf-e8d5-46f6-8088-652ee3b9930a]
module.network.openstack_networking_port_v2.master_port[0]: Refreshing state... [id=6d0e1c9a-aa11-48c3-80cd-e22c2cbe8abe]
module.network.openstack_networking_port_v2.worker_port[0]: Refreshing state... [id=3a683058-4670-4f2a-a701-fc21e56142de]
module.bastion.null_resource.bastion_init[0]: Refreshing state... [id=5535521664652524244]
module.bastion.openstack_compute_volume_attach_v2.storage_v_attach[0]: Refreshing state... [id=f874bcaf-e8d5-46f6-8088-652ee3b9930a/317b2360-639d-4d8a-8b34-a58f1bb19ee9]
module.bastion.null_resource.bastion_register[0]: Refreshing state... [id=3907655651701511596]
module.bastion.null_resource.enable_repos[0]: Refreshing state... [id=8446503032307780766]
module.bastion.null_resource.bastion_packages[0]: Refreshing state... [id=7563035921889930989]
module.bastion.null_resource.setup_nfs_disk[0]: Refreshing state... [id=5783700801307001475]
module.workernodes.data.ignition_config.worker[0]: Reading...
module.bootstrapnode.data.ignition_config.bootstrap: Reading...
module.workernodes.data.ignition_config.worker[0]: Read complete after 0s [id=85d98bf1d766507417ab5b578be1abe6f3e6c0a80e57a931862b80f5ff8b4153]
module.masternodes.data.ignition_config.master[0]: Reading...
module.helpernode.null_resource.config: Refreshing state... [id=3876494058890088587]
module.masternodes.data.ignition_config.master[0]: Read complete after 0s [id=7a035ac3f88d415956417f73f6ecd986a9d339cdbbea088f5332e0cd8a46de94]
module.bootstrapnode.data.ignition_config.bootstrap: Read complete after 0s [id=87f77fe2ea79f17615628f4222c5676d8c8062883faa6236a4ef9d6087f86729]
module.installconfig.null_resource.pre_install[0]: Refreshing state... [id=1417400108243665749]
module.installconfig.null_resource.install_config: Refreshing state... [id=4683285385320449241]
module.bootstrapnode.openstack_compute_instance_v2.bootstrap[0]: Refreshing state... [id=2c289dad-9552-4037-8105-f798406ff623]
module.bootstrapconfig.null_resource.bootstrap_config: Refreshing state... [id=6822641950134297211]
module.masternodes.openstack_compute_instance_v2.master[0]: Refreshing state... [id=0b820558-2077-40ee-81d9-811aa7dbc6d0]
module.bootstrapcomplete.null_resource.bootstrap_complete: Refreshing state... [id=285966427274519477]
module.workernodes.openstack_compute_instance_v2.worker[0]: Refreshing state... [id=78da4c4f-d882-49c0-9e6b-94100492be63]
module.workernodes.null_resource.remove_worker[0]: Refreshing state... [id=1732102747293394534]
module.install.null_resource.install: Refreshing state... [id=2875536266590346928]
module.install.null_resource.upgrade[0]: Refreshing state... [id=8719004556488409504]
```

```
Terraform used the selected providers to generate the following execution plan.
Resource actions are indicated with the following symbols:

Terraform will perform the following actions:

  # module.bastion.openstack_blockstorage_volume_v3.storage_volume[0] must be replaced
-/+ resource "openstack_blockstorage_volume_v3" "storage_volume" {
      ~ attachment = [

Plan: 3 to add, 0 to change, 5 to destroy.

Changes to Outputs:
  ~ bootstrap_ip = "9.5.36.167" -> ""

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes
```

```
module.bootstrapnode.openstack_compute_instance_v2.bootstrap[0]: Destroying... [id=2c289dad-9552-4037-8105-f798406ff623]
module.bootstrapnode.openstack_compute_instance_v2.bootstrap[0]: Still destroying... [id=2c289dad-9552-4037-8105-f798406ff623, 10s elapsed]
module.bootstrapnode.openstack_compute_instance_v2.bootstrap[0]: Still destroying... [id=2c289dad-9552-4037-8105-f798406ff623, 20s elapsed]
module.bootstrapnode.openstack_compute_instance_v2.bootstrap[0]: Still destroying... [id=2c289dad-9552-4037-8105-f798406ff623, 30s elapsed]
module.bootstrapnode.openstack_compute_instance_v2.bootstrap[0]: Destruction complete after 34s
module.helpernode.null_resource.config: Destroying... [id=3876494058890088587]
module.helpernode.null_resource.config: Destruction complete after 0s
module.bastion.openstack_compute_volume_attach_v2.storage_v_attach[0]: Destroying... [id=f874bcaf-e8d5-46f6-8088-652ee3b9930a/317b2360-639d-4d8a-8b34-a58f1bb19ee9]
module.network.openstack_networking_port_v2.bootstrap_port[0]: Destroying... [id=7072aba5-ac95-4b36-994a-1855f2624b55]
module.network.openstack_networking_port_v2.bootstrap_port[0]: Destruction complete after 7s
module.bastion.openstack_compute_volume_attach_v2.storage_v_attach[0]: Destruction complete after 9s
module.bastion.openstack_blockstorage_volume_v3.storage_volume[0]: Destroying... [id=317b2360-639d-4d8a-8b34-a58f1bb19ee9]
module.bastion.openstack_blockstorage_volume_v3.storage_volume[0]: Still destroying... [id=317b2360-639d-4d8a-8b34-a58f1bb19ee9, 10s elapsed]
module.bastion.openstack_blockstorage_volume_v3.storage_volume[0]: Destruction complete after 11s
module.helpernode.null_resource.config: Creating...
module.bastion.openstack_blockstorage_volume_v3.storage_volume[0]: Creating...
module.helpernode.null_resource.config: Provisioning with 'remote-exec'...
module.helpernode.null_resource.config (remote-exec): Connecting to remote host via SSH...
module.helpernode.null_resource.config (remote-exec):   Host: 9.5.36.166
module.helpernode.null_resource.config (remote-exec):   User: root
module.helpernode.null_resource.config (remote-exec):   Password: false
module.helpernode.null_resource.config (remote-exec):   Private key: true
module.helpernode.null_resource.config (remote-exec):   Certificate: false
module.helpernode.null_resource.config (remote-exec):   SSH Agent: false
module.helpernode.null_resource.config (remote-exec):   Checking Host Key: false
module.helpernode.null_resource.config (remote-exec):   Target Platform: unix
module.helpernode.null_resource.config (remote-exec): Connected!
module.helpernode.null_resource.config (remote-exec): Cloning into ocp4-helpernode...
module.helpernode.null_resource.config (remote-exec): Note: switching to 'adb1102f64b2f25a8a1b44a96c414f293d72d3fc'.

module.helpernode.null_resource.config (remote-exec): You are in 'detached HEAD' state. You can look around, make experimental
module.helpernode.null_resource.config (remote-exec): changes and commit them, and you can discard any commits you make in this
module.helpernode.null_resource.config (remote-exec): state without impacting any branches by switching back to a branch.

module.helpernode.null_resource.config (remote-exec): If you want to create a new branch to retain commits you create, you may
module.helpernode.null_resource.config (remote-exec): do so (now or later) by using -c with the switch command. Example:

module.helpernode.null_resource.config (remote-exec):   git switch -c <new-branch-name>

module.helpernode.null_resource.config (remote-exec): Or undo this operation with:

module.helpernode.null_resource.config (remote-exec):   git switch -

module.helpernode.null_resource.config (remote-exec): Turn off this advice by setting config variable advice.detachedHead to false

module.helpernode.null_resource.config (remote-exec): HEAD is now at adb1102 Merge pull request #305 from redhat-cop/devel
module.helpernode.null_resource.config: Provisioning with 'file'...
module.helpernode.null_resource.config: Still creating... [10s elapsed]
module.bastion.openstack_blockstorage_volume_v3.storage_volume[0]: Still creating... [10s elapsed]
module.helpernode.null_resource.config: Provisioning with 'file'...
module.bastion.openstack_blockstorage_volume_v3.storage_volume[0]: Creation complete after 12s [id=35ba1876-52b6-4769-9950-eaf3be077eaa]
module.bastion.openstack_compute_volume_attach_v2.storage_v_attach[0]: Creating...
module.helpernode.null_resource.config: Provisioning with 'file'...
module.helpernode.null_resource.config: Provisioning with 'remote-exec'...
module.helpernode.null_resource.config (remote-exec): Connecting to remote host via SSH...
module.helpernode.null_resource.config (remote-exec):   Host: 9.5.36.166
module.helpernode.null_resource.config (remote-exec):   User: root
module.helpernode.null_resource.config (remote-exec):   Password: false
module.helpernode.null_resource.config (remote-exec):   Private key: true
module.helpernode.null_resource.config (remote-exec):   Certificate: false
module.helpernode.null_resource.config (remote-exec):   SSH Agent: false
module.helpernode.null_resource.config (remote-exec):   Checking Host Key: false
module.helpernode.null_resource.config (remote-exec):   Target Platform: unix
module.helpernode.null_resource.config (remote-exec): Connected!
module.bastion.openstack_compute_volume_attach_v2.storage_v_attach[0]: Creation complete after 7s [id=f874bcaf-e8d5-46f6-8088-652ee3b9930a/35ba1876-52b6-4769-9950-eaf3be077eaa]
module.helpernode.null_resource.config (remote-exec): Running ocp4-helpernode playbook...
module.helpernode.null_resource.config: Still creating... [20s elapsed]
module.helpernode.null_resource.config (remote-exec): Using /root/ocp4-helpernode/ansible.cfg as config file

module.helpernode.null_resource.config (remote-exec): PLAY [all] ***

module.helpernode.null_resource.config (remote-exec): TASK [Gathering Facts] ***

module.helpernode.null_resource.config: Still creating... [30s elapsed]
...
module.helpernode.null_resource.config: Still creating... [17h6m25s elapsed]
```
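As an aside, one way to catch a destructive change like this before approving the apply is to save the plan text and search it for Terraform's replacement marker. A self-contained sketch (the inlined `plan.txt` content is a stand-in for real `terraform plan -var-file var.tfvars -no-color` output):

```shell
# Save plan output (sample stand-in here) and search for attributes that
# force a destroy-and-recreate before typing 'yes' at the apply prompt.
cat > plan.txt <<'EOF'
  # module.bastion.openstack_blockstorage_volume_v3.storage_volume[0] must be replaced
      ~ volume_type = "old template" -> "new id" # forces replacement
EOF
grep -n "forces replacement" plan.txt
```

Any line tagged `# forces replacement` names the attribute driving the recreation of the resource.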

Initiating node info:

```
$ ps -ef | grep terraform
gjertsen 3410758   12015  0 May01 pts/1 00:08:32 terraform apply -var-file var.tfvars
gjertsen 3411154 3410758  0 May01 pts/1 00:00:03 .terraform/providers/registry.terraform.io/hashicorp/null/3.2.1/linux_amd64/terraform-provider-null_v3.2.1_x5
```

Bastion node state:

```
$ ps -ef | grep ansible
root 67764 67738 7 May01 pts/1 01:21:07 /usr/libexec/platform-python /usr/bin/ansible-playbook -i inventory -e @helpernode_vars.yaml tasks/main.yml -v --become
root 67771 67764 0 May01 pts/1 00:00:00 /usr/libexec/platform-python /usr/bin/ansible-playbook -i inventory -e @helpernode_vars.yaml tasks/main.yml -v --become
root 67782     1 0 May01 ?     00:00:00 ssh: /root/.ansible/cp/08610c3669 [mux]
root 67890 67771 0 May01 pts/1 00:00:00 ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User="root" -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/08610c3669 -tt 9.5.36.166 /bin/sh -c '/usr/libexec/platform-python /root/.ansible/tmp/ansible-tmp-1682979218.3504968-67771-94255852006671/AnsiballZ_setup.py && sleep 0'
root 67891 67783 0 May01 pts/3 00:00:00 /bin/sh -c /usr/libexec/platform-python /root/.ansible/tmp/ansible-tmp-1682979218.3504968-67771-94255852006671/AnsiballZ_setup.py && sleep 0
root 67912 67891 0 May01 pts/3 00:00:04 /usr/libexec/platform-python /root/.ansible/tmp/ansible-tmp-1682979218.3504968-67771-94255852006671/AnsiballZ_setup.py
```

The NFS mount looks OK:

```
$ exportfs
/export

$ ls -al /export
total 0
drwxrwxrwx.  3 nobody nobody  92 May  1 17:41 .
dr-xr-xr-x. 19 root   root   259 May  1 17:06 ..
drwxrwxrwx.  2 nobody nobody   6 May  1 17:41 openshift-image-registry-registry-pvc-pvc-5b20c6ca-b184-41eb-b145-c5253c26015a
```
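For hangs like this, it can also help to check whether the stuck ansible/python processes are sitting in uninterruptible sleep (state `D`), the classic signature of hung NFS I/O. A generic sketch, not taken from the original report:

```shell
# List processes in uninterruptible sleep (state D) along with their kernel
# wait channel; processes hung on NFS I/O typically stay parked here.
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'
```

An empty result (header only) suggests nothing is currently blocked in the kernel on I/O.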

yussufsh commented 1 year ago

Not sure how I missed this issue. I'd suggest using markdown code formatting when pasting console logs.

Coming back to the root cause: the key line showing why the NFS disk (`module.bastion.openstack_blockstorage_volume_v3.storage_volume[0]`) is being recreated is:

```
~ volume_type = "v7kamp.rch.stglabs.ibm.com base template" -> "63272fa4-2a99-4a94-ab1e-2a12fb64b1f8" # forces replacement
```

It seems the Terraform OpenStack provider is now returning the storage template's ID when querying the service, so Terraform detects a change in the template, as shown above. We have not used this feature recently, but it looks like something changed such that only the ID will work.

As a workaround, please set the variable `volume_storage_template` to the value `"63272fa4-2a99-4a94-ab1e-2a12fb64b1f8"` and run apply again. Terraform should then no longer detect a change that forces replacement.
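A sketch of what that change would look like in `var.tfvars` (hypothetical excerpt; surrounding settings omitted):

```hcl
# Refer to the storage template by ID rather than by name, so the
# provider's name-vs-ID comparison no longer forces volume replacement.
volume_storage_template = "63272fa4-2a99-4a94-ab1e-2a12fb64b1f8"
```

After editing, `terraform apply -var-file var.tfvars` should show the bastion storage volume as unchanged. If the ID needs to be looked up, `openstack volume type list` (from python-openstackclient, configured against the PowerVC endpoint) lists volume type names alongside their IDs.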