openstack-k8s-operators / edpm-ansible

External Dataplane Management Ansible Playbooks
https://openstack-k8s-operators.github.io/edpm-ansible/
Apache License 2.0

`osp.edpm.edpm_kernel : Reboot after kernel args update` Task Fails #298

Status: Closed (vkhitrin closed this issue 1 year ago)

vkhitrin commented 1 year ago

When edpm_kernel_args is set, the play fails during dataplane deployment and the host is not rebooted.

fatal: [edpm-compute-0]: FAILED! => {
    "changed": false,
    "elapsed": 0,
    "msg": "Reboot command failed. Error was: '\u001b[0;1;38;5;185mFailed to set wall message, ignoring: Interactive authentication required.\u001b[0m\r\n\u001b[0;1;38;5;185mFailed to schedule shutdown: Interactive authentication required.\u001b[0m, OpenSSH_8.8p1, OpenSSL 3.0.8 7 Feb 2023\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug3: /etc/ssh/ssh_config line 55: Including file /etc/ssh/ssh_config.d/50-redhat.conf depth 0\r\ndebug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf\r\ndebug2: checking match for 'final all' host edpm-compute-0 originally edpm-compute-0\r\ndebug3: /etc/ssh/ssh_config.d/50-redhat.conf line 3: not matched 'final'\r\ndebug2: match not found\r\ndebug3: /etc/ssh/ssh_config.d/50-redhat.conf line 5: Including file /etc/crypto-policies/back-ends/openssh.config depth 1 (parse only)\r\ndebug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config\r\ndebug3: gss kex names ok: [gss-curve25519-sha256-,gss-nistp256-sha256-,gss-group14-sha256-,gss-group16-sha512-]\r\ndebug3: kex names ok: [curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group14-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512]\r\ndebug1: configuration requests final Match pass\r\ndebug1: re-parsing configuration\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug3: /etc/ssh/ssh_config line 55: Including file /etc/ssh/ssh_config.d/50-redhat.conf depth 0\r\ndebug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf\r\ndebug2: checking match for 'final all' host edpm-compute-0 originally edpm-compute-0\r\ndebug3: /etc/ssh/ssh_config.d/50-redhat.conf line 3: matched 'final'\r\ndebug2: match found\r\ndebug3: /etc/ssh/ssh_config.d/50-redhat.conf line 5: Including file /etc/crypto-policies/back-ends/openssh.config depth 1\r\ndebug1: Reading configuration data 
/etc/crypto-policies/back-ends/openssh.config\r\ndebug3: gss kex names ok: [gss-curve25519-sha256-,gss-nistp256-sha256-,gss-group14-sha256-,gss-group16-sha512-]\r\ndebug3: kex names ok: [curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group14-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512]\r\ndebug3: expanded UserKnownHostsFile '~/.ssh/known_hosts' -> '/home/runner/.ssh/known_hosts'\r\ndebug3: expanded UserKnownHostsFile '~/.ssh/known_hosts2' -> '/home/runner/.ssh/known_hosts2'\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 32\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 1\r\nShared connection to edpm-compute-0 closed.'",
    "rebooted": false,
    "start": "2023-08-26T17:33:09.830624"
}

Trace of this task (excluding the failure above): https://paste.openstack.org/show/bBW5C5xX5YCgxPu0Hyrj/
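The "Failed to set wall message … Interactive authentication required" lines in the error indicate that systemd-logind rejected the shutdown request because it was issued without privilege escalation. A minimal sketch of a reboot task with escalation enabled (assuming the role uses the standard `ansible.builtin.reboot` module; the timeout value is illustrative, not taken from the role):

```yaml
- name: Reboot after kernel args update
  become: true               # without escalation, logind demands interactive auth
  ansible.builtin.reboot:
    reboot_timeout: 600      # illustrative; slow bare metal POST may need more
```

With `become: true`, the module's shutdown call runs as root, so logind no longer asks for interactive authentication.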

Ansible variables passed to the CR:

{
  "dns_search_domains": [],
  "edpm_chrony_ntp_servers": [
    "clock.redhat.com"
  ],
  "edpm_iscsid_image": "{{ registry_url }}/openstack-iscsid:{{ image_tag }}",
  "edpm_kernel_args": "default_hugepagesz=1GB hugepagesz=1G hugepages=64 iommu=pt intel_iommu=on tsx=off isolcpus=2-47,50-95",
  "edpm_logrotate_crond_image": "{{ registry_url }}/openstack-cron:{{ image_tag }}",
  "edpm_network_config_hide_sensitive_logs": false,
  "edpm_network_config_os_net_config_mappings": {
    "nodegroup1": {
      "dmiString": "system-product-name",
      "id": "PowerEdge R750",
      "nic1": "eno8303",
      "nic10": "ens1f3",
      "nic11": "ens2f0np0",
      "nic12": "ens2f1np1",
      "nic2": "eno8403",
      "nic3": "eno12399",
      "nic4": "eno12409",
      "nic5": "eno12419",
      "nic6": "eno12429",
      "nic7": "ens1f0",
      "nic8": "ens1f1",
      "nic9": "ens1f2"
    }
  },
  "edpm_network_config_template": "templates/single_nic_vlans/single_nic_vlans.j2",
  "edpm_nodes_validation_validate_controllers_icmp": false,
  "edpm_nodes_validation_validate_gateway_icmp": false,
  "edpm_nova_compute_container_image": "{{ registry_url }}/openstack-nova-compute:{{ image_tag }}",
  "edpm_nova_libvirt_container_image": "{{ registry_url }}/openstack-nova-libvirt:{{ image_tag }}",
  "edpm_ovn_controller_agent_image": "{{ registry_url }}/openstack-ovn-controller:{{ image_tag }}",
  "edpm_ovn_dbs": [
    "192.168.71.31"
  ],
  "edpm_ovn_metadata_agent_DEFAULT_bind_host": "127.0.0.1",
  "edpm_ovn_metadata_agent_DEFAULT_transport_url": "rabbit://default_user_nI7eBK0oJQmz7Ku59bt:FaZfK8TPvqw5uLLMLkx_aSAKqrejrrV2@rabbitmq.openstack.svc:5672",
  "edpm_ovn_metadata_agent_default_bind_host": "127.0.0.1",
  "edpm_ovn_metadata_agent_image": "{{ registry_url }}/openstack-neutron-metadata-agent-ovn:{{ image_tag }}",
  "edpm_ovn_metadata_agent_metadata_agent_DEFAULT_metadata_proxy_shared_secret": 1234567842,
  "edpm_ovn_metadata_agent_metadata_agent_DEFAULT_nova_metadata_host": null,
  "edpm_ovn_metadata_agent_metadata_agent_default_metadata_proxy_shared_secret": 12345678,
  "edpm_ovn_metadata_agent_metadata_agent_ovn_ovn_sb_connection": "tcp:192.168.71.31:6642",
  "edpm_selinux_mode": "enforcing",
  "edpm_sshd_allowed_ranges": [
    "192.168.122.0/24"
  ],
  "edpm_sshd_configure_firewall": true,
  "edpm_tuned_isolated_cores": "2-47,50-95",
  "edpm_tuned_profile": "cpu-partitioning",
  "enable_debug": false,
  "gather_facts": false,
  "image_tag": "current-podified",
  "networks_lower": {
    "External": "external",
    "InternalApi": "internal_api",
    "Storage": "storage",
    "Tenant": "tenant"
  },
  "neutron_physical_bridge_name": "br-ex",
  "neutron_public_interface_name": "eth0",
  "registry_url": "quay.io/podified-antelope-centos9",
  "role_networks": [
    "InternalApi",
    "Storage",
    "Tenant"
  ],
  "service_net_map": {
    "nova_api_network": "internal_api",
    "nova_libvirt_network": "internal_api"
  }
}
vkhitrin commented 1 year ago

PR #301 resolves this issue.

I did notice that the configure-network service is recreated after 10 minutes, which may not be enough time for all bare metal nodes to finish booting.

oc get pods | grep configure-network
# Second execution, with a successful reboot operation
dataplane-deployment-configure-network-edpm-compute-lz9ml         0/1     Error       0          17m
# First execution, when the bare metal node had not finished booting after a successful Metal3 provisioning
dataplane-deployment-configure-network-edpm-compute-w2t4w         0/1     Error       0          28m
# Third execution, after the second execution timed out (and the reboot succeeded)
dataplane-deployment-configure-network-edpm-compute-wg895         1/1     Running     0          6m14s
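If the roughly 10-minute recreation window comes from a Kubernetes Job deadline, the relevant knobs are standard batch/v1 Job fields. A hedged sketch of where such a limit would live (field names are from the core Jobs API, not confirmed against this operator's code; the job name is inferred from the pod names above and the image is a placeholder):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: dataplane-deployment-configure-network-edpm-compute
spec:
  activeDeadlineSeconds: 600   # ~10 minutes; a slow bare metal boot can exceed this
  backoffLimit: 6              # Kubernetes default; caps how many times the pod is retried
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: configure-network
          image: example-ansible-runner:latest   # placeholder, not the operator's real image
```

Raising `activeDeadlineSeconds` (or the reboot task's own timeout) would give slow-booting hosts more headroom before the job is recreated.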