operate-first / support

This repo should serve as a central source for users to raise issues/questions/requests for Operate First.
GNU General Public License v3.0

Infra cluster fails to apply NodeNetworkConfigurationPolicies #632

Closed tumido closed 2 years ago

tumido commented 2 years ago

Originally reported by @larsks:

While we have previously configured the NESE storage VLAN on these nodes, it looks like the interface is currently having some problems:

$ k get nncp vlan-211-nese
NAME            STATUS
vlan-211-nese   FailedToConfigure

That will need to be resolved before the nodes will be able to access the ceph cluster.

I'm going to be out for a few days (back on Tuesday 9/6); @naved001 can probably help out with networking and honestly is probably more careful than I am about some things :smile:.

Originally posted by @larsks in https://github.com/operate-first/apps/pull/2286#pullrequestreview-1091209654

I've discovered there's probably more to it and the issue goes deeper:

$ ocg nncp | awk 'NR==1 || /.*Failed.*/'
NAME                                STATUS
crc-provisioning-vlan-ctrl-0        FailedToConfigure
moc-nfs-network                     FailedToConfigure
ocp-prod-provisioning-vlan-ctrl-1   FailedToConfigure
ocp-prod-provisioning-vlan-ctrl-2   FailedToConfigure
vlan-211-nese                       FailedToConfigure
zero-provisioning-vlan-ctrl-0       FailedToConfigure

$ ocg nnce | awk 'NR==1 || /.*Failed.*/'
NAME                                                                   STATUS
os-ctrl-0.crc-provisioning-vlan-ctrl-0                                 FailedToConfigure
os-ctrl-0.moc-nfs-network                                              FailedToConfigure
os-ctrl-0.zero-provisioning-vlan-ctrl-0                                FailedToConfigure
os-ctrl-1.moc-infra.massopen.cloud.vlan-211-nese                       FailedToConfigure
os-ctrl-1.ocp-prod-provisioning-vlan-ctrl-1                            FailedToConfigure
os-ctrl-2.ocp-prod-provisioning-vlan-ctrl-2                            FailedToConfigure
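
The enactment names above follow a `<node>.<policy>` pattern, so the failing rows can be split into node and policy to see which hosts are affected. A minimal sketch, assuming that policy names contain no dots (true for the names listed here, not guaranteed in general) and that `ocg` is a local alias for `oc get` (it is not defined in this thread). `failed_enactments` is a hypothetical helper name; it only reads the table from stdin.

```shell
# Hypothetical helper: read `oc get nnce` table output on stdin and print
# "<node> <policy>" for every row whose STATUS contains "Failed".
# Assumes the policy name is the last dot-separated segment of the NAME column.
failed_enactments() {
  awk 'NR > 1 && $2 ~ /Failed/ {
    n = split($1, parts, "[.]")          # split NAME on literal dots
    policy = parts[n]                    # last segment is the policy
    node = substr($1, 1, length($1) - length(policy) - 1)
    print node, policy
  }'
}
```

Usage would be along the lines of `oc get nnce | failed_enactments`. For the table above it should pair os-ctrl-1.moc-infra.massopen.cloud with vlan-211-nese, for example.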

The NESE policy fails on:

$ ocg nnce os-ctrl-1.moc-infra.massopen.cloud.vlan-211-nese -o jsonpath='{.status.conditions[?(@.type=="Failing")].message}'
...
2022-08-24 14:26:09,460 root         DEBUG    Async action: Add profile: 7a63793f-2e54-4cd5-bc74-0cc367b84037, iface:eno1.211, type:vlan started
2022-08-24 14:26:09,476 root         DEBUG    Async action: Add profile: 7a63793f-2e54-4cd5-bc74-0cc367b84037, iface:eno1.211, type:vlan finished
2022-08-24 14:26:09,477 root         DEBUG    Async action: Activate profile uuid:7a63793f-2e54-4cd5-bc74-0cc367b84037 iface:eno1.211 type: vlan started
2022-08-24 14:26:09,484 root         DEBUG    Action Activate profile uuid:7a63793f-2e54-4cd5-bc74-0cc367b84037 iface:eno1.211 type: vlan failed, trying again.
2022-08-24 14:26:09,492 root         DEBUG    Async action: Rollback to checkpoint /org/freedesktop/NetworkManager/Checkpoint/1 started
2022-08-24 14:26:09,507 root         DEBUG    Checkpoint /org/freedesktop/NetworkManager/Checkpoint/1 rollback executed
...
2022-08-24 14:26:09,555 root         DEBUG    Async action: Rollback to checkpoint /org/freedesktop/NetworkManager/Checkpoint/1 finished
Traceback (most recent call last):
  File "/usr/bin/nmstatectl", line 11, in <module>
    load_entry_point('nmstate==1.0.2', 'console_scripts', 'nmstatectl')()
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 73, in main
    return args.func(args)
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 326, in set
    return apply(args)
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 354, in apply
    args.save_to_disk,
  File "/usr/lib/python3.6/site-packages/nmstatectl/nmstatectl.py", line 407, in apply_state
    save_to_disk=save_to_disk,
  File "/usr/lib/python3.6/site-packages/libnmstate/netapplier.py", line 81, in apply
    _apply_ifaces_state(plugins, net_state, verify_change, save_to_disk)
  File "/usr/lib/python3.6/site-packages/libnmstate/netapplier.py", line 114, in _apply_ifaces_state
    plugin.apply_changes(net_state, save_to_disk)
  File "/usr/lib/python3.6/site-packages/libnmstate/nm/plugin.py", line 233, in apply_changes
    NmProfiles(self.context).apply_config(net_state, save_to_disk)
  File "/usr/lib/python3.6/site-packages/libnmstate/nm/profiles.py", line 91, in apply_config
    self._ctx.wait_all_finish()
  File "/usr/lib/python3.6/site-packages/libnmstate/nm/context.py", line 213, in wait_all_finish
    raise tmp_error
libnmstate.error.NmstateLibnmError: Activate profile uuid:7a63793f-2e54-4cd5-bc74-0cc367b84037 iface:eno1.211 type: vlan failed: error=nm-manager-error-quark: Failed to find a compatible device for this connection (3)
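
The Failing message is mostly debug log and traceback; the actionable part is the final libnmstate.error line. A minimal sketch for pulling just that line out of the message, assuming the error line always starts with `libnmstate.error.` as it does in the output above. `nmstate_error` is a hypothetical helper name.

```shell
# Hypothetical helper: read an NNCE "Failing" condition message on stdin
# and print only the last libnmstate error line from it.
nmstate_error() {
  grep -E '^libnmstate\.error\.[A-Za-z]+:' | tail -n 1
}
```

The jsonpath query shown earlier could be piped into it, e.g. `oc get nnce <name> -o jsonpath='...' | nmstate_error`.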

nerdalert commented 2 years ago

Hi @tumido @naved001 I got pointed at this issue to see if we could assist.

A few quick thoughts:

Happy to dig in further if you want to hit me up on CoreOS Slack/RH Gchat. Cheers!

tumido commented 2 years ago

@nerdalert Thank you for reaching out! :) That seems to be it!

First, I checked whether eno1 was unmanaged by NM. nmcli device showed it as:

# nmcli device            
DEVICE                                                    TYPE           STATE         CONNECTION
...
eno1                                                      ethernet       connected     Wired Connection
...

Then I checked whether eno1 was set to the bond type for whatever reason, but it was correctly typed as ethernet:

# nmcli conn
NAME              UUID                                  TYPE           DEVICE
Wired Connection  7fa7f094-b884-4dd7-bb09-d7c35251e059  ethernet       eno1

Then I tried applying the change from BZ 2017623, as you suggested...

FTR the change forces eno1 to be managed by NM/NMState:

  spec:
    desiredState:
      interfaces:
+       - name: eno1
+         state: up
+         type: ethernet
        - description: zero cluster provisioning network
          ipv4:
            dhcp: true
            enabled: true
          name: eno1.211
          state: up
          type: vlan
          vlan:
            base-iface: eno1
            id: 211

I didn't expect this to change anything. However, it created a new connection, which took the eno1 device over from the Wired Connection profile:

  # nmcli conn
  NAME              UUID                                  TYPE           DEVICE
  ...
+ eno1              75d0e44b-a6af-406b-903f-df3efaecccce  ethernet       eno1
+ Wired Connection  7fa7f094-b884-4dd7-bb09-d7c35251e059  ethernet       --
- Wired Connection  7fa7f094-b884-4dd7-bb09-d7c35251e059  ethernet       eno1
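
To check which profile currently holds a device without reading the full table, the terse nmcli output can be filtered. A minimal sketch, assuming input in the colon-separated form produced by `nmcli -t -f NAME,DEVICE connection show` and connection names that contain no colons (terse mode escapes those, which this sketch does not handle). `profile_for_device` is a hypothetical helper name.

```shell
# Hypothetical helper: given a device name as $1 and terse
# "NAME:DEVICE" lines on stdin, print the name of the connection
# profile that is active on that device (nothing if none is).
profile_for_device() {
  awk -F: -v dev="$1" '$NF == dev {
    sub(/:[^:]*$/, "")   # drop the trailing ":DEVICE" field
    print
  }'
}
```

For example, `nmcli -t -f NAME,DEVICE connection show | profile_for_device eno1` should print eno1 after the change above, and Wired Connection before it.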

This resulted in:

$ oc get nncp vlan-211-nese
NAME            STATUS
vlan-211-nese   SuccessfullyConfigured

I think the mess was caused by a name change in the connection, maybe during a cluster/node upgrade. Mind that this is a bare-metal instance that has been running OCP for about 2 years now, starting at OCP 4.6 and now at OCP 4.10. I think we faced some networking issues during one of the upgrades when nodes were renamed, so it may be related. It may have manifested now because we tried to apply a new network policy (just speculating).