nmstate / kubernetes-nmstate

Declarative node network configuration driven through Kubernetes API.
GNU General Public License v2.0

Generate events for NMState Handler failures #1190

Closed andreaskaris closed 1 year ago

andreaskaris commented 1 year ago

On top of reporting status via nodenetworkconfigurationenactments, also generate events for failures. These events directly show up in the output of kubectl describe nodenetworkconfigurationpolicies and thus provide a standardized way for administrators to understand why a reconciliation failed.

Is this a BUG FIX or a FEATURE ?:

/kind enhancement

What this PR does / why we need it:

Administrators might not know that they can check their node's status with the nodenetworkconfigurationenactments resource. It's also a non-standard way of reporting the failure of a resource.

Special notes for your reviewer:

Consider this a suggestion / brainstorming for how issues with reconciling resources are reported to users. Would it make sense to use events? If so, should they contain the full error message, or refer to the enactment resource? I personally didn't know about the enactment resource until I started working on this PR, and I guess it might be the same for other people using kubernetes-nmstate. Reporting reconciliation issues through the enactment resource seems a bit non-standard to me, and the nodenetworkconfigurationpolicy at the moment only reports:

Message:                           x/y nodes failed to configure

With this change, we'd get something like this:

$ oc describe nodenetworkconfigurationpolicies
Name:         vlan
Namespace:    
Labels:       <none>
Annotations:  nmstate.io/webhook-mutating-timestamp: 1685383832230481361
API Version:  nmstate.io/v1
Kind:         NodeNetworkConfigurationPolicy
Metadata:
  Creation Timestamp:  2023-05-29T18:10:32Z
  Generation:          1
  Managed Fields:
    API Version:  nmstate.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:desiredState:
          .:
          f:interfaces:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2023-05-29T18:10:32Z
    API Version:  nmstate.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:conditions:
        f:lastUnavailableNodeCountUpdate:
    Manager:         manager
    Operation:       Update
    Subresource:     status
    Time:            2023-05-29T18:12:39Z
  Resource Version:  19056
  UID:               f0df04cb-773e-4b45-8ff5-8341a6db4dbb
Spec:
  Desired State:
    Interfaces:
      ipv4:
        Dhcp:     true
        Enabled:  true
      Name:       eth3.102
      State:      up
      Type:       vlan
      Vlan:
        Base - Iface:  eth3
        Id:            102
Status:
  Conditions:
    Last Heartbeat Time:               2023-05-29T18:12:39Z
    Last Transition Time:              2023-05-29T18:12:39Z
    Reason:                            FailedToConfigure
    Status:                            False
    Type:                              Available
    Last Heartbeat Time:               2023-05-29T18:12:39Z
    Last Transition Time:              2023-05-29T18:12:39Z
    Message:                           1/1 nodes failed to configure
    Reason:                            FailedToConfigure
    Status:                            True
    Type:                              Degraded
    Last Heartbeat Time:               2023-05-29T18:12:39Z
    Last Transition Time:              2023-05-29T18:12:39Z
    Reason:                            ConfigurationProgressing
    Status:                            False
    Type:                              Progressing
  Last Unavailable Node Count Update:  2023-05-29T18:12:39Z
Events:
  Type     Reason           Age    From                    Message
  ----     ------           ----   ----                    -------
  Warning  ReconcileFailed  2m14s  node02.nmstate-handler  error reconciling NodeNetworkConfigurationPolicy on node node02 at desired state apply: "",
 failed to execute nmstatectl set --no-commit --timeout 480: 'exit status 1' '' 'Using 'set' is deprecated, use 'apply' instead.
[2023-05-29T18:10:32Z INFO  nmstate::nispor::base_iface] Got unsupported interface type Other("IpTun"): tunl0, ignoring
[2023-05-29T18:10:32Z INFO  nmstate::nispor::show] Got unsupported interface tunl0 type Other("IpTun")
[2023-05-29T18:10:32Z INFO  nmstate::nm::show] Got unsupported interface type ip-tunnel: tunl0, ignoring
[2023-05-29T18:10:32Z INFO  nmstate::query_apply::net_state] Created checkpoint /org/freedesktop/NetworkManager/Checkpoint/8
[2023-05-29T18:10:32Z INFO  nmstate::ifaces::inter_ifaces] Ignoring interface cali48c2ba6b6dd type ethernet
[2023-05-29T18:10:32Z INFO  nmstate::nm::query_apply::profile] Creating connection UUID Some("f86eb1bb-198a-4c31-8a27-064e5b10e987"), ID Some("eth3.102"), type Some("vlan") name Some("eth3.102")
[2023-05-29T18:10:33Z INFO  nmstate::nm::query_apply::profile] Activating connection f86eb1bb-198a-4c31-8a27-064e5b10e987: eth3.102/vlan
[2023-05-29T18:10:33Z INFO  nmstate::nm::query_apply::profile] Got activation failure Bug: Manager(UnknownDevice): Failed to find a compatible device for this connection
[2023-05-29T18:10:33Z INFO  nmstate::nm::query_apply::profile] Will retry activation 2 seconds
[2023-05-29T18:10:35Z INFO  nmstate::nm::query_apply::profile] Activating connection f86eb1bb-198a-4c31-8a27-064e5b10e987: eth3.102/vlan
[2023-05-29T18:10:35Z INFO  nmstate::nm::query_apply::profile] Got activation failure Bug: Manager(UnknownDevice): Failed to find a compatible device for this connection
[2023-05-29T18:10:35Z INFO  nmstate::nm::query_apply::profile] Will retry activation 4 seconds
[2023-05-29T18:10:39Z INFO  nmstate::nm::query_apply::profile] Activating connection f86eb1bb-198a-4c31-8a27-064e5b10e987: eth3.102/vlan
[2023-05-29T18:10:39Z INFO  nmstate::nm::query_apply::profile] Got activation failure Bug: Manager(UnknownDevice): Failed to find a compatible device for this connection
[2023-05-29T18:10:39Z INFO  nmstate::nm::query_apply::profile] Will retry activation 8 seconds
[2023-05-29T18:10:47Z INFO  nmstate::nm::query_apply::profile] Activating connection f86eb1bb-198a-4c31-8a27-064e5b10e987: eth3.102/vlan
[2023-05-29T18:10:47Z INFO  nmstate::nm::query_apply::profile] Got activation failure Bug: Manager(UnknownDevice): Failed to find a compatible device for this connection
[2023-05-29T18:10:47Z INFO  nmstate::nm::query_apply::profile] Will retry activation 16 seconds
[2023-05-29T18:11:03Z INFO  nmstate::nm::query_apply::profile] Activating connection f86eb1bb-198a-4c31-8a27-064e5b10e987: eth3.102/vlan
[2023-05-29T18:11:03Z INFO  nmstate::nm::query_apply::profile] Got activation failure Bug: Manager(UnknownDevice): Failed to find a compatible device for this connection
[2023-05-29T18:11:03Z INFO  nmstate::nm::query_apply::profile] Will retry activation 32 seconds
[2023-05-29T18:11:35Z INFO  nmstate::nm::query_apply::profile] Activating connection f86eb1bb-198a-4c31-8a27-064e5b10e987: eth3.102/vlan
[2023-05-29T18:11:35Z INFO  nmstate::query_apply::net_state] Retrying on: Bug: Manager(UnknownDevice): Failed to find a compatible device for this connection
[2023-05-29T18:11:37Z INFO  nmstate::nm::query_apply::profile] Modifying connection UUID Some("f86eb1bb-198a-4c31-8a27-064e5b10e987"), ID Some("eth3.102"), type Some("vlan") name Some("eth3.102")
[2023-05-29T18:11:37Z INFO  nmstate::nm::query_apply::profile] Activating connection f86eb1bb-198a-4c31-8a27-064e5b10e987: eth3.102/vlan
[2023-05-29T18:11:37Z INFO  nmstate::nm::query_apply::profile] Got activation failure Bug: Manager(UnknownDevice): Failed to find a compatible device for this connection
[2023-05-29T18:11:37Z INFO  nmstate::nm::query_apply::profile] Will retry activation 2 seconds
[2023-05-29T18:11:39Z INFO  nmstate::nm::query_apply::profile] Activating connection f86eb1bb-198a-4c31-8a27-064e5b10e987: eth3.102/vlan
[2023-05-29T18:11:39Z INFO  nmstate::nm::query_apply::profile] Got activation failure Bug: Manager(UnknownDevice): Failed to find a compatible device for this connection
[2023-05-29T18:11:39Z INFO  nmstate::nm::query_apply::profile] Will retry activation 4 seconds
[2023-05-29T18:11:43Z INFO  nmstate::nm::query_apply::profile] Activating connection f86eb1bb-198a-4c31-8a27-064e5b10e987: eth3.102/vlan
[2023-05-29T18:11:43Z INFO  nmstate::nm::query_apply::profile] Got activation failure Bug: Manager(UnknownDevice): Failed to find a compatible device for this connection
[2023-05-29T18:11:43Z INFO  nmstate::nm::query_apply::profile] Will retry activation 8 seconds
[2023-05-29T18:11:51Z INFO  nmstate::nm::query_apply::profile] Activating connection f86eb1bb-198a-4c31-8a27-064e5b10e987: eth3.102/vlan
[2023-05-29T18:11:51Z INFO  nmstate::nm::query_apply::profile] Got activation failure Bug: Manager(UnknownDevice): Failed to find a compatible device for this connection
[2023-05-29T18:11:51Z INFO  nmstate::nm::query_apply::profile] Will retry activation 16 seconds
[2023-05-29T18:12:07Z INFO  nmstate::nm::query_apply::profile] Activating connection f86eb1bb-198a-4c31-8a27-064e5b10e987: eth3.102/vlan
[2023-05-29T18:12:07Z INFO  nmstate::nm::query_apply::profile] Got activation failure Bug: Manager(UnknownDevice): Failed to find a compatible device for this connection
[2023-05-29T18:12:07Z INFO  nmstate::nm::query_apply::profile] Will retry activation 32 seconds
[2023-05-29T18:12:39Z INFO  nmstate::nm::query_apply::profile] Activating connection f86eb1bb-198a-4c31-8a27-064e5b10e987: eth3.102/vlan
[2023-05-29T18:12:39Z INFO  nmstate::query_apply::net_state] Rollbacked to checkpoint /org/freedesktop/NetworkManager/Checkpoint/8
NmstateError: Bug: Manager(UnknownDevice): Failed to find a compatible device for this connection
'

Note: I wonder if that's too much text, especially when someone updates a configurationpolicy.

$ oc get events
9m42s       Warning   ReconcileFailed   nodenetworkconfigurationpolicy/vlan   error reconciling NodeNetworkConfigurationPolicy on node node02 at desired state apply: "",...

Release note:

generate events for NMState Handler failures

kubevirt-bot commented 1 year ago

Hi @andreaskaris. Thanks for your PR.

I'm waiting for a nmstate member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

qinqon commented 1 year ago

/ok-to-test

qinqon commented 1 year ago

/lgtm /approve

andreaskaris commented 1 year ago

/retest-required

qinqon commented 1 year ago

/retest

andreaskaris commented 1 year ago

/retest

qinqon commented 1 year ago

@andreaskaris are the unit test failures related?

andreaskaris commented 1 year ago

I thought it was due to flakes, but I'll take another look.

kubevirt-bot commented 1 year ago

@andreaskaris: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubernetes-nmstate-e2e-handler-k8s-future | eac5862d3fe9ea3774777520a4d8f1f64f269513 | link | false | `/test pull-kubernetes-nmstate-e2e-handler-k8s-future` |

qinqon commented 1 year ago

/lgtm /approve

kubevirt-bot commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: qinqon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/nmstate/kubernetes-nmstate/blob/main/OWNERS)~~ [qinqon]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.