kernelkit / infix

Linux :yellow_heart: NETCONF = Infix
https://kernelkit.org
GNU General Public License v2.0

confd: fails removing veth pair after multiple reconfigurations #658

Closed ahmkar94 closed 13 hours ago

ahmkar94 commented 1 day ago

Current Behavior

The NETCONF request below creates a bridge, br-X, with ethX (mapped to a specific port on the target) as its bridge port. The test first initializes the environment and attaches to the target, then configures the bridge and its port:

import infamy

with infamy.Test() as test:
    with test.step("Initialize"):
        env = infamy.Env()
        target = env.attach("target", "mgmt")

        # Translate the logical port "ethX" to the target's physical port name
        _, eth_X = env.ltop.xlate("target", "ethX")
        br_X = "br-X"

    with test.step("Configure bridge brX and associated interfaces"):
        # Create bridge br-X and attach ethX to it as a bridge port
        target.put_config_dict("ietf-interfaces", {
            "interfaces": {
                "interface": [
                    {
                        "name": br_X,
                        "type": "infix-if-type:bridge",
                        "enabled": True
                    },
                    {
                        "name": eth_X,
                        "type": "infix-if-type:ethernet",
                        "enabled": True,
                        "infix-interfaces:bridge-port": {
                            "bridge": br_X
                        }
                    }
                ]
            }
        })

The logs on the target show that confd crashes:

Sep 26 16:13:21 test-00-01-00 dagger[3255]: Aborting: /run/net/2/action/init/e3/10-ethtool-aneg.sh failed with exitcode 75
Sep 26 16:13:21 test-00-01-00 dagger[3255]: Abandoned generation 2
Sep 26 16:13:21 test-00-01-00 confd[3255]: Failed to apply interface configuration
Sep 26 16:13:21 test-00-01-00 confd[3255]: Oups, error detected in SR_EV_DONE
Sep 26 16:13:21 test-00-01-00 confd[3255]: failed sr_subscription_process_events(), ret:7
Sep 26 16:13:21 test-00-01-00 finit[1]: Stopping netopeer[3771], sending SIGTERM ...
Sep 26 16:13:21 test-00-01-00 finit[1]: Stopping statd[3716], sending SIGTERM ...
Sep 26 16:13:21 test-00-01-00 finit[1]: Stopping rousette[3801], sending SIGTERM ...
Sep 26 16:13:21 test-00-01-00 finit[1]: Service confd[3255] died, restarting in 2000 msec (1/10)
Sep 26 16:13:21 test-00-01-00 finit[1]: Starting confd[4462]

The issue stems from a type mismatch: although the request specifies a valid Ethernet type (ethernet) for the ethX interface, the target expects a different type, etherlike. Even so, a type mismatch should not cause the confd process to crash.

The crash points to a bug in confd's error handling.

Expected Behavior

Ideally, the system should reject the configuration with a meaningful error message rather than crash.
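A minimal sketch of the kind of up-front check that would let the system reject such a request with a readable message. It assumes, for illustration only, that the physical port `e3` (the port named in the dagger log) accepts only `infix-if-type:etherlike`; the `PHYSICAL_PORT_TYPES` table and `validate_interface_types()` are hypothetical names, not confd code.

```python
# Hypothetical table: physical port name -> the only interface type it accepts.
# Virtual interfaces (bridges, veth pairs, ...) are created on demand and are
# therefore not listed here.
PHYSICAL_PORT_TYPES = {
    "e3": "infix-if-type:etherlike",
}


def validate_interface_types(config: dict) -> list:
    """Return human-readable errors; an empty list means the types are valid."""
    errors = []
    for iface in config.get("interfaces", {}).get("interface", []):
        name = iface["name"]
        expected = PHYSICAL_PORT_TYPES.get(name)
        if expected is None:
            continue  # not a known physical port, nothing to check here
        wanted = iface.get("type")
        if wanted != expected:
            errors.append(
                f"{name}: type {wanted!r} does not match port type {expected!r}"
            )
    return errors


# The request from the issue, with ethX resolved to e3 (assumed mapping).
sample_config = {
    "interfaces": {
        "interface": [
            {"name": "br-X", "type": "infix-if-type:bridge", "enabled": True},
            {
                "name": "e3",
                "type": "infix-if-type:ethernet",
                "enabled": True,
                "infix-interfaces:bridge-port": {"bridge": "br-X"},
            },
        ]
    }
}

for err in validate_interface_types(sample_config):
    print(err)
```

Returning a list of messages, rather than failing on the first mismatch, lets the caller report every offending interface in one NETCONF error reply.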

Steps To Reproduce

No response

Additional information

No response

troglobit commented 17 hours ago

Nice finding, but not a blocker for v24.09.0. Also, confd doesn't technically crash: it fails to validate input in SR_EV_CHANGE and lets the invalid configuration propagate to SR_EV_DONE, where, in sysrepo terms, you're really not allowed to fail. When confd (or rather sysrepo-plugind) encounters an error in this state, it now calls exit() explicitly so that the system can try to recover by restarting everything.
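The difference between failing in SR_EV_CHANGE and failing in SR_EV_DONE can be illustrated with a plain-Python model of the two phases. This is a hypothetical sketch, not the sysrepo API: `commit()`, `apply_interfaces()` and the validators are made-up names, and the config is reduced to a single `{"e3": type}` entry.

```python
def commit(validate, apply, running, candidate):
    """Hypothetical two-phase commit mirroring sysrepo's event order.

    validate(config) -> list of error strings (the SR_EV_CHANGE phase).
    apply(config) performs the real system change (the SR_EV_DONE phase).
    Returns (new_running, errors).
    """
    # Phase 1, SR_EV_CHANGE: the subscriber may still veto the edit.
    errors = validate(candidate)
    if errors:
        return running, errors  # edit refused; running config untouched

    # Phase 2, SR_EV_DONE: the datastore already holds the new config, so a
    # failure here cannot be reported back as a rejection; it can only be
    # recovered from, e.g. by exiting and letting the service restart.
    apply(candidate)
    return candidate, []


def apply_interfaces(config):
    """Stand-in backend that cannot handle an unsupported interface type."""
    if config.get("e3") != "etherlike":
        raise RuntimeError("e3: backend cannot apply type %r" % config.get("e3"))


def strict_validate(config):
    """Catch the mismatch early, in the CHANGE phase."""
    if config.get("e3") != "etherlike":
        return [f"e3: unsupported type {config.get('e3')!r}"]
    return []


def no_validate(config):
    """Models the bug: nothing is checked before SR_EV_DONE."""
    return []


running = {}
bad = {"e3": "ethernet"}

# With validation, the bad edit is refused with a message and nothing breaks.
running, errs = commit(strict_validate, apply_interfaces, running, bad)
print(errs)

# Without validation, the error only surfaces in the DONE phase.
try:
    commit(no_validate, apply_interfaces, running, bad)
except RuntimeError as exc:
    print("DONE-phase failure:", exc)
```

In this model, moving the check into the CHANGE phase is what turns a process restart into an ordinary rejected edit.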

I suggest we change the title to: "confd fails input validation of interface type".

troglobit commented 16 hours ago

The core team has continued discussing this issue, and it has now become a blocker for v24.09.

Root cause: after a VETH pair has been added, reconfiguring the system multiple times makes it impossible to remove the VETH pair.