siderolabs / sidero

Sidero Metal is a bare metal provisioning system with support for Kubernetes Cluster API.
https://www.sidero.dev
Mozilla Public License 2.0
410 stars 64 forks source link

Sidero panics when serving metadata #1076

Closed uhthomas closed 1 year ago

uhthomas commented 1 year ago

I've been struggling to provision some servers as they would get stuck when trying to fetch their metadata.

It looks like this is because Sidero panics when attempting to serve said metadata. This happens for all servers. None of them can load the metadata correctly.

I even tried starting from scratch and see the same error.

https://github.com/siderolabs/sidero/blob/18116bcabe77f1e7aa8d33b4d10cf2ff4a7cd18b/app/sidero-controller-manager/internal/metadata/metadata_server.go#LL297

2023/03/09 21:13:54 HTTP GET /boot.ipxe 127.0.0.1:52260
2023/03/09 21:13:54 received metadata request for uuid: 4c4c4544-0054-3510-8057-c7c04f424232
2023/03/09 21:13:54 http: panic serving 10.5.0.1:58686: runtime error: invalid memory address or nil pointer dereference
goroutine 34661 [running]:
net/http.(*conn).serve.func1()
        /toolchain/go/src/net/http/server.go:1850 +0xbf
panic({0x17643a0, 0x28cbbc0})
        /toolchain/go/src/runtime/panic.go:890 +0x262
github.com/siderolabs/sidero/app/sidero-controller-manager/internal/metadata.labelNodes({0xc000b78000, 0x1657, 0x1b00}, {0xc0003120f0, 0x24})
        /src/app/sidero-controller-manager/internal/metadata/metadata_server.go:297 +0x165
github.com/siderolabs/sidero/app/sidero-controller-manager/internal/metadata.(*metadataConfigs).FetchConfig(0xc0003ac260, {0x1c1eb20, 0xc00055e9a0}, 0xc0003d2700?)
        /src/app/sidero-controller-manager/internal/metadata/metadata_server.go:228 +0x5f2
net/http.HandlerFunc.ServeHTTP(0x0?, {0x1c1eb20?, 0xc00055e9a0?}, 0x203000?)
        /toolchain/go/src/net/http/server.go:2109 +0x2f
net/http.(*ServeMux).ServeHTTP(0xc00072c8e0?, {0x1c1eb20, 0xc00055e9a0}, 0xc0003d2700)
        /toolchain/go/src/net/http/server.go:2487 +0x149
main.main.func7.1({0x1c1eb20?, 0xc00055e9a0?}, 0xc000d0b2c0?)
        /src/app/sidero-controller-manager/main.go:307 +0xa5
net/http.HandlerFunc.ServeHTTP(0x203000?, {0x1c1eb20?, 0xc00055e9a0?}, 0xc00055e9a0?)
        /toolchain/go/src/net/http/server.go:2109 +0x2f
golang.org/x/net/http2/h2c.h2cHandler.ServeHTTP({{0x1c0b700?, 0xc0003be1c8?}, 0xc0006a6000?}, {0x1c1eb20, 0xc00055e9a0}, 0xc0003d2700)
        /.cache/mod/golang.org/x/net@v0.4.0/http2/h2c/h2c.go:125 +0x59b
net/http.serverHandler.ServeHTTP({0xc000d0b260?}, {0x1c1eb20, 0xc00055e9a0}, 0xc0003d2700)
        /toolchain/go/src/net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc0003e9360, {0x1c1fcb8, 0xc0006703f0})
        /toolchain/go/src/net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
        /toolchain/go/src/net/http/server.go:3102 +0x4db
2023/03/09 21:14:23 received metadata request for uuid: 4c4c4544-0047-4410-8034-b9c04f575631
2023/03/09 21:14:23 http: panic serving 10.5.0.1:48912: runtime error: invalid memory address or nil pointer dereference
goroutine 34769 [running]:
net/http.(*conn).serve.func1()
        /toolchain/go/src/net/http/server.go:1850 +0xbf
panic({0x17643a0, 0x28cbbc0})
        /toolchain/go/src/runtime/panic.go:890 +0x262
github.com/siderolabs/sidero/app/sidero-controller-manager/internal/metadata.labelNodes({0xc000cd2000, 0x1657, 0x1b00}, {0xc0007f4330, 0x24})
        /src/app/sidero-controller-manager/internal/metadata/metadata_server.go:297 +0x165
github.com/siderolabs/sidero/app/sidero-controller-manager/internal/metadata.(*metadataConfigs).FetchConfig(0xc0003ac260, {0x1c1eb20, 0xc0002da0e0}, 0xc000af4d00?)
        /src/app/sidero-controller-manager/internal/metadata/metadata_server.go:228 +0x5f2
net/http.HandlerFunc.ServeHTTP(0x0?, {0x1c1eb20?, 0xc0002da0e0?}, 0x203000?)
        /toolchain/go/src/net/http/server.go:2109 +0x2f
net/http.(*ServeMux).ServeHTTP(0xc000aa48e0?, {0x1c1eb20, 0xc0002da0e0}, 0xc000af4d00)
        /toolchain/go/src/net/http/server.go:2487 +0x149
main.main.func7.1({0x1c1eb20?, 0xc0002da0e0?}, 0xc0007fdbc0?)
        /src/app/sidero-controller-manager/main.go:307 +0xa5
net/http.HandlerFunc.ServeHTTP(0x203000?, {0x1c1eb20?, 0xc0002da0e0?}, 0xc0002da0e0?)
        /toolchain/go/src/net/http/server.go:2109 +0x2f
golang.org/x/net/http2/h2c.h2cHandler.ServeHTTP({{0x1c0b700?, 0xc0003be1c8?}, 0xc0006a6000?}, {0x1c1eb20, 0xc0002da0e0}, 0xc000af4d00)
        /.cache/mod/golang.org/x/net@v0.4.0/http2/h2c/h2c.go:125 +0x59b
net/http.serverHandler.ServeHTTP({0xc0007fdb90?}, {0x1c1eb20, 0xc0002da0e0}, 0xc000af4d00)
        /toolchain/go/src/net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc0002b0280, {0x1c1fcb8, 0xc0006703f0})
        /toolchain/go/src/net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
        /toolchain/go/src/net/http/server.go:3102 +0x4db
uhthomas commented 1 year ago

I think I know what's causing this.

I have a patch, which looks like this:

configPatches: [{
    op:   "add"
    path: "/machine"
    value: {
        install: diskSelector: wwid: "naa.50026b7381886726"
        network: interfaces: [{
            addresses: ["10.0.0.102"]
            bond: interfaces: ["eth4", "eth5"]
        }]
    }
}]

This almost certainly overrides basically all the machine defaults. As such config.MachineConfig.MachineKubelet.KubeletExtraArgs will inevitably panic.

frezbo commented 1 year ago

the patch is being used wrong, if you add a /machine path, it'll remove everything under /machine and only add in what you passed, please use the path as /machine/install/diskSelector, also regarding the network interfaces, there's no parent being provided, the parent interface name is required.

uhthomas commented 1 year ago

Yeah, I understand.

Is it okay to leave this open to track a feature which either makes it easier to know what's happening or reject a configuration which completely removes the base configuration? I imagine it's likely never intentional to do this.

frezbo commented 1 year ago

reject a configuration which completely removes the base configuration

that is tricky, since it's possible for a user to provide the whole machine keys with user provided one

uhthomas commented 1 year ago

I don't doubt that. I'd like to suggest validation be extended further as seemingly simple and valid patches don't work either. To be clear, I understand why, but it's just not very intuitive. These patch errors surface when the machine tries to provision with a config and isn't ideal.

{
        op:    "replace"
    path:  "/machine/install/diskSelector/wwid"
    value: "abc"
}
sidero-controller-manager-7b559896bd-tgvkh manager 2023/03/13 20:46:08 failure applying rfc6902 patches to machine config: replace operation does not apply: doc is missing path: /machine/install/diskSelector/wwid: missing value
{
        op:    "add"
    path:  "/machine/install/diskSelector/wwid"
    value: "abc"
}
sidero-controller-manager-7b559896bd-tgvkh manager 2023/03/13 22:02:03 failure applying rfc6902 patches to machine config: add operation does not apply: doc is missing path: "/machine/install/diskSelector/wwid": missing value
uhthomas commented 1 year ago

Or even:

{
    "op": "add",
    "path": "/machine/network/interfaces/-",
    "value": {
        "interface": "bond0",
        "addresses": [
            "10.0.0.105"
        ]
    }
}
sidero-controller-manager-7b559896bd-tgvkh manager 2023/03/13 23:12:29 failure applying rfc6902 patches to machine config: add operation does not apply: doc is missing path: "/machine/network/interfaces/-": missing value

Am I missing something? The docs don't seem to offer much help.

https://www.talos.dev/v1.3/talos-guides/configuration/patching/

smira commented 1 year ago

yes, this path doesn't exist in the default config - talosctl gen config provides you an easier way to test it. machine.network: {} is the default config state