siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

NBD kernel module for rook csi-rbdplugin not loadable #7687

Closed DreamingRaven closed 1 year ago

DreamingRaven commented 1 year ago

Bug Report

Description

The rook-ceph csi-rbdplugin requires the nbd kernel module. I added nbd to machine.kernel.modules (as {name: nbd}) and then upgraded the node to make sure the change was propagated. However, the module is still not loaded, which causes csi-rbdplugin to fail with a fatal error.

# worker node with lvm config
machine:
  kernel:
    modules:
    - name: dm_raid
    - name: dm_mod
    - name: md_mod
    - name: raid0
    - name: raid1
    - name: raid10
    - name: raid456
    - name: rbd # <--- loaded correctly
    - name: nbd # <--- rbdplugin required kernel module not loaded properly
    - name: ceph # <--- loaded correctly
  time:
    disabled: false # Indicates if the time service is disabled for the machine.
    servers:
    - time.cloudflare.com
    bootTimeout: 2m0s # Specifies the timeout when the node time is considered to be in sync unlocking the boot sequence.
  kubelet:
    extraArgs:
      rotate-server-certificates: true
  network:
    hostname: *************
    interfaces:
    - interface: bond0
      dhcp: true
      bond:
        mode: 802.3ad
        lacpRate: fast
        xmitHashPolicy: layer3+4
        miimon: 100
        updelay: 200
        downdelay: 200
        interfaces:
        - eth0
  install:
    diskSelector:
      type: nvme

Have I done something wrong with my invocation? Or is there a different process necessary for loading this specific kernel module?

Logs

Talos reports:

 user: warning: [2023-08-30T06:41:45.96772245Z]: [talos] controller failed {"component": "controller-runtime", "controller": "runtime.KernelModuleSpecController", "error": "error loading module \"nbd\": module not found"}

and in rook csi-rbdplugin:

Internal error occurred: Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)

W0822 12:00:49.715482  199208 rbd_attach.go:225] nbd modprobe failed (an error (exit status 1) occurred while running modprobe args: [nbd]): "modprobe: FATAL: Module nbd not found in directory /lib/modules/6.1.41-talos\n"

W0822 12:01:02.891895   29389 rbd_attach.go:225] nbd modprobe failed (an error (exit status 1) occurred while running modprobe args: [nbd]): "modprobe: FATAL: Module nbd not found in directory /lib/modules/6.1.41-talos\n"

W0830 05:51:56.793960    4423 rbd_attach.go:225] nbd modprobe failed (an error (exit status 1) occurred while running modprobe args: [nbd]): "modprobe: FATAL: Module nbd not found in directory /lib/modules/6.1.45-talos\n"

E0830 05:52:26.835118    4423 rbd_healer.go:172] list volumeAttachments failed, err: Get "https://10.96.0.1:443/apis/storage.k8s.io/v1/volumeattachments": dial tcp 10.96.0.1:443: i/o timeout

E0830 05:52:26.835130    4423 driver.go:193] healer had failures, err Get "https://10.96.0.1:443/apis/storage.k8s.io/v1/volumeattachments": dial tcp 10.96.0.1:443: i/o timeout
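To make the pattern in these warnings easier to see, here is a minimal sketch that pulls the module name and kernel release out of the modprobe failures. The function name `missing_modules` and the regular expression are illustrative assumptions, not part of rook or Talos; the sample input is two of the warning lines quoted above.

```python
import re

# Matches modprobe's "module not found" message as it appears in the
# csi-rbdplugin warnings. The kernel release ends at whitespace, a
# backslash, or a quote, because the log line carries a literal \n".
MODPROBE_MISSING = re.compile(
    r'Module (?P<module>\S+) not found in directory '
    r'/lib/modules/(?P<kernel>[^\s\\"]+)'
)


def missing_modules(log_text):
    """Return sorted, de-duplicated (module, kernel_release) pairs."""
    return sorted(
        {(m["module"], m["kernel"]) for m in MODPROBE_MISSING.finditer(log_text)}
    )


# Two of the warnings from this report, kept verbatim in a raw string.
SAMPLE_LOG = r'''
W0822 12:00:49.715482  199208 rbd_attach.go:225] nbd modprobe failed (an error (exit status 1) occurred while running modprobe args: [nbd]): "modprobe: FATAL: Module nbd not found in directory /lib/modules/6.1.41-talos\n"
W0830 05:51:56.793960    4423 rbd_attach.go:225] nbd modprobe failed (an error (exit status 1) occurred while running modprobe args: [nbd]): "modprobe: FATAL: Module nbd not found in directory /lib/modules/6.1.45-talos\n"
'''

if __name__ == "__main__":
    for module, kernel in missing_modules(SAMPLE_LOG):
        print(f"{module} missing from kernel {kernel}")
```

On these lines it reports nbd as missing under both 6.1.41-talos and 6.1.45-talos, so the module was absent both before and after the kernel version change visible in the logs.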

Environment

P.S. I created this separate issue from my side remarks in https://github.com/siderolabs/talos/issues/7677

frezbo commented 1 year ago

nbd is not shipped in the Talos kernel, so loading it won't work: https://github.com/siderolabs/pkgs/blob/main/kernel/build/config-amd64#L1972. Modules like this can be shipped as extensions if needed.
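For anyone who wants to check this for another module, the following is a rough sketch of reading a kernel config file for a given symbol. It assumes the upstream Linux Kconfig symbol for nbd, CONFIG_BLK_DEV_NBD, and the standard kernel config file syntax; the helper name `kconfig_state` and the sample excerpt are hypothetical and are not copied from the Talos config linked above.

```python
def kconfig_state(config_text, symbol):
    """Return the value of a Kconfig symbol ('y', 'm', ...) or 'not set'."""
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options are written as a comment in kernel config files.
        if line == f"# {symbol} is not set":
            return "not set"
        # Enabled options are written as SYMBOL=value.
        if line.startswith(symbol + "="):
            return line.split("=", 1)[1]
    # A symbol that does not appear at all is treated as disabled.
    return "not set"


# Hypothetical excerpt for illustration only, NOT the real Talos config.
SAMPLE_CONFIG = """\
CONFIG_BLK_DEV_LOOP=y
# CONFIG_BLK_DEV_NBD is not set
CONFIG_BLK_DEV_RBD=m
"""

if __name__ == "__main__":
    for sym in ("CONFIG_BLK_DEV_NBD", "CONFIG_BLK_DEV_RBD"):
        print(sym, "->", kconfig_state(SAMPLE_CONFIG, sym))
```

A result of "not set" means the module is neither built into the kernel nor built as a loadable module, so listing it under machine.kernel.modules cannot load it.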

DreamingRaven commented 1 year ago

Cheers, thanks @frezbo for the heads-up! I suspected that might be the case. I will take a look at extending it myself when I have an opportunity. Thanks for the quick response too!