siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Talos v1.6.6 - Storage goes invalid after we apply ZFS extension #8820

Open Rammurthy5 opened 3 months ago

Rammurthy5 commented 3 months ago

Bug Report

Storage on worker nodes goes invalid after we apply the ZFS extension

Description

Storage on worker nodes goes invalid after we apply the ZFS extension. This is on the AWS EC2 platform, installed with the Cilium CNI and KubeSpan.

Logs

Environment

smira commented 3 months ago

Please provide a detailed report on what is going on exactly.

Rammurthy5 commented 3 months ago

@smira ,

I created a brand new Talos cluster on AWS EC2 with the ZFS and iSCSI extensions installed, plus the hugepages and NVMe kernel module configuration listed in the requirements for Mayastor. Everything was fine until I added ZFS. As soon as this extension is added to the workers, the zero-storage issue occurs.
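The thread does not show the actual machine config, so the following is only a sketch of what the setup described above might look like as a worker config patch. It assumes the commonly documented Mayastor prerequisites (1024 hugepages of 2 MiB, the `nvme_tcp` kernel module) and the `zfs` module for the ZFS extension; the function name `worker_patch` is hypothetical. System extensions themselves are added through the installer image and are not covered by this patch.

```python
import json


def worker_patch(hugepages: int = 1024) -> dict:
    """Build a Talos machine config patch (assumed values, not from the thread)."""
    return {
        "machine": {
            # Talos expects sysctl values as strings.
            "sysctls": {"vm.nr_hugepages": str(hugepages)},
            # Kernel modules to load at boot: NVMe over TCP for Mayastor,
            # and zfs for the ZFS extension.
            "kernel": {"modules": [{"name": "nvme_tcp"}, {"name": "zfs"}]},
        }
    }


if __name__ == "__main__":
    # JSON is valid YAML, so the output can be saved and used as a patch file.
    print(json.dumps(worker_patch(), indent=2))
```

Generating the patch with and without the `zfs` module entry, and applying each to a single worker, is one way to check whether the module alone triggers the problem.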

I tried again with just ZFS, without the iSCSI extension or the hugepages config. That cluster had openebs-mayastor disabled, but the ZFS pods were running. I was unable to create storage with ZFS, because the invalid storage issue kept occurring on the worker nodes.

brief logs:

"Failed to get the info of the filesystem with mountpoint","mountpoint":"/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs","err":"unable to find data in memory cache"}

"Image garbage collection failed once. Stats initialization may not have completed yet","err":"invalid capacity 0 on image filesystem"}

"error":"PLEG is not healthy: pleg has yet to be successful"}]}

smira commented 3 months ago

You might need to dig further. The ZFS extension itself works (we have integration tests), so something is going wrong further down the line. I'm not sure how ZFS affects containerd exactly, or what kind of configuration you're trying to set up.

Rammurthy5 commented 3 months ago

@smira could I ask for the steps you followed to get ZFS fully working, please? 🙇🏻 I'd follow the same steps and see if that helps.

smira commented 3 months ago

https://github.com/siderolabs/talos/blob/9d395b9de94f28fb9bf56bf795f916f783a847a0/internal/integration/api/extensions_qemu.go#L555-L713

Here is the code from the integration test. The ZFS extension is a community project, so you might need to reach out to the community for help here.

Rammurthy5 commented 2 months ago

Hi @smira, can I ask how we install zpool on Talos workers, please? I couldn't find anything in the Talos docs. I already have the extensions and the kernel module for ZFS in place, and the LocalPV provisioner and ZFS controller are running. I need to install zpool and create zpools.

smira commented 2 months ago

ZFS is a community extension. I don't have any specific examples at the moment besides what I posted above.
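For readers with the same question, here is a minimal sketch of one way to create a pool. It assumes the ZFS extension already provides the `zpool` binary on the host, and that it can be reached from a privileged pod by entering the host mount namespace with `nsenter`, which is roughly the approach the linked integration test appears to take. The pod name, the `alpine` image, the `kube-system` namespace, the node name, and the device path are all illustrative assumptions, not values from this thread.

```python
import json


def zpool_pod_manifest(pool: str, device: str, node: str) -> dict:
    """Build a one-shot privileged pod that runs `zpool create` on the host."""
    command = [
        # Enter the mount namespace of host PID 1 so the host's zpool is visible.
        "nsenter", "--mount=/proc/1/ns/mnt", "--",
        # Mount the pool under /var, which is writable on Talos.
        "zpool", "create", "-m", f"/var/{pool}", pool, device,
    ]
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"zpool-create-{pool}", "namespace": "kube-system"},
        "spec": {
            "nodeName": node,          # pin to the worker that owns the disk
            "hostPID": True,           # required for /proc/1 to be the host's init
            "restartPolicy": "Never",  # run once, do not retry on failure
            "containers": [{
                "name": "zpool-create",
                "image": "alpine",     # assumed: any image that ships nsenter
                "command": command,
                "securityContext": {"privileged": True},
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical pool name, disk, and node; replace with real values.
    manifest = zpool_pod_manifest("tank", "/dev/nvme1n1", "worker-1")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`. The target namespace must allow privileged pods, and `zpool create` will claim the whole device, so the device path should be double-checked on the node first.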