Open uhthomas opened 1 month ago
This is not how talos works.
You can use a root container and run any command you like or use the talos api.
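A minimal sketch of the "root container" approach mentioned above, assuming a recent kubectl that supports the `sysadmin` debug profile and a node whose container runtime still works (later in the thread it is pointed out that this fails when the CRI itself is broken). The node name and the `run` dry-run helper are hypothetical and exist only for illustration; the commands are printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical node name; replace with a real Kubernetes node name.
NODE="${NODE:-my-node}"

# Illustrative helper: with DRY_RUN=1 (the default here) the command is
# printed instead of executed, so the sketch is safe to run anywhere.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Start a privileged debug pod on the node. In node debug pods the host
# filesystem is mounted under /host, so /var appears as /host/var.
run kubectl debug -n kube-system -it --image alpine --profile sysadmin "node/$NODE"
```

Inside such a pod, ordinary tools like `rm` or `vi` from the chosen image can operate on `/host/var`.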
@Syntax3rror404 What? /var is writable. The linked issue is impossible to recover from because the CRI is broken and containers can't be run.
As described, I had to boot an Ubuntu live ISO, decrypt and mount the partition, and remove the directory manually. It should be possible to do this with talosctl instead.
If the container state is broken you can wipe the EPHEMERAL fs.
talosctl -n my.node reset --system-labels-to-wipe EPHEMERAL --reboot
Yes, and delete everything in /var in the process? Applications like rook ceph store critical data in /var and so that is not an option.
No, because I hope you have your data on a separate disk, which is also best practice.
Sidero Labs recommends having separate disks (apart from the Talos install disk) to be used for storage. Source: https://www.talos.dev/v1.8/kubernetes-guides/configuration/storage/#:~:text=easy%20and%20automatic.-,Storage%20Clusters,when%20managing%20your%20own%20storage.
There will be no rm/vi, but we plan to have an ability for fine-grained wiping of volumes, including directories, like containerd state.
That would be good, thanks.
Could you help me understand the motivation behind not having basic tools like rm/mv/vi for the ephemeral partition? Sometimes it is really necessary and the premise of not needing to SSH into the machine because Talos has sufficient APIs becomes a bit less valuable.
https://www.siderolabs.com/blog/how-to-ssh-into-talos-linux/
No one should need to load up a live ISO of another OS just to copy, move, edit or remove some files on a writable partition.
In addition, assuming the CRI isn't broken, why support this through running arbitrary containers and not just add it to talosctl? If talosctl can read files on the filesystem and list directories, then it should be able to do a bit more too? I would be a bit disappointed if the talosctl ls and talosctl read commands were removed and replaced with documentation which suggests running a container with ls and cat in it.
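For reference, the read-only commands mentioned here can be sketched as follows. The node name, the paths, and the `run` dry-run helper are assumptions chosen for illustration; the commands are printed unless `DRY_RUN=0` is set, and exact flags may differ between Talos versions.

```shell
#!/bin/sh
# Hypothetical node address; replace with a real Talos node.
NODE="${NODE:-my.node}"

# Illustrative helper: with DRY_RUN=1 (the default here) the command is
# printed instead of executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# List a directory on the node in long format (read-only).
run talosctl -n "$NODE" ls -l /var/lib/containerd

# Print the contents of a file on the node (read-only).
run talosctl -n "$NODE" read /etc/os-release
```

Both commands only inspect the filesystem; there is no equivalent that removes, renames, or edits files, which is the gap this issue is about.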
No, because I hope you have your data on a separate disk, which is also best practice.
Sidero Labs recommends having separate disks (apart from the Talos install disk) to be used for storage.
I don't even want to reply to this as I am shocked at how this issue is being downplayed and straw manned.
The documentation you linked simply suggests using rook ceph... which I just said I am. Rook stores configuration data and mon data on the ephemeral partition and it is extremely disruptive to the cluster if deleted. Rather than simply removing a directory to fix the CRI, you think it's reasonable to wipe the ephemeral partition, reset all the OSDs, reimport them back into rook and spend potentially days or weeks recovering terabytes of data?
While the situation is indeed unfortunate, adding rm/touch/write is not something we are willing to do, since it opens up a lot of unwanted possibilities and basically ruins the immutable part of Talos. We will continue investigating the part Andrey mentioned:
we plan to have an ability for fine-grained wiping of volumes, including directories, like containerd state...
I think there is no need for any additional heat in this discussion.
Could you help me understand the motivation behind not having basic tools like rm/mv/vi for the ephemeral partition? Sometimes it is really necessary and the premise of not needing to SSH into the machine because Talos has sufficient APIs becomes a bit less valuable.
Because that opens a Pandora's box of unstructured access to the machine and goes against the Talos principles. Anything you store should be structured in some way, e.g. etcd database, containerd state, your Rook/Ceph state, etc.
Right now Talos doesn't offer that structure properly, and that's what we would like to address. I don't really want your Rook/Ceph data to be treated as one big EPHEMERAL bag; rather, I would give you control over it, including having it on a separate disk/partition, as a directory under /var, or anything else.
Yep, that's fair enough, I am behind the philosophy of Talos being an immutable and reproducible OS.
Unfortunately the ephemeral partition is not immutable and as such basic tools are required for situations like this. Even with plans to redesign the concept of the ephemeral partition, it may still be necessary to be a bit more granular than just wiping all of containerd, rook, etc.
I am just concerned that Talos may not have important tooling which is available on every other OS. Talos shouldn't feel like it gets in the way, but in this instance it definitely did and has done in the past. I'm afraid this sort of thing will push people away from unconventional OSes like Talos and back to just running Kubernetes on Ubuntu, which we don't want.
Would you be able to provide some insight into how you would like to see the ephemeral partition redesigned? Would it address all of these concerns?
You can see #8367 and #8016 for some overview.
I guess my perspective is that modifying /var is already supported, but the advice is to run a container. Why make it more difficult, and why not just add native tools for relatively common operations? Does that make sense? Friction does not make for a good user experience, and in this case it's not even possible to do some things because the CRI is broken.
I understand your concern, but we do not plan to add a free-form rm -rf API.
Kubernetes nodes should be replaceable/reinstallable. The precious data should be kept separate from other data. E.g. container state shouldn't be treated as persisted.
We plan to offer better ways to manage disk volumes as I posted above.
I absolutely agree with the philosophy and sentiment, but regardless of the changes made to persistence, there will be mutable data. The advice for any operations will be to run a container and mount the host file system. Why have this friction when it can just be added to Talos without much effort? Especially for weird cases like when the CRI is broken. Today the fix was to remove the entire directory, tomorrow it could be renaming a file or removing a specific file. Instead of just running talosctl rm /var/lib/containerd/some-file, a user has to find a suitable container and run it, or if that's not possible then use a live ISO? It's just such a bad user experience and I am not sure how else to express this. No matter what Talos does as a project, there will be mutable data, and having no tools to manage it is a huge hindrance.
We already said that we aren't going to add rm and other "mutate state"-like commands, because it goes completely against what Talos stands for. We are looking at ways to make managing mutable state easier, and we understand the need to solve specific problems. But adding a general command for a specific problem is overkill and not something we are willing to do.
There is no need to repeat the same points, as doing so is not going to change our decision and will only force us to lock this thread.
Feature Request
Description
My node won't boot because of #9496, and there is no way to fix it without mounting the volume in another OS and removing /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db. I shouldn't need to do that.
There have also been times where I've needed to edit a config file stored on /var and couldn't, so a vi-like editor would be nice too.