Closed dannysauer closed 4 years ago
/var/lib/etcd (or whatever the var is set to) should be set to 0700. :)
hm, isn't this what kubeadm is doing by default? https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/cmd/phases/init/etcd.go#L90
However, in a cleanup, that necessary os.MkdirAll call was removed
do you mean this line? https://github.com/kubernetes/kubernetes/commit/6bbed9fef00d9789abf041f3961c9543e7a4ff45#diff-0960dc0bb7c3e7113a0daa027856b8f9L467
it is still here https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/cmd/phases/init/etcd.go#L90
here is the version mapping between kubeadm(k8s) version and etcd: https://github.com/kubernetes/kubernetes/blob/3f585835d0a38462bb928454245ec01e834e53a0/cmd/kubeadm/app/constants/constants.go#L430-L437
so i guess this will happen if one tries to use kubeadm <1.14 with etcd server > 3.4.10? this is not recommended by the kubeadm team, but also 1.14 is out of support at this point (1.16 to go out of support soon too).
There were two mkdirs - the one at bootstrap time (which is still there) and a second one which ran when additional nodes joined. The referenced cleanup PR removed the second one. Which makes sense in the context of the observed behavior; additional control-plane nodes have the /var/lib/etcd directory created with mode 0755 (the kubernetes default), and, looking at some older clusters I have around, they have been doing that for quite some time. Assuming I identified the right PR, it's been happening since 1.14, and still happening on the 1.18 cluster I stood up this morning to test the new etcd. It would've also been happening before #1308, but those systems are solidly out of support by now.
> so i guess this will happen if one tries to use kubeadm <1.14 with etcd server > 3.4.10?

Possibly, but the greater concern is seeing it happen with >= 1.14 and etcd server >= 3.4.10.
ok, so originally a similar fix was added here in the function that creates the static pods for 1.14-pre: https://github.com/kubernetes/kubernetes/commit/836f413cf1096c9b020b20319d0767aee4f9b990#diff-c4574f3918f016aeb3b32f5d9cb62ed6
later that code was moved (by ereslibre): https://github.com/kubernetes/kubernetes/commit/981bf1930c73a7d95bbbd1dc9b3bfff122ad09a8
and then this refactor that you linked indeed omitted it. https://github.com/kubernetes/kubernetes/pull/73452 that PR was very noisy and had 140+ comments and we might have missed this.
so yes, we should not let the kubelet create the path with 755, and include the following:
```go
// pre-create the etcd data directory with the right permissions
if err := os.MkdirAll(cfg.Etcd.Local.DataDir, 0700); err != nil {
	return errors.Wrapf(err, "failed to create etcd directory %q", cfg.Etcd.Local.DataDir)
}
// if the directory already existed and the above call was a no-op, ensure it has the right permissions
// or otherwise etcd 3.4.10+ will fail:
// https://github.com/etcd-io/etcd/pull/11798
if err := os.Chmod(cfg.Etcd.Local.DataDir, 0700); err != nil {
	return errors.Wrapf(err, "failed to chmod etcd directory %q", cfg.Etcd.Local.DataDir)
}
```
at this line: https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/cmd/phases/join/controlplanejoin.go#L133
thanks for reporting it.
/help
/area security
/milestone v1.20
/priority important-longterm
i don't see this as a critical bug for 1.19 during code freeze.
given there is a chance an older version of kubeadm could be used to run etcd 3.4.10, we might as well backport the change to the support skew (1.19, 1.18, 1.17) after 1.19 releases.
@neolit123: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
one problem though is if the user is upgrading an existing cluster that has the 755 data dir. in that case we should also ensure to chmod it due to https://github.com/etcd-io/etcd/pull/11798
EDIT: ... but i guess this also has to be done on upgrade. removing "help" as this requires more changes. /remove-help
I also just ran into this problem trying to upgrade an existing etcd cluster from 3.4.9 to 3.4.11. The etcd node refuses to start with:
Aug 19 11:03:43 etcd0-dev etcd[26085]: cannot access data directory: directory "/var/lib/etcd","drwxr-xr-x" exist without desired file permissions
@frittenlab hi, what k8s/kubeadm version are you using?
looks like we might be picking the latest etcd in 1.19, which means we must patch kubeadm too: https://github.com/kubernetes/kubernetes/issues/93450 https://groups.google.com/d/msgid/kubernetes-sig-api-machinery/CAOXj30tEQzy0uSq9Oc4dX2UfSR%3DsCaqWwwUzymLrDxLirvHJ%2BQ%40mail.gmail.com?utm_medium=email&utm_source=footer
also:
/milestone v1.19
@neolit123 We are currently using kubeadm / k8s 1.18.8. We use an external etcd cluster which we provisioned ourselves on dedicated local-storage VMs. We added the etcd cluster to kubeadm via the kubeadm-config ConfigMap. I was able to upgrade the etcd cluster after changing the directory permissions of /var/lib/etcd to 700.
@frittenlab just to double check, i'm assuming you did not follow this guide: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/setup-ha-etcd-with-kubeadm/ IIUC, if using that guide (kubeadm init) the directory should have been created with 0700.
PR for master is here: https://github.com/kubernetes/kubernetes/pull/94102
@neolit123 We did follow this guide. We created the cluster over 2 years ago. I can't recall having to change the mode of the data dir then. I added the option to our Ansible role for the etcd cluster, and everything works fine now. This was not necessary with previous etcd versions.
ok, in the above PR i've added a chmod 700 in kubeadm init even if the directory already exists, just in case.
Thanks that is a good idea :)
> looks like we might be picking the latest etcd in 1.19, which means we must patch kubeadm too:
that's only updating the client library, not the server version
> that's only updating the client library, not the server version
ok, i still think that including the fix in 1.19 is a good idea to avoid the case where 1.19 kubeadm is used to deploy the newer (problematic) etcd.
> ok, i still think that including the fix in 1.19 is a good idea to avoid the case where 1.19 kubeadm is used to deploy the newer (problematic) etcd.
agree
actually, I think @jingyih is working on resolving the regression in etcd ... not sure if kubeadm should work around it or wait for an etcd fix. I think kubernetes manifests will stay on 3.4.9 until this is resolved
@jingyih in which etcd version are you planning to resolve this? would you recommend that kubeadm pre-creates this directory with 0700 for all OSes?
if a fix in etcd is coming, we can just add a note about this in the kubeadm troubleshooting guide and move the pending PR to 1.20. the PR itself can continue pre-creating the directory with 0700 (similar is done for /var/lib/kubelet) but possibly omit the chmod, as it overrides custom permissions by the user.
The etcd behavior simply illuminated a security regression in kubeadm, which is why #1308 was referenced in the report. There's no fix to etcd which undoes that regression (unless etcd changes the datadir permissions itself, which is even more sketchy than kubeadm updating the old incorrect perms).
Admittedly, world-readable perms at the top level of the datadir isn't a huge deal in practice, as etcd creates the member subdir mode 0700 and thus there shouldn't be any information disclosure risk. But leaving the current behavior is going to cause people following the CIS kubernetes benchmark and similar hardening guides with a bit of heartburn when they have to fix it on about every control plane node. Never mind how many people are doing so, given this has gone unnoticed for years. :facepalm:
> I think kubernetes manifests will stay on 3.4.9 until this is resolved
FWIW, staying on etcd 3.4.9 means retaining a security hole fixed in 3.4.10 - CVE-2020-15106. That's why I was updating an etcd package to begin with. :) Also a difficult-to-exploit security hole, but a published CVE nonetheless.
> The etcd behavior simply illuminated a security regression in kubeadm, which is why #1308 was referenced in the report. There's no fix to etcd which undoes that regression (unless etcd changes the datadir permissions itself, which is even more sketchy than kubeadm updating the old incorrect perms).
yes, that is why i think https://github.com/kubernetes/kubernetes/pull/94102 is still viable. but while enforcing 700 on existing folders makes sense, it also overrides customized permissions such as 0770.
> Admittedly, world-readable perms at the top level of the datadir isn't a huge deal in practice, as etcd creates the member subdir mode 0700 and thus there shouldn't be any information disclosure risk.
with the member sub-directory being protected with 0700, it is indeed not that much of a security issue.
> FWIW, staying on etcd 3.4.9 means retaining a security hole fixed in 3.4.10 - CVE-2020-15106. That's why I was updating an etcd package to begin with. :) Also a difficult-to-exploit security hole, but a published CVE nonetheless.
we rarely update the default etcd server version (constant) in prior kubeadm releases but it has happened before. yet, for 1.19 we might be out of time to release a new etcd server version at this point.
kubeadm makes it possible to deploy a custom etcd image/version, but in later kubeadm versions this comes with the trade-off that etcd upgrades will be skipped when doing "kubeadm upgrade".
> while enforcing 700 on existing folders makes sense, it also overrides customized permissions such as 0770.
Yeah, I'd either leave that out, or check for root/root:0755 and only change in that case. Surely no one's consciously choosing that. :)
> actually, I think @jingyih is working on resolving the regression in etcd ... not sure if kubeadm should work around it or wait for an etcd fix. I think kubernetes manifests will stay on 3.4.9 until this is resolved
I discussed with @spzala about this. We are thinking about providing a warning message instead of enforcing the file permission: https://github.com/etcd-io/etcd/pull/12242
updated the PR to only create the directory if it does not exist on init/join-control-plane, but not chmod it. https://github.com/kubernetes/kubernetes/pull/94102
/remove-priority important-soon
/priority backlog

lowering priority since the fix in etcd was applied and 1.20 will include 3.4.13+. (there is also discussion to backport the etcd version to 1.19)
https://github.com/kubernetes/kubernetes/pull/94102 merged. the newer etcd was backported to 1.19 too: https://github.com/kubernetes/kubernetes/pull/94536
closing, thanks.
Is this a BUG REPORT or FEATURE REQUEST?
BUG REPORT
Versions
kubeadm version (use `kubeadm version`): 1.14+
Environment: N/A
What happened?
With the release of etcd 3.4.10, the datadir permissions now need to be 0700 or etcd won't start. There was an issue (#1308) where perms were set on join before starting the etcd container as a security control, overriding the default behavior of creating a non-existent directory mode 0755. However, in a cleanup, that necessary os.MkdirAll call was removed. This was transparently ignored for several releases since etcd didn't complain, but with etcd-io/etcd#11798 (in 3.4.10), the new etcd cluster on the second node does not start.
I'm pretty sure this will break anyone on k8s 1.14 or newer who upgrades to etcd 3.4.10 or newer without first fixing the /var/lib/etcd perms.
What you expected to happen?
/var/lib/etcd (or whatever the var is set to) should be set to 0700. :)
How to reproduce it (as minimally and precisely as possible)?
Join a second master node, then run `ls -ld /var/lib/etcd` on the node, with an etcd 3.4.10 or newer runtime.
Anything else we need to know?
It's worth explicitly noting that the first control plane node added works fine. It's just the second and subsequent nodes which were handled in a separate location in the code which exhibit the problem.