okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

error running chcon -R -t var_run_t /run/mco-machine-os-content/os-content-X: chcon: cannot access ...': No such file or directory\n: exit status 1" #576

Closed msheldyakov closed 3 years ago

msheldyakov commented 3 years ago

Updating a UPI cluster from 4.7.0-0.okd-2021-03-21-094146 to 4.7.0-0.okd-2021-03-28-152009.

All machine config pools (worker and master) are in a degraded state. Node hw-control-okd-master-01.domain is reporting: "error running chcon -R -t var_run_t /run/mco-machine-os-content/os-content-681716591: chcon: cannot access '/run/mco-machine-os-content/os-content-681716591': No such file or directory\n: exit status 1"

On the node, I saw the os-content directory appear for only a few seconds:

[root@hw-control-okd-master-01 core]# ls -la /run/mco-machine-os-content/
total 0
drwxr-xr-x.  4 root root   80 Mar 29 17:41 .
drwxr-xr-x. 47 root root 1240 Mar 29 17:40 ..
drwxr-xr-x.  4 root root  100 Mar 24 11:22 bootstrap
drwxr-xr-x.  3 root root   60 Mar 24 11:22 extensions
[root@hw-control-okd-master-01 core]# ls -la /run/mco-machine-os-content/
total 0
drwxr-xr-x.  5 root root  100 Mar 29 17:46 .
drwxr-xr-x. 47 root root 1240 Mar 29 17:47 ..
drwxr-xr-x.  4 root root  100 Mar 24 11:22 bootstrap
drwxr-xr-x.  3 root root   60 Mar 24 11:22 extensions
drwx------.  2 root root   40 Mar 29 17:46 os-content-980518135
[root@hw-control-okd-master-01 core]# ls -la /run/mco-machine-os-content/
total 0
drwxr-xr-x.  4 root root   80 Mar 29 17:47 .
drwxr-xr-x. 47 root root 1240 Mar 29 17:47 ..
drwxr-xr-x.  4 root root  100 Mar 24 11:22 bootstrap
drwxr-xr-x.  3 root root   60 Mar 24 11:22 extensions
[root@hw-control-okd-master-01 core]#

Log from machine-config-daemon-jv8gs:

I0329 17:22:37.519626 2308 start.go:108] Version: machine-config-daemon-4.6.0-202006240615.p0-665-g2e8f00c4 (2e8f00c41266630862b66dd47730e52e94d794b5)
I0329 17:22:37.522735 2308 start.go:121] Calling chroot("/rootfs")
I0329 17:22:37.522894 2308 rpm-ostree.go:261] Running captured: rpm-ostree status --json
I0329 17:22:38.779075 2308 daemon.go:224] Booted osImageURL: quay.io/openshift/okd-content@sha256:1fc150fd7b47122e3ff344839cd9b2f1085024beb279f6c5eb5c3c35fbbc9215 ()
I0329 17:22:38.875045 2308 daemon.go:231] Installed Ignition binary version: 2.9.0
I0329 17:22:38.886274 2308 start.go:97] Copied self to /run/bin/machine-config-daemon on host
I0329 17:22:38.888803 2308 metrics.go:105] Registering Prometheus metrics
I0329 17:22:38.888859 2308 metrics.go:110] Starting metrics listener on 127.0.0.1:8797
I0329 17:22:38.889775 2308 update.go:1942] Starting to manage node: hw-control-okd-master-01.tutu.ru
I0329 17:22:38.892063 2308 rpm-ostree.go:261] Running captured: rpm-ostree status
I0329 17:22:38.895240 2308 daemon.go:679] Detected a new login session: New session 1 of user core.
I0329 17:22:38.895252 2308 daemon.go:680] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh
I0329 17:22:39.249394 2308 daemon.go:863] State: idle Deployments:

msheldyakov commented 3 years ago

Attempts to revert the update (no luck):

1) oc adm upgrade --force to 4.7.0-0.okd-2021-03-21-094146, plus a complete fresh reinstall of some nodes.
2) Manually disabling cluster-version-operator, setting RELEASE_VERSION on machine-config-operator, and restoring the machine-config-operator-images/machine-config-osimageurl configmaps from backup.

vrutkovs commented 3 years ago

Do you have a must-gather after a failed upgrade?

msheldyakov commented 3 years ago

Currently no. Possibly later on another cluster.

Bengrunt commented 3 years ago

Hello, we were hit by the same issue yesterday.

We were also upgrading a UPI cluster on bare metal (6 nodes) from 4.7.0-0.okd-2021-03-21-094146 to 4.7.0-0.okd-2021-03-28-152009.

I've created a must-gather archive, but I'm wondering if there's an easy way to "anonymize" hostnames and similar details? It contains about 775MB of data (65MB gzipped).

Thanks!

vrutkovs commented 3 years ago

W0329 17:26:34.389625 2308 run.go:44] nice failed: running nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-759348359 --registry-config /var/lib/kubelet/config.json quay.io/openshift/okd-content@sha256:13a16cecc46dea3c626fa231825aa1a4e158d95616632856c32c73b992a19a9a failed: error: unable to load --registry-config: error occurred while trying to unmarshal json

This is odd - your kubelet pull secret is invalid. Do you use "fake" pull secret?

msheldyakov commented 3 years ago

W0329 17:26:34.389625 2308 run.go:44] nice failed: running nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-759348359 --registry-config /var/lib/kubelet/config.json quay.io/openshift/okd-content@sha256:13a16cecc46dea3c626fa231825aa1a4e158d95616632856c32c73b992a19a9a failed: error: unable to load --registry-config: error occurred while trying to unmarshal json

This is odd - your kubelet pull secret is invalid. Do you use "fake" pull secret?

Yes, fake. pullSecret: '{"auths":{"fake":{"auth": "bar"}}}'

This cluster was installed as 4.6, went through several updates around 4.6 and then 4.7. Failed only at 4.7.0-0.okd-2021-03-28-152009
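
A minimal sketch of how such an `auth` value could be sanity-checked before an upgrade. It assumes the usual registry auth-file convention that `auth` holds the base64 encoding of `<id>:<passwd>`, and it relies on GNU coreutils `base64 -d` exiting non-zero on malformed input. The function name `auth_is_well_formed` is hypothetical and not part of `oc` or the MCO.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the given value base64-decodes
# to something shaped like "<id>:<passwd>".
auth_is_well_formed() {
  local decoded
  # base64 -d exits non-zero on malformed input (GNU coreutils behavior).
  decoded=$(printf '%s' "$1" | base64 -d 2>/dev/null) || return 1
  # Require a non-empty id before the colon separator.
  [[ "$decoded" == ?*:* ]]
}

# The value from the fake pull secret quoted above does not pass the check.
auth_is_well_formed "bar" || echo "auth value is not base64 of id:passwd"
```

The check only looks at the shape of the value; it does not contact any registry, so a passing result says nothing about whether the credentials are accepted.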

fortinj66 commented 3 years ago

If your cluster is still up, can you change /var/lib/kubelet/config.json on one of the SchedulingDisabled nodes to

See Below...

and reboot it?

msheldyakov commented 3 years ago

If your cluster is still up

No, it has already been restored from a backup.

fortinj66 commented 3 years ago

If you try again, try changing the config.json on your nodes as indicated above

lukeelten commented 3 years ago

I got the same error as in the initial post above. I have a UPI bare metal installation with 3 master and 6 worker nodes.

Both machine config pools are degraded with the following error:

Node master2 is reporting: "error running chcon -R -t var_run_t /run/mco-machine-os-content/os-content-093307771: chcon: cannot access '/run/mco-machine-os-content/os-content-093307771': No such file or directory\n: exit status 1"

I also found the following error in the machine-config pod of the worker node which the cluster tries to upgrade:

I0330 12:53:11.601667 1488282 run.go:18] Running: nice -- ionice -c 3 podman cp bc16533866ca77f9cd06bdf343f507c0581341b42ef768d450df1ab272ac22d3:/ /run/mco-machine-os-content/os-content-395424672
Error: 2 errors occurred:
    * error copying to host: error during bulk transfer for copier.request{Request:"PUT", Root:"/", preservedRoot:"/run/mco-machine-os-content", rootPrefix:"/run/mco-machine-os-content", Directory:"/", preservedDirectory:"/run/mco-machine-os-content", Globs:[]string{}, preservedGlobs:[]string{}, StatOptions:copier.StatOptions{CheckForArchives:false, Excludes:[]string(nil)}, GetOptions:copier.GetOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), Excludes:[]string(nil), ExpandArchives:false, ChownDirs:(*idtools.IDPair)(nil), ChmodDirs:(*os.FileMode)(nil), ChownFiles:(*idtools.IDPair)(nil), ChmodFiles:(*os.FileMode)(nil), StripSetuidBit:false, StripSetgidBit:false, StripStickyBit:false, StripXattrs:false, KeepDirectoryNames:false, Rename:map[string]string(nil)}, PutOptions:copier.PutOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), DefaultDirOwner:(*idtools.IDPair)(nil), DefaultDirMode:(*os.FileMode)(nil), ChownDirs:(*idtools.IDPair)(0xc000621400), ChmodDirs:(*os.FileMode)(nil), ChownFiles:(*idtools.IDPair)(0xc000621410), ChmodFiles:(*os.FileMode)(nil), StripXattrs:false, IgnoreXattrErrors:false, IgnoreDevices:false, NoOverwriteDirNonDir:false, Rename:map[string]string(nil)}, MkdirOptions:copier.MkdirOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), ChownNew:(*idtools.IDPair)(nil), ChmodNew:(*os.FileMode)(nil)}}: copier: put: error setting extended attributes on "/extensions/okd/NetworkManager-ovs-1.26.6-1.fc33.x86_64.rpm": error setting value of extended attribute "user.Zif.MdChecksum[1616584500]" on "/extensions/okd/NetworkManager-ovs-1.26.6-1.fc33.x86_64.rpm": operation not supported
    * error copying from container: error during bulk transfer for copier.request{Request:"GET", Root:"/", preservedRoot:"/var/lib/containers/storage/overlay/d3c885a10d5ad2ce003363349c76849d48eefc1a9b1c513e24c9b46e0afe5043/merged", rootPrefix:"/var/lib/containers/storage/overlay/d3c885a10d5ad2ce003363349c76849d48eefc1a9b1c513e24c9b46e0afe5043/merged", Directory:"/", preservedDirectory:"/var/lib/containers/storage/overlay/d3c885a10d5ad2ce003363349c76849d48eefc1a9b1c513e24c9b46e0afe5043/merged", Globs:[]string{"/"}, preservedGlobs:[]string{"/var/lib/containers/storage/overlay/d3c885a10d5ad2ce003363349c76849d48eefc1a9b1c513e24c9b46e0afe5043/merged/."}, StatOptions:copier.StatOptions{CheckForArchives:false, Excludes:[]string(nil)}, GetOptions:copier.GetOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), Excludes:[]string(nil), ExpandArchives:false, ChownDirs:(*idtools.IDPair)(0xc0005053f0), ChmodDirs:(*os.FileMode)(nil), ChownFiles:(*idtools.IDPair)(0xc000505400), ChmodFiles:(*os.FileMode)(nil), StripSetuidBit:false, StripSetgidBit:false, StripStickyBit:false, StripXattrs:false, KeepDirectoryNames:false, Rename:map[string]string(nil)}, PutOptions:copier.PutOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), DefaultDirOwner:(*idtools.IDPair)(nil), DefaultDirMode:(*os.FileMode)(nil), ChownDirs:(*idtools.IDPair)(nil), ChmodDirs:(*os.FileMode)(nil), ChownFiles:(*idtools.IDPair)(nil), ChmodFiles:(*os.FileMode)(nil), StripXattrs:false, IgnoreXattrErrors:false, IgnoreDevices:false, NoOverwriteDirNonDir:false, Rename:map[string]string(nil)}, MkdirOptions:copier.MkdirOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), ChownNew:(*idtools.IDPair)(nil), ChmodNew:(*os.FileMode)(nil)}}: copier: get: "/"("/"): error copying /extensions/okd/checkpolicy-3.1-3.fc33.x86_64.rpm: write bulk-writer: broken pipe
W0330 12:53:13.125688 1488282 run.go:44] nice failed: running nice -- ionice -c 3 podman cp bc16533866ca77f9cd06bdf343f507c0581341b42ef768d450df1ab272ac22d3:/ /run/mco-machine-os-content/os-content-395424672 failed: Error: 2 errors occurred:
    * error copying to host: error during bulk transfer for copier.request{Request:"PUT", Root:"/", preservedRoot:"/run/mco-machine-os-content", rootPrefix:"/run/mco-machine-os-content", Directory:"/", preservedDirectory:"/run/mco-machine-os-content", Globs:[]string{}, preservedGlobs:[]string{}, StatOptions:copier.StatOptions{CheckForArchives:false, Excludes:[]string(nil)}, GetOptions:copier.GetOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), Excludes:[]string(nil), ExpandArchives:false, ChownDirs:(*idtools.IDPair)(nil), ChmodDirs:(*os.FileMode)(nil), ChownFiles:(*idtools.IDPair)(nil), ChmodFiles:(*os.FileMode)(nil), StripSetuidBit:false, StripSetgidBit:false, StripStickyBit:false, StripXattrs:false, KeepDirectoryNames:false, Rename:map[string]string(nil)}, PutOptions:copier.PutOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), DefaultDirOwner:(*idtools.IDPair)(nil), DefaultDirMode:(*os.FileMode)(nil), ChownDirs:(*idtools.IDPair)(0xc000621400), ChmodDirs:(*os.FileMode)(nil), ChownFiles:(*idtools.IDPair)(0xc000621410), ChmodFiles:(*os.FileMode)(nil), StripXattrs:false, IgnoreXattrErrors:false, IgnoreDevices:false, NoOverwriteDirNonDir:false, Rename:map[string]string(nil)}, MkdirOptions:copier.MkdirOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), ChownNew:(*idtools.IDPair)(nil), ChmodNew:(*os.FileMode)(nil)}}: copier: put: error setting extended attributes on "/extensions/okd/NetworkManager-ovs-1.26.6-1.fc33.x86_64.rpm": error setting value of extended attribute "user.Zif.MdChecksum[1616584500]" on "/extensions/okd/NetworkManager-ovs-1.26.6-1.fc33.x86_64.rpm": operation not supported
    * error copying from container: error during bulk transfer for copier.request{Request:"GET", Root:"/", preservedRoot:"/var/lib/containers/storage/overlay/d3c885a10d5ad2ce003363349c76849d48eefc1a9b1c513e24c9b46e0afe5043/merged", rootPrefix:"/var/lib/containers/storage/overlay/d3c885a10d5ad2ce003363349c76849d48eefc1a9b1c513e24c9b46e0afe5043/merged", Directory:"/", preservedDirectory:"/var/lib/containers/storage/overlay/d3c885a10d5ad2ce003363349c76849d48eefc1a9b1c513e24c9b46e0afe5043/merged", Globs:[]string{"/"}, preservedGlobs:[]string{"/var/lib/containers/storage/overlay/d3c885a10d5ad2ce003363349c76849d48eefc1a9b1c513e24c9b46e0afe5043/merged/."}, StatOptions:copier.StatOptions{CheckForArchives:false, Excludes:[]string(nil)}, GetOptions:copier.GetOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), Excludes:[]string(nil), ExpandArchives:false, ChownDirs:(*idtools.IDPair)(0xc0005053f0), ChmodDirs:(*os.FileMode)(nil), ChownFiles:(*idtools.IDPair)(0xc000505400), ChmodFiles:(*os.FileMode)(nil), StripSetuidBit:false, StripSetgidBit:false, StripStickyBit:false, StripXattrs:false, KeepDirectoryNames:false, Rename:map[string]string(nil)}, PutOptions:copier.PutOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), DefaultDirOwner:(*idtools.IDPair)(nil), DefaultDirMode:(*os.FileMode)(nil), ChownDirs:(*idtools.IDPair)(nil), ChmodDirs:(*os.FileMode)(nil), ChownFiles:(*idtools.IDPair)(nil), ChmodFiles:(*os.FileMode)(nil), StripXattrs:false, IgnoreXattrErrors:false, IgnoreDevices:false, NoOverwriteDirNonDir:false, Rename:map[string]string(nil)}, MkdirOptions:copier.MkdirOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), ChownNew:(*idtools.IDPair)(nil), ChmodNew:(*os.FileMode)(nil)}}: copier: get: "/"("/"): error copying /extensions/okd/checkpolicy-3.1-3.fc33.x86_64.rpm: write bulk-writer: broken pipe
: exit status 125; retrying...

Raw log file: machine-config-daemon.txt

I have a must gather of the cluster which is 34MB compressed. I don't want to share it publicly but I am willing to share it with the OKD team members who want to analyze it.

fortinj66 commented 3 years ago

Actually, use {"auths":{"fake":{"auth":"aWQ6cGFzcwo="}}} instead

The issue is that the example fake pull secret in the documentation is bad and during the upgrade oc cannot parse it properly...

The "auth" section needs to be the base64 encoding of "<id>:<passwd>"

For example:

echo "id:pass" | base64
aWQ6cGFzcwo=

cat /var/lib/kubelet/config.json
{"auths":{"fake":{"auth":"aWQ6cGFzcwo="}}}
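
The same idea can be wrapped in a small function, shown here as a sketch rather than an official procedure. The name `make_fake_pull_secret` is hypothetical. Unlike the `echo` example above, it uses `printf`, so the encoded text has no trailing newline and the resulting base64 string differs from the one in the thread; the echo-based value is the one reported to work here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a fake pull secret whose "auth" value is
# the base64 encoding of "<id>:<passwd>".
make_fake_pull_secret() {
  local id="$1" passwd="$2" auth
  # printf does not append the newline that echo would add;
  # tr removes any line wrapping that base64 may insert.
  auth=$(printf '%s:%s' "$id" "$passwd" | base64 | tr -d '\n')
  printf '{"auths":{"fake":{"auth":"%s"}}}\n' "$auth"
}

# Prints the JSON to stdout; redirect it to a file to inspect it
# before copying it anywhere on a node.
make_fake_pull_secret id pass
```

Printing to stdout instead of writing /var/lib/kubelet/config.json directly keeps the sketch side-effect free; replacing the file on a node remains a manual step.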

alexanderniebuhr commented 3 years ago

The workaround by @fortinj66 works. We hit this issue going from 4.7.0-0.okd-2021-03-21-094146 to 4.7.0-0.okd-2021-04-11-124433.

I think having this break in a minor update is not a very good experience, and it worked perfectly before with {"auths":{"fake":{"auth": "bar"}}}.

Reamer commented 3 years ago

It is a minor release step for OKD, but a major release step for the underlying Podman version (2.x -> 3.0).

cgruver commented 3 years ago

FWIW, I am seeing the same symptoms as listed at the top of this issue with a clean UPI install.

This is using FCOS 33.20210328.3.0

The prominent error is:

Apr 25 12:21:55 okd4-bootstrap.dc2.clg.lab release-image-download.sh[1869]: Error: 2 errors occurred:
Apr 25 12:21:55 okd4-bootstrap.dc2.clg.lab release-image-download.sh[1869]:         * error copying to host: error during bulk transfer for copier.request{Request:"PUT", Root:"/", preservedRoot:"/run/mco-machine-os-content", rootPrefix:"/run/mco-machine-os-content", Directory:"/", preservedDirectory:"/run/mco-machine-os-content", Globs:[]string{}, preservedGlobs:[]string{}, StatOptions:copier.StatOptions{CheckForArchives:false, Excludes:[]string(nil)}, GetOptions:copier.GetOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), Excludes:[]string(nil), ExpandArchives:false, ChownDirs:(*idtools.IDPair)(nil), ChmodDirs:(*os.FileMode)(nil), ChownFiles:(*idtools.IDPair)(nil), ChmodFiles:(*os.FileMode)(nil), StripSetuidBit:false, StripSetgidBit:false, StripStickyBit:false, StripXattrs:false, KeepDirectoryNames:false, Rename:map[string]string(nil), NoDerefSymlinks:false, IgnoreUnreadable:false}, PutOptions:copier.PutOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), DefaultDirOwner:(*idtools.IDPair)(nil), DefaultDirMode:(*os.FileMode)(nil), ChownDirs:(*idtools.IDPair)(0xc000527be0), ChmodDirs:(*os.FileMode)(nil), ChownFiles:(*idtools.IDPair)(0xc000527bf0), ChmodFiles:(*os.FileMode)(nil), StripXattrs:false, IgnoreXattrErrors:false, IgnoreDevices:true, NoOverwriteDirNonDir:false, Rename:map[string]string{"/":"os-content-657866381"}}, MkdirOptions:copier.MkdirOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), ChownNew:(*idtools.IDPair)(nil), ChmodNew:(*os.FileMode)(nil)}}: copier: put: error setting extended attributes on "/os-content-657866381/extensions/okd/NetworkManager-ovs-1.26.6-1.fc33.x86_64.rpm": error setting value of extended attribute "user.Zif.MdChecksum[1616584500]" on "/os-content-657866381/extensions/okd/NetworkManager-ovs-1.26.6-1.fc33.x86_64.rpm": operation not supported
Apr 25 12:21:55 okd4-bootstrap.dc2.clg.lab release-image-download.sh[1869]:         * error copying from container: error during bulk transfer for copier.request{Request:"GET", Root:"/", preservedRoot:"/var/lib/containers/storage/overlay/e12d97851b8676322e815a03b18e91aced64065369c2349c7194651aa306bb69/merged", rootPrefix:"/var/lib/containers/storage/overlay/e12d97851b8676322e815a03b18e91aced64065369c2349c7194651aa306bb69/merged", Directory:"/", preservedDirectory:"/var/lib/containers/storage/overlay/e12d97851b8676322e815a03b18e91aced64065369c2349c7194651aa306bb69/merged", Globs:[]string{"/"}, preservedGlobs:[]string{"/var/lib/containers/storage/overlay/e12d97851b8676322e815a03b18e91aced64065369c2349c7194651aa306bb69/merged/."}, StatOptions:copier.StatOptions{CheckForArchives:false, Excludes:[]string(nil)}, GetOptions:copier.GetOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), Excludes:[]string{"dev", "proc", "sys"}, ExpandArchives:false, ChownDirs:(*idtools.IDPair)(0xc00010f570), ChmodDirs:(*os.FileMode)(nil), ChownFiles:(*idtools.IDPair)(0xc00010f580), ChmodFiles:(*os.FileMode)(nil), StripSetuidBit:false, StripSetgidBit:false, StripStickyBit:false, StripXattrs:false, KeepDirectoryNames:true, Rename:map[string]string(nil), NoDerefSymlinks:false, IgnoreUnreadable:false}, PutOptions:copier.PutOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), DefaultDirOwner:(*idtools.IDPair)(nil), DefaultDirMode:(*os.FileMode)(nil), ChownDirs:(*idtools.IDPair)(nil), ChmodDirs:(*os.FileMode)(nil), ChownFiles:(*idtools.IDPair)(nil), ChmodFiles:(*os.FileMode)(nil), StripXattrs:false, IgnoreXattrErrors:false, IgnoreDevices:false, NoOverwriteDirNonDir:false, Rename:map[string]string(nil)}, MkdirOptions:copier.MkdirOptions{UIDMap:[]idtools.IDMap(nil), GIDMap:[]idtools.IDMap(nil), ChownNew:(*idtools.IDPair)(nil), ChmodNew:(*os.FileMode)(nil)}}: copier: get: "/"("/"): error copying /extensions/okd/checkpolicy-3.1-3.fc33.x86_64.rpm: write bulk-writer: broken pipe
Apr 25 12:21:55 okd4-bootstrap.dc2.clg.lab release-image-download.sh[1869]: W0425 12:21:55.438527    1869 run.go:44] nice failed: running nice -- ionice -c 3 podman cp 9dc403fa1ced5073a4195fdb8d3884ab47c381e7e9fc2e2b7caa856986bac988:/ /run/mco-machine-os-content/os-content-657866381 failed: Error: 2 errors occurred:

I dropped my initial FCOS version back to 33.20210104.3.0 and now it seems to be working. I'll update when the install completes, or fails.

cgruver commented 3 years ago

Successful UPI install booting from version 33.20210104.3.0

vrutkovs commented 3 years ago

Should be resolved in 4.7.0-0.okd-2021-06-13-090745