rancher / elemental-operator

The Elemental operator is responsible for managing the OS versions and maintaining a machine inventory to assist with edge or baremetal installations.
Apache License 2.0

"Failed creating partition: exit status 1" - log missing essential information #50

Closed · kkaempf closed this issue 2 years ago

kkaempf commented 2 years ago

Output from journalctl -u elemental-operator after booting an ISO

-- Logs begin at Fri 2022-07-15 06:15:20 UTC, end at Fri 2022-07-15 06:18:02 UTC. --
Jul 15 06:15:40 localhost systemd[1]: Starting Elemental Operator Registration...
Jul 15 06:15:40 localhost elemental-operator[1600]: time="2022-07-15T06:15:40Z" level=info msg="Using TPMHash 1c9b6e73145c651a5cb73ee0eed13c2300effb030f4b7dbfefe5e7a63db96308 to dial wss://172.17.0.2/elemental/registration/b5msj66rsjm9twbnjnz2vsswpsr76w4hvl2mbn58jp9xn8cc94txsr"
Jul 15 06:15:40 localhost elemental-operator[1600]: Install environment:
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_INSTALL_CLOUD_INIT=/tmp/2415592138.yip
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_DEBUG=true
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_INSTALL_REBOOT=true
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_INSTALL_PASSWORD=<removed>
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_INSTALL_TARGET=/dev/sda
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_REGISTRATION_URL=https://172.17.0.2/elemental/registration/b5msj66rsjm9twbnjnz2vsswpsr76w4hvl2mbn58jp9xn8cc94txsr
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_REGISTRATION_CA_CERT=-----BEGIN CERTIFICATE-----
Jul 15 06:15:40 localhost elemental-operator[1600]: MIIBpzCCAU2gAwIBAgIBADAKBggqhkjOPQQDAjA7MRwwGgYDVQQKExNkeW5hbWlj
Jul 15 06:15:40 localhost elemental-operator[1600]: bGlzdGVuZXItb3JnMRswGQYDVQQDExJkeW5hbWljbGlzdGVuZXItY2EwHhcNMjIw
Jul 15 06:15:40 localhost elemental-operator[1600]: NzA1MDcwOTU2WhcNMzIwNzAyMDcwOTU2WjA7MRwwGgYDVQQKExNkeW5hbWljbGlz
Jul 15 06:15:40 localhost elemental-operator[1600]: dGVuZXItb3JnMRswGQYDVQQDExJkeW5hbWljbGlzdGVuZXItY2EwWTATBgcqhkjO
Jul 15 06:15:40 localhost elemental-operator[1600]: PQIBBggqhkjOPQMBBwNCAAS2wzKIZ5azTvfTMHc4+Zrmy9efM3ndy2ocgycwTCVB
Jul 15 06:15:40 localhost elemental-operator[1600]: rBdbRGH7t1wjGHtpaYh9CR2AflK34+CQ/nibHJLGzH8Go0IwQDAOBgNVHQ8BAf8E
Jul 15 06:15:40 localhost elemental-operator[1600]: BAMCAqQwDwYDVR0TAQH/BAUwAwEB/zAdBgNVHQ4EFgQUa11N1nJQbJ8YX6MtGjva
Jul 15 06:15:40 localhost elemental-operator[1600]: X4KvRIUwCgYIKoZIzj0EAwIDSAAwRQIgNHzToz74FitImN5T4uCTdmOfYOKVwX0+
Jul 15 06:15:40 localhost elemental-operator[1600]: WyDGp78hxx8CIQDFThQWNi1j1JJTOiIJKXpqjrr4ndzWzsQHqRjveFqdBg==
Jul 15 06:15:40 localhost elemental-operator[1600]: -----END CERTIFICATE-----
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_SYSTEM_AGENT_URL=https://172.17.0.2/k8s/clusters/local
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_SYSTEM_AGENT_TOKEN=eyJhbGciOiJSUzI1NiIsImtpZCI6Ik1BU1FFdkhiVzI0ZFFKbF9TeDFNZzVtVHdlRHZURTZ0YmY4MHJCLWtxcjgifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJmbGVldC1kZWZhdWx0Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6Im15LW5vZGUtdG9rZW4tazhxeGIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoibXktbm9kZSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImQ4MTIxOTA5LTJhNWEtNGNkZi05YjQzLWQyYTc1ZTgzZjZjZSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpmbGVldC1kZWZhdWx0Om15LW5vZGUifQ.esXZgq5-6vRPwjnXP3g9L-RzXQKrj0LPTc8oqQN5OUCYtr6PU7lbg-43zsN83_J-I89nmVvH2nwvD1BbucZ9B3U7aq_B2wPZ-_rRveUj4Xsbp-7DenXbS5rqXPjra8ieEhJ5Nr62sNPrPfDkm-lCtiu5olHFgroK2KWHfgrU3DizHqVEJfa2eJJ6ZA8hWMiYZ-JzAuDaoB2Q67FbAJhYRDtSXF7ei5KsfbOXb-FBdm94TDunN6qL2I-kgT06xR7_FWlvqLB5dpU7RFA0xN_aWF5dEj5UknZBenGLrgJRRdaQeDz_tHZmKSVBbviC9z7FoQde4ZHmrGaUplTzmN1F6g
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_SYSTEM_AGENT_SECRET_NAME=m-qemu-standard-pc-q35-ich9-2009-not-specified
Jul 15 06:15:40 localhost elemental-operator[1600]: ELEMENTAL_SYSTEM_AGENT_SECRET_NAMESPACE=fleet-default
Jul 15 06:15:40 localhost elemental-operator[1651]: INFO[2022-07-15T06:15:40Z] Starting elemental version 0.0.15
Jul 15 06:15:40 localhost elemental-operator[1651]: INFO[2022-07-15T06:15:40Z] Install called
Jul 15 06:15:40 localhost elemental-operator[1651]: INFO[2022-07-15T06:15:40Z] Running before-install hook
Jul 15 06:15:40 localhost elemental-operator[1651]: INFO[2022-07-15T06:15:40Z] Partitioning device...
Jul 15 06:15:40 localhost elemental-operator[1651]: ERRO[2022-07-15T06:15:40Z] Failed creating partition: exit status 1
Jul 15 06:15:40 localhost elemental-operator[1651]: ERRO[2022-07-15T06:15:40Z] Failed creating bios partition
Jul 15 06:15:40 localhost elemental-operator[1651]: Error: 1 error occurred:
Jul 15 06:15:40 localhost elemental-operator[1651]:         * exit status 1
Jul 15 06:15:40 localhost systemd[1]: elemental-operator.service: Main process exited, code=exited, status=1/FAILURE
Jul 15 06:15:40 localhost elemental-operator[1600]: time="2022-07-15T06:15:40Z" level=fatal msg="failed to write elemental config: exit status 1"
Jul 15 06:15:40 localhost systemd[1]: elemental-operator.service: Failed with result 'exit-code'.
Jul 15 06:15:40 localhost systemd[1]: Failed to start Elemental Operator Registration.

There are several essential pieces of information missing from this log, most notably the exact partitioning command that was run and its error output.

kkaempf commented 2 years ago

/proc/partitions shows that sda1 was created.

kkaempf commented 2 years ago

Manually creating another partition (by calling fdisk) is successful.

kkaempf commented 2 years ago

I could not reproduce the error :-/

kkaempf commented 2 years ago

This just happened again. A systemctl restart elemental-operator fixed it. So it's a timing issue.

kkaempf commented 2 years ago

udev needs several seconds to settle. 😉

davidcassany commented 2 years ago

I made several attempts and could not reproduce it; I'd like to actually see a failure with the output produced when the debug flag is enabled. #58 ensures the debug flag of elemental-cli is used if the install spec includes debug: true. I can see in the logs above that ELEMENTAL_DEBUG=true did not have any effect; #58 fixes that.

davidcassany commented 2 years ago

I am not fully convinced this is a timing issue between udev and parted, as this happened when creating the first partition of the disk, right after creating the partition table.

The command sequence should be equivalent to:
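
These calls are reconstructed from the Jul 19 debug log further down this thread:

parted --script --machine -- /dev/sda unit s mklabel gpt
parted --script --machine -- /dev/sda unit s print
parted --script --machine -- /dev/sda unit s mkpart bios  2048 4095 set 1 bios_grub on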

According to the logs above, I guess elemental-cli failed on the third call. Between the first and second calls we could add a udevadm settle execution (sketched below); however, I can't see how to verify that any relevant event is left unsynced. The second call did not fail, and I am reluctant to add a retry mechanism for failures of the mkpart parted command (the third call).
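
A minimal sketch of that proposed mitigation, with an illustrative timeout value:

parted --script --machine -- /dev/sda unit s mklabel gpt
udevadm settle --timeout=10   # proposed addition: wait for udev to finish processing the new label
parted --script --machine -- /dev/sda unit s print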

I have some doubts that the root cause of this is a timing issue. I'd appreciate it if anyone could reproduce it with debug messages enabled (requires #58) and share the logs before rancher/elemental-cli#286 gets merged.

kkaempf commented 2 years ago

Happened again, now with more detail:

> cat operator.log 
-- Logs begin at Tue 2022-07-19 06:43:24 UTC, end at Tue 2022-07-19 06:44:59 UTC. --
Jul 19 06:43:45 localhost systemd[1]: Starting Elemental Operator Registration...
Jul 19 06:43:45 localhost elemental-operator[1612]: time="2022-07-19T06:43:45Z" level=info msg="Operator version 0.2.1, commit 3530dc5, commit date git20220718"
Jul 19 06:43:45 localhost elemental-operator[1612]: time="2022-07-19T06:43:45Z" level=info msg="Using TPMHash 1c9b6e73145c651a5cb73ee0eed13c2300effb030f4b7dbfefe5e7a63db96308 to dial wss://172.17.0.2/elemental/registration/g9mscdw8f7m9q4l59rgldx459t9s2ths9gvtqx4z422mm2p6hrv462"
Jul 19 06:43:46 localhost elemental-operator[1656]: DEBU[2022-07-19T06:43:46Z] Starting elemental version 0.0.15 on commit 605346d
Jul 19 06:43:46 localhost elemental-operator[1656]: DEBU[2022-07-19T06:43:46Z] Full config loaded: &v1.RunConfig{
Jul 19 06:43:46 localhost elemental-operator[1656]:   Strict: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:   Reboot: true,
Jul 19 06:43:46 localhost elemental-operator[1656]:   PowerOff: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:   CloudInitPaths: nil,
Jul 19 06:43:46 localhost elemental-operator[1656]:   EjectCD: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:   Config: v1.Config{
Jul 19 06:43:46 localhost elemental-operator[1656]:     Logger: &v1.logrusWrapper{ // p0
Jul 19 06:43:46 localhost elemental-operator[1656]:       Logger: &logrus.Logger{
Jul 19 06:43:46 localhost elemental-operator[1656]:         Out: &os.File{},
Jul 19 06:43:46 localhost elemental-operator[1656]:         Hooks: logrus.LevelHooks{},
Jul 19 06:43:46 localhost elemental-operator[1656]:         Formatter: &logrus.TextFormatter{
Jul 19 06:43:46 localhost elemental-operator[1656]:           ForceColors: true,
Jul 19 06:43:46 localhost elemental-operator[1656]:           DisableColors: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:           ForceQuote: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:           DisableQuote: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:           EnvironmentOverrideColors: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:           DisableTimestamp: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:           FullTimestamp: true,
Jul 19 06:43:46 localhost elemental-operator[1656]:           TimestampFormat: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:           DisableSorting: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:           SortingFunc: ,
Jul 19 06:43:46 localhost elemental-operator[1656]:           DisableLevelTruncation: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:           PadLevelText: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:           QuoteEmptyFields: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:           FieldMap: logrus.FieldMap(nil), // p1
Jul 19 06:43:46 localhost elemental-operator[1656]:           CallerPrettyfier: ,
Jul 19 06:43:46 localhost elemental-operator[1656]:         },
Jul 19 06:43:46 localhost elemental-operator[1656]:         ReportCaller: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:         Level: 5,
Jul 19 06:43:46 localhost elemental-operator[1656]:         ExitFunc: os.Exit,
Jul 19 06:43:46 localhost elemental-operator[1656]:       },
Jul 19 06:43:46 localhost elemental-operator[1656]:     },
Jul 19 06:43:46 localhost elemental-operator[1656]:     Fs: &vfs.osfs{}, // p2
Jul 19 06:43:46 localhost elemental-operator[1656]:     Mounter: &mount.Mounter{},
Jul 19 06:43:46 localhost elemental-operator[1656]:     Runner: &v1.RealRunner{ // p3
Jul 19 06:43:46 localhost elemental-operator[1656]:       Logger: p0,
Jul 19 06:43:46 localhost elemental-operator[1656]:     },
Jul 19 06:43:46 localhost elemental-operator[1656]:     Syscall: &v1.RealSyscall{},
Jul 19 06:43:46 localhost elemental-operator[1656]:     CloudInitRunner: &cloudinit.YipCloudInitRunner{},
Jul 19 06:43:46 localhost elemental-operator[1656]:     Luet: &luet.Luet{
Jul 19 06:43:46 localhost elemental-operator[1656]:       VerifyImageUnpack: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:       TmpDir: "/tmp/elemental-1621924831",
Jul 19 06:43:46 localhost elemental-operator[1656]:     },
Jul 19 06:43:46 localhost elemental-operator[1656]:     Client: &http.Client{},
Jul 19 06:43:46 localhost elemental-operator[1656]:     Cosign: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:     Verify: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:     CosignPubKey: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:     LocalImage: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:     Repos: []v1.Repository{},
Jul 19 06:43:46 localhost elemental-operator[1656]:     Arch: "x86_64",
Jul 19 06:43:46 localhost elemental-operator[1656]:     SquashFsCompressionConfig: []string{
Jul 19 06:43:46 localhost elemental-operator[1656]:       "-comp",
Jul 19 06:43:46 localhost elemental-operator[1656]:       "xz",
Jul 19 06:43:46 localhost elemental-operator[1656]:       "-Xbcj",
Jul 19 06:43:46 localhost elemental-operator[1656]:       "x86",
Jul 19 06:43:46 localhost elemental-operator[1656]:     },
Jul 19 06:43:46 localhost elemental-operator[1656]:     SquashFsNoCompression: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:   },
Jul 19 06:43:46 localhost elemental-operator[1656]: }
Jul 19 06:43:46 localhost elemental-operator[1656]: DEBU[2022-07-19T06:43:46Z] Loaded install spec: &v1.InstallSpec{
Jul 19 06:43:46 localhost elemental-operator[1656]:   Target: "/dev/sda",
Jul 19 06:43:46 localhost elemental-operator[1656]:   Firmware: "bios",
Jul 19 06:43:46 localhost elemental-operator[1656]:   PartTable: "gpt",
Jul 19 06:43:46 localhost elemental-operator[1656]:   Partitions: v1.ElementalPartitions{
Jul 19 06:43:46 localhost elemental-operator[1656]:     BIOS: &v1.Partition{
Jul 19 06:43:46 localhost elemental-operator[1656]:       Name: "bios",
Jul 19 06:43:46 localhost elemental-operator[1656]:       FilesystemLabel: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Size: 1,
Jul 19 06:43:46 localhost elemental-operator[1656]:       FS: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Flags: []string{
Jul 19 06:43:46 localhost elemental-operator[1656]:         "bios_grub",
Jul 19 06:43:46 localhost elemental-operator[1656]:       },
Jul 19 06:43:46 localhost elemental-operator[1656]:       MountPoint: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Path: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Disk: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:     },
Jul 19 06:43:46 localhost elemental-operator[1656]:     EFI: nil,
Jul 19 06:43:46 localhost elemental-operator[1656]:     OEM: &v1.Partition{
Jul 19 06:43:46 localhost elemental-operator[1656]:       Name: "oem",
Jul 19 06:43:46 localhost elemental-operator[1656]:       FilesystemLabel: "COS_OEM",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Size: 64,
Jul 19 06:43:46 localhost elemental-operator[1656]:       FS: "ext4",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Flags: []string{}, // p0
Jul 19 06:43:46 localhost elemental-operator[1656]:       MountPoint: "/run/cos/oem",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Path: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Disk: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:     },
Jul 19 06:43:46 localhost elemental-operator[1656]:     Recovery: &v1.Partition{
Jul 19 06:43:46 localhost elemental-operator[1656]:       Name: "recovery",
Jul 19 06:43:46 localhost elemental-operator[1656]:       FilesystemLabel: "COS_RECOVERY",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Size: 8192,
Jul 19 06:43:46 localhost elemental-operator[1656]:       FS: "ext4",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Flags: p0,
Jul 19 06:43:46 localhost elemental-operator[1656]:       MountPoint: "/run/cos/recovery",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Path: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Disk: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:     },
Jul 19 06:43:46 localhost elemental-operator[1656]:     State: &v1.Partition{
Jul 19 06:43:46 localhost elemental-operator[1656]:       Name: "state",
Jul 19 06:43:46 localhost elemental-operator[1656]:       FilesystemLabel: "COS_STATE",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Size: 15360,
Jul 19 06:43:46 localhost elemental-operator[1656]:       FS: "ext4",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Flags: p0,
Jul 19 06:43:46 localhost elemental-operator[1656]:       MountPoint: "/run/cos/state",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Path: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Disk: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:     },
Jul 19 06:43:46 localhost elemental-operator[1656]:     Persistent: &v1.Partition{
Jul 19 06:43:46 localhost elemental-operator[1656]:       Name: "persistent",
Jul 19 06:43:46 localhost elemental-operator[1656]:       FilesystemLabel: "COS_PERSISTENT",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Size: 0,
Jul 19 06:43:46 localhost elemental-operator[1656]:       FS: "ext4",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Flags: p0,
Jul 19 06:43:46 localhost elemental-operator[1656]:       MountPoint: "/run/cos/persistent",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Path: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:       Disk: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:     },
Jul 19 06:43:46 localhost elemental-operator[1656]:   },
Jul 19 06:43:46 localhost elemental-operator[1656]:   NoFormat: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:   Force: false,
Jul 19 06:43:46 localhost elemental-operator[1656]:   CloudInit: "/tmp/767908735.yip",
Jul 19 06:43:46 localhost elemental-operator[1656]:   Iso: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:   GrubDefEntry: "cOS",
Jul 19 06:43:46 localhost elemental-operator[1656]:   Tty: "tty1",
Jul 19 06:43:46 localhost elemental-operator[1656]:   Active: v1.Image{
Jul 19 06:43:46 localhost elemental-operator[1656]:     File: "/run/cos/state/cOS/active.img",
Jul 19 06:43:46 localhost elemental-operator[1656]:     Label: "COS_ACTIVE",
Jul 19 06:43:46 localhost elemental-operator[1656]:     Size: 3072,
Jul 19 06:43:46 localhost elemental-operator[1656]:     FS: "ext2",
Jul 19 06:43:46 localhost elemental-operator[1656]:     Source: &v1.ImageSource{},
Jul 19 06:43:46 localhost elemental-operator[1656]:     MountPoint: "/run/cos/active",
Jul 19 06:43:46 localhost elemental-operator[1656]:     LoopDevice: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:   },
Jul 19 06:43:46 localhost elemental-operator[1656]:   Recovery: v1.Image{
Jul 19 06:43:46 localhost elemental-operator[1656]:     File: "/run/cos/recovery/cOS/recovery.img",
Jul 19 06:43:46 localhost elemental-operator[1656]:     Label: "COS_SYSTEM",
Jul 19 06:43:46 localhost elemental-operator[1656]:     Size: 0,
Jul 19 06:43:46 localhost elemental-operator[1656]:     FS: "ext2",
Jul 19 06:43:46 localhost elemental-operator[1656]:     Source: &v1.ImageSource{},
Jul 19 06:43:46 localhost elemental-operator[1656]:     MountPoint: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:     LoopDevice: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:   },
Jul 19 06:43:46 localhost elemental-operator[1656]:   Passive: v1.Image{
Jul 19 06:43:46 localhost elemental-operator[1656]:     File: "/run/cos/state/cOS/passive.img",
Jul 19 06:43:46 localhost elemental-operator[1656]:     Label: "COS_PASSIVE",
Jul 19 06:43:46 localhost elemental-operator[1656]:     Size: 0,
Jul 19 06:43:46 localhost elemental-operator[1656]:     FS: "ext2",
Jul 19 06:43:46 localhost elemental-operator[1656]:     Source: &v1.ImageSource{},
Jul 19 06:43:46 localhost elemental-operator[1656]:     MountPoint: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:     LoopDevice: "",
Jul 19 06:43:46 localhost elemental-operator[1656]:   },
Jul 19 06:43:46 localhost elemental-operator[1656]:   GrubConf: "/etc/cos/grub.cfg",
Jul 19 06:43:46 localhost elemental-operator[1656]: }
Jul 19 06:43:46 localhost elemental-operator[1656]: INFO[2022-07-19T06:43:46Z] Install called
Jul 19 06:43:46 localhost elemental-operator[1656]: INFO[2022-07-19T06:43:46Z] Running before-install hook
Jul 19 06:43:46 localhost elemental-operator[1656]: DEBU[2022-07-19T06:43:46Z] Running cmd: 'blkdeactivate --lvmoptions retry,wholevg --dmoptions force,retry --errors'
Jul 19 06:43:46 localhost elemental-operator[1656]: DEBU[2022-07-19T06:43:46Z] blkdeactivate command output: Deactivating block devices:
Jul 19 06:43:46 localhost elemental-operator[1656]: INFO[2022-07-19T06:43:46Z] Partitioning device...
Jul 19 06:43:46 localhost elemental-operator[1656]: DEBU[2022-07-19T06:43:46Z] Running cmd: 'parted --script --machine -- /dev/sda unit s mklabel gpt'
Jul 19 06:43:46 localhost elemental-operator[1656]: DEBU[2022-07-19T06:43:46Z] Running cmd: 'parted --script --machine -- /dev/sda unit s print'
Jul 19 06:43:46 localhost elemental-operator[1656]: DEBU[2022-07-19T06:43:46Z] Adding partition bios
Jul 19 06:43:46 localhost elemental-operator[1656]: DEBU[2022-07-19T06:43:46Z] Running cmd: 'parted --script --machine -- /dev/sda unit s mkpart bios  2048 4095 set 1 bios_grub on'
Jul 19 06:43:46 localhost elemental-operator[1656]: DEBU[2022-07-19T06:43:46Z] partitioner output: Error: Partition(s) 1 on /dev/sda have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use.  As a result, the old partition(s) will remain in use.  You should reboot now before making further changes.
Jul 19 06:43:46 localhost elemental-operator[1656]: ERRO[2022-07-19T06:43:46Z] Failed creating partition: exit status 1
Jul 19 06:43:46 localhost elemental-operator[1656]: ERRO[2022-07-19T06:43:46Z] Failed creating bios partition
Jul 19 06:43:46 localhost elemental-operator[1656]: Error: 1 error occurred:
Jul 19 06:43:46 localhost elemental-operator[1656]:         * exit status 1
Jul 19 06:43:46 localhost systemd[1]: elemental-operator.service: Main process exited, code=exited, status=1/FAILURE
Jul 19 06:43:46 localhost elemental-operator[1612]: time="2022-07-19T06:43:46Z" level=fatal msg="failed calling elemental client: exit status 1"
Jul 19 06:43:46 localhost systemd[1]: elemental-operator.service: Failed with result 'exit-code'.
Jul 19 06:43:46 localhost systemd[1]: Failed to start Elemental Operator Registration.

kkaempf commented 2 years ago

The partition did get created:

# cat /proc/partitions 
major minor  #blocks  name

   7        0     384448 loop0
   8        0   41943040 sda
   8        1       1024 sda1
  11        0     467968 sr0

davidcassany commented 2 years ago

Awesome, thanks for the report including the logs... 🤔 I actually managed to reproduce something similar; to me it feels like some sort of race condition during boot. In my case it fails on the parted --script --machine -- /dev/vda unit s mklabel gpt call with the same error message, and only if there was a previous installation on the disk. Still debugging, but getting closer to the root cause.

At least for the issue I experienced, I can assess that the changes in https://github.com/rancher/elemental-cli/pull/286 have no effect.

davidcassany commented 2 years ago

@kkaempf here is a request to update elemental-operator.service: https://build.opensuse.org/request/show/990105. From my tests, it looks like elemental-operator.service has some sort of bad interaction with cos-setup-network.service; they are both executed in parallel.

This issue raises the question of whether the service unit file should live in this repository instead of OBS, or, going further, whether we actually need a service unit file at all. Wouldn't it be more consistent to call elemental-operator register as part of one of our boot stages? More specifically, we could run it in a cloud-init network.after stage, so it would be the last thing executed by cos-setup-network.service. That way we concentrate our boot procedure within the cloud-init stages and their config files (see the sketch below).
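
A rough sketch of what that could look like as a yip cloud-init file; the file name, stage name, and command all match what the later logs in this thread show, while the exact yip schema should be double checked:

# /system/oem/99_elemental-operator.yaml
name: "Elemental operator bootstrap"
stages:
  network.after:
    - commands:
        - systemd-cat -t elemental elemental-operator register --debug /run/initramfs/live/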

kkaempf commented 2 years ago

I'm fine to drop the .service file at any time 😉

davidcassany commented 2 years ago

The solution provided in https://build.opensuse.org/request/show/990105 was incomplete, my bad... now the service does not start automatically. In my local tests I just changed the unit file, but I did not realize that the symlinks to multi-user.target were already created, hence new builds including https://build.opensuse.org/request/show/990105 do not start elemental-operator at boot. I am about to provide a new SR.

kkaempf commented 2 years ago

Ah, that explains it 😆

davidcassany commented 2 years ago

> Ah, that explains it 😆

My bad, sorry... 🙏 In my local env I was just placing the new unit file on top, and I did not realize that a spurious symlink had already been created during the image build while running systemctl enable elemental-operator.service. Here is https://build.opensuse.org/request/show/990272, which should fix the regression and hopefully the former issue.

Gonna add a card to drop this service and call elemental-operator based on our cloud-init config files. I think this approach is way more explicit and easier to maintain; however, it requires a few changes around /etc/motd, plus a proper way to store and read the elemental-operator logs (see the note below). That's why a separate, specific card makes sense.
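
For the logs part, one option, and what the later logs in this thread show being used, is to wrap the call in systemd-cat so its output lands in the journal under a fixed identifier:

systemd-cat -t elemental elemental-operator register --debug /run/initramfs/live/
journalctl -t elemental   # read the registration output back from the journal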

kkaempf commented 2 years ago

> Gonna add a card to drop this service and call elemental-operator based on our cloud-init config files

👍🏻

kkaempf commented 2 years ago

After doing more than half a dozen installs without problems, I'd consider this done.

davidcassany commented 2 years ago

After the change to running elemental-operator as part of cloud-init, this issue appeared again. Not very often, but still...

These are the relevant logs I get from time to time:


Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Done executing stage 'network'
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Running stage: network.after
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Executing /system/oem/01_elemental-rootfs.yaml
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Executing /system/oem/04_accounting.yaml
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Executing /system/oem/05_motd_and_autologin.yaml
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Executing /system/oem/05_network.yaml
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Executing /system/oem/06_recovery.yaml
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Executing /system/oem/07_live.yaml
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Executing /system/oem/08_boot_assessment.yaml
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Executing /system/oem/09_services.yaml
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Executing /system/oem/99_elemental-operator.yaml
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Applying 'Elemental operator bootstrap' for stage 'network.after'. Total stages: 1
Jul 22 15:24:17 localhost elemental[1558]: INFO[2022-07-22T15:24:17Z] Processing stage step ''. ( commands: 1, files: 0, ... )
Jul 22 15:24:18 localhost elemental[1680]: time="2022-07-22T15:24:18Z" level=info msg="Operator version 0.3.0, commit f2ab68c, commit date git20220722"
Jul 22 15:24:18 localhost elemental[1680]: time="2022-07-22T15:24:18Z" level=info msg="Using TPMHash f6c611b8f2310511b1f1337e3b930754f9d9cf2863771f178636f3ce3de91da5 to dial wss://ranchersuse/elemental/registration/blz4scw6dxwvmgdpgx8grctf8tlm4wbcp5ntctjf5tjgz668rjw6fk"
Jul 22 15:24:19 localhost elemental[1680]: time="2022-07-22T15:24:19Z" level=debug msg="Fetched configuration from manager cluster:\nelemental:\n  install:\n    device: /dev/vda\n    debug: true\n    reboot: true\n  registration:\n    url: https://ranchersuse/elemental/registration/blz4scw6dxwvmgdpgx8grctf8tlm4wbcp5ntctjf5tjgz668rjw6fk\n    cacert: |-\n      -----BEGIN CERTIFICATE-----\n      MIIBpzCCAU2gAwIBAgIBADAKBggqhkjOPQQDAjA7MRwwGgYDVQQKExNkeW5hbWlj\n      bGlzdGVuZXItb3JnMRswGQYDVQQDExJkeW5hbWljbGlzdGVuZXItY2EwHhcNMjIw\n      NzE0MDgzMTM3WhcNMzIwNzExMDgzMTM3WjA7MRwwGgYDVQQKExNkeW5hbWljbGlz\n      dGVuZXItb3JnMRswGQYDVQQDExJkeW5hbWljbGlzdGVuZXItY2EwWTATBgcqhkjO\n      PQIBBggqhkjOPQMBBwNCAARALzVe/qVHEhX4doODreuKydYKvLHRTJ+eG85jIimV\n      VN001MshCh3KusuYBuGH8GSY+Zz4ieJDICEnBOrdCVp2o0IwQDAOBgNVHQ8BAf8E\n      BAMCAqQwDwYDVR0TAQH/BAUwAwEB/zAdBgNVHQ4EFgQUa+46+mWZVkp/r/S9wZdo\n      KS7ZNsswCgYIKoZIzj0EAwIDSAAwRQIhAMKAV3P9mLAu0Esrcg0iDUq+Q46+H4/r\n      XXseh20+NnXHAiABBrj0T6CiLScUxxRr3m9ciP4bIHDKQPMJPRtOJIK9BA==\n      -----END CERTIFICATE-----\n    emulatetpm: false\n    emulatedtpmseed: 0\n    nosmbios: false\n    labels: {}\n  systemagent:\n    url: https://ranchersuse/k8s/clusters/local\n    token: eyJhbGciOiJSUzI1NiIsImtpZCI6IjBpQkxscFdnbGFhaVk3LVU4NmlkcjUxQkJCdFZPTXpibFdqWmtNVlVubDQifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJmbGVldC1kZWZhdWx0Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6InRlc3Qtbm9kZXMtdG9rZW4tNng1MnciLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoidGVzdC1ub2RlcyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjQ4ZmZiNTcyLTFmNTAtNDI3Ny04ZWFiLWMxN2E3NzRlYTQ1ZiIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpmbGVldC1kZWZhdWx0OnRlc3Qtbm9kZXMifQ.ZzehNpPW1AnWMQZLkquDjUdsry9jdoJIBzbR-IZ1gJ9czhoIjTpYSMmONx8m-Or7413XB0hHGwYKSrIYCrR6VTHpTpgLR5E9chvZ9E0tDIRkRnO-ERKt5QWOP5vwf-cb5QnMwL0MZNeCQRd2pIK90HaDCvUsVpbFFvnM4LL6rFMpQCR8HlB_5_DdSHwtMLfF55EAVeTopuiglQAX126CBfrLekT-7OmNofFT9IssCedzWGq0eDMPhmyVE46dP86ecN4ZNLEzLWls66GT65jtTq26M83x4OG1dUkbtzTh1SnH5WgIe-iEALNmLejKp2GnXULPtB88vbMQHKSBOL8e2Q\n    secretname: m-qemu-standard-pc-q35-ich9-2009-not-specified\n    secretnamespace: fleet-default\ncloud-config:\n  users:\n  - name: root\n    passwd: password\n  - name: dcassany\n    passwd: $6$z7WFaY8420.r91h3$io2HBMke9VNW6uodX/Ci4VotzWk89eZ8AvseLHekUSHPo55Qz3rv4jdrA9M6ONRAvjzGTJe1Wg.TC8YEDL0yU1\n\n\n"
Jul 22 15:24:19 localhost elemental[1680]: time="2022-07-22T15:24:19Z" level=debug msg="Computed environment variables:"
Jul 22 15:24:19 localhost elemental[1680]: time="2022-07-22T15:24:19Z" level=debug msg="ELEMENTAL_DEBUG=true\n"
Jul 22 15:24:19 localhost elemental[1680]: time="2022-07-22T15:24:19Z" level=debug msg="ELEMENTAL_REBOOT=true\n"
Jul 22 15:24:19 localhost elemental[1680]: time="2022-07-22T15:24:19Z" level=debug msg="ELEMENTAL_INSTALL_TARGET=/dev/vda\n"
Jul 22 15:24:19 localhost elemental[1687]: DEBU[2022-07-22T15:24:19Z] Starting elemental version 0.0.15 on commit 605346d
Jul 22 15:24:19 localhost elemental[1687]: DEBU[2022-07-22T15:24:19Z] Full config loaded: &v1.RunConfig{
Jul 22 15:24:19 localhost elemental[1687]:   Strict: false,
Jul 22 15:24:19 localhost elemental[1687]:   Reboot: true,
Jul 22 15:24:19 localhost elemental[1687]:   PowerOff: false,
Jul 22 15:24:19 localhost elemental[1687]:   CloudInitPaths: nil,
Jul 22 15:24:19 localhost elemental[1687]:   EjectCD: false,
Jul 22 15:24:19 localhost elemental[1687]:   Config: v1.Config{
Jul 22 15:24:19 localhost elemental[1687]:     Logger: &v1.logrusWrapper{ // p0
Jul 22 15:24:19 localhost elemental[1687]:       Logger: &logrus.Logger{
Jul 22 15:24:19 localhost elemental[1687]:         Out: &os.File{},
Jul 22 15:24:19 localhost elemental[1687]:         Hooks: logrus.LevelHooks{},
Jul 22 15:24:19 localhost elemental[1687]:         Formatter: &logrus.TextFormatter{
Jul 22 15:24:19 localhost elemental[1687]:           ForceColors: true,
Jul 22 15:24:19 localhost elemental[1687]:           DisableColors: false,
Jul 22 15:24:19 localhost elemental[1687]:           ForceQuote: false,
Jul 22 15:24:19 localhost elemental[1687]:           DisableQuote: false,
Jul 22 15:24:19 localhost elemental[1687]:           EnvironmentOverrideColors: false,
Jul 22 15:24:19 localhost elemental[1687]:           DisableTimestamp: false,
Jul 22 15:24:19 localhost elemental[1687]:           FullTimestamp: true,
Jul 22 15:24:19 localhost elemental[1687]:           TimestampFormat: "",
Jul 22 15:24:19 localhost elemental[1687]:           DisableSorting: false,
Jul 22 15:24:19 localhost elemental[1687]:           SortingFunc: ,
Jul 22 15:24:19 localhost elemental[1687]:           DisableLevelTruncation: false,
Jul 22 15:24:19 localhost elemental[1687]:           PadLevelText: false,
Jul 22 15:24:19 localhost elemental[1687]:           QuoteEmptyFields: false,
Jul 22 15:24:19 localhost elemental[1687]:           FieldMap: logrus.FieldMap(nil), // p1
Jul 22 15:24:19 localhost elemental[1687]:           CallerPrettyfier: ,
Jul 22 15:24:19 localhost elemental[1687]:         },
Jul 22 15:24:19 localhost elemental[1687]:         ReportCaller: false,
Jul 22 15:24:19 localhost elemental[1687]:         Level: 5,
Jul 22 15:24:19 localhost elemental[1687]:         ExitFunc: os.Exit,
Jul 22 15:24:19 localhost elemental[1687]:       },
Jul 22 15:24:19 localhost elemental[1687]:     },
Jul 22 15:24:19 localhost elemental[1687]:     Fs: &vfs.osfs{}, // p2
Jul 22 15:24:19 localhost elemental[1687]:     Mounter: &mount.Mounter{},
Jul 22 15:24:19 localhost elemental[1687]:     Runner: &v1.RealRunner{ // p3
Jul 22 15:24:19 localhost elemental[1687]:       Logger: p0,
Jul 22 15:24:19 localhost elemental[1687]:     },
Jul 22 15:24:19 localhost elemental[1687]:     Syscall: &v1.RealSyscall{},
Jul 22 15:24:19 localhost elemental[1687]:     CloudInitRunner: &cloudinit.YipCloudInitRunner{},
Jul 22 15:24:19 localhost elemental[1687]:     Luet: &luet.Luet{
Jul 22 15:24:19 localhost elemental[1687]:       VerifyImageUnpack: false,
Jul 22 15:24:19 localhost elemental[1687]:       TmpDir: "/tmp/elemental-4140224633",
Jul 22 15:24:19 localhost elemental[1687]:     },
Jul 22 15:24:19 localhost elemental[1687]:     Client: &http.Client{},
Jul 22 15:24:19 localhost elemental[1687]:     Cosign: false,
Jul 22 15:24:19 localhost elemental[1687]:     Verify: false,
Jul 22 15:24:19 localhost elemental[1687]:     CosignPubKey: "",
Jul 22 15:24:19 localhost elemental[1687]:     LocalImage: false,
Jul 22 15:24:19 localhost elemental[1687]:     Repos: []v1.Repository{},
Jul 22 15:24:19 localhost elemental[1687]:     Arch: "x86_64",
Jul 22 15:24:19 localhost elemental[1687]:     SquashFsCompressionConfig: []string{
Jul 22 15:24:19 localhost elemental[1687]:       "-comp",
Jul 22 15:24:19 localhost elemental[1687]:       "xz",
Jul 22 15:24:19 localhost elemental[1687]:       "-Xbcj",
Jul 22 15:24:19 localhost elemental[1687]:       "x86",
Jul 22 15:24:19 localhost elemental[1687]:     },
Jul 22 15:24:19 localhost elemental[1687]:     SquashFsNoCompression: false,
Jul 22 15:24:19 localhost elemental[1687]:   },
Jul 22 15:24:19 localhost elemental[1687]: }
Jul 22 15:24:19 localhost elemental[1687]: DEBU[2022-07-22T15:24:19Z] Loaded install spec: &v1.InstallSpec{
Jul 22 15:24:19 localhost elemental[1687]:   Target: "/dev/vda",
Jul 22 15:24:19 localhost elemental[1687]:   Firmware: "bios",
Jul 22 15:24:19 localhost elemental[1687]:   PartTable: "gpt",
Jul 22 15:24:19 localhost elemental[1687]:   Partitions: v1.ElementalPartitions{
Jul 22 15:24:19 localhost elemental[1687]:     BIOS: &v1.Partition{
Jul 22 15:24:19 localhost elemental[1687]:       Name: "bios",
Jul 22 15:24:19 localhost elemental[1687]:       FilesystemLabel: "",
Jul 22 15:24:19 localhost elemental[1687]:       Size: 1,
Jul 22 15:24:19 localhost elemental[1687]:       FS: "",
Jul 22 15:24:19 localhost elemental[1687]:       Flags: []string{
Jul 22 15:24:19 localhost elemental[1687]:         "bios_grub",
Jul 22 15:24:19 localhost elemental[1687]:       },
Jul 22 15:24:19 localhost elemental[1687]:       MountPoint: "",
Jul 22 15:24:19 localhost elemental[1687]:       Path: "",
Jul 22 15:24:19 localhost elemental[1687]:       Disk: "",
Jul 22 15:24:19 localhost elemental[1687]:     },
Jul 22 15:24:19 localhost elemental[1687]:     EFI: nil,
Jul 22 15:24:19 localhost elemental[1687]:     OEM: &v1.Partition{
Jul 22 15:24:19 localhost elemental[1687]:       Name: "oem",
Jul 22 15:24:19 localhost elemental[1687]:       FilesystemLabel: "COS_OEM",
Jul 22 15:24:19 localhost elemental[1687]:       Size: 64,
Jul 22 15:24:19 localhost elemental[1687]:       FS: "ext4",
Jul 22 15:24:19 localhost elemental[1687]:       Flags: []string{}, // p0
Jul 22 15:24:19 localhost elemental[1687]:       MountPoint: "/run/cos/oem",
Jul 22 15:24:19 localhost elemental[1687]:       Path: "",
Jul 22 15:24:19 localhost elemental[1687]:       Disk: "",
Jul 22 15:24:19 localhost elemental[1687]:     },
Jul 22 15:24:19 localhost elemental[1687]:     Recovery: &v1.Partition{
Jul 22 15:24:19 localhost elemental[1687]:       Name: "recovery",
Jul 22 15:24:19 localhost elemental[1687]:       FilesystemLabel: "COS_RECOVERY",
Jul 22 15:24:19 localhost elemental[1687]:       Size: 8192,
Jul 22 15:24:19 localhost elemental[1687]:       FS: "ext4",
Jul 22 15:24:19 localhost elemental[1687]:       Flags: p0,
Jul 22 15:24:19 localhost elemental[1687]:       MountPoint: "/run/cos/recovery",
Jul 22 15:24:19 localhost elemental[1687]:       Path: "",
Jul 22 15:24:19 localhost elemental[1687]:       Disk: "",
Jul 22 15:24:19 localhost elemental[1687]:     },
Jul 22 15:24:19 localhost elemental[1687]:     State: &v1.Partition{
Jul 22 15:24:19 localhost elemental[1687]:       Name: "state",
Jul 22 15:24:19 localhost elemental[1687]:       FilesystemLabel: "COS_STATE",
Jul 22 15:24:19 localhost elemental[1687]:       Size: 15360,
Jul 22 15:24:19 localhost elemental[1687]:       FS: "ext4",
Jul 22 15:24:19 localhost elemental[1687]:       Flags: p0,
Jul 22 15:24:19 localhost elemental[1687]:       MountPoint: "/run/cos/state",
Jul 22 15:24:19 localhost elemental[1687]:       Path: "",
Jul 22 15:24:19 localhost elemental[1687]:       Disk: "",
Jul 22 15:24:19 localhost elemental[1687]:     },
Jul 22 15:24:19 localhost elemental[1687]:     Persistent: &v1.Partition{
Jul 22 15:24:19 localhost elemental[1687]:       Name: "persistent",
Jul 22 15:24:19 localhost elemental[1687]:       FilesystemLabel: "COS_PERSISTENT",
Jul 22 15:24:19 localhost elemental[1687]:       Size: 0,
Jul 22 15:24:19 localhost elemental[1687]:       FS: "ext4",
Jul 22 15:24:19 localhost elemental[1687]:       Flags: p0,
Jul 22 15:24:19 localhost elemental[1687]:       MountPoint: "/run/cos/persistent",
Jul 22 15:24:19 localhost elemental[1687]:       Path: "",
Jul 22 15:24:19 localhost elemental[1687]:       Disk: "",
Jul 22 15:24:19 localhost elemental[1687]:     },
Jul 22 15:24:19 localhost elemental[1687]:   },
Jul 22 15:24:19 localhost elemental[1687]:   NoFormat: false,
Jul 22 15:24:19 localhost elemental[1687]:   Force: false,
Jul 22 15:24:19 localhost elemental[1687]:   CloudInit: []string{
Jul 22 15:24:19 localhost elemental[1687]:     "/tmp/3429870971.yaml",
Jul 22 15:24:19 localhost elemental[1687]:     "/tmp/825229707.yaml",
Jul 22 15:24:19 localhost elemental[1687]:   },
Jul 22 15:24:19 localhost elemental[1687]:   Iso: "",
Jul 22 15:24:19 localhost elemental[1687]:   GrubDefEntry: "cOS",
Jul 22 15:24:19 localhost elemental[1687]:   Tty: "tty1",
Jul 22 15:24:19 localhost elemental[1687]:   Active: v1.Image{
Jul 22 15:24:19 localhost elemental[1687]:     File: "/run/cos/state/cOS/active.img",
Jul 22 15:24:19 localhost elemental[1687]:     Label: "COS_ACTIVE",
Jul 22 15:24:19 localhost elemental[1687]:     Size: 3072,
Jul 22 15:24:19 localhost elemental[1687]:     FS: "ext2",
Jul 22 15:24:19 localhost elemental[1687]:     Source: &v1.ImageSource{},
Jul 22 15:24:19 localhost elemental[1687]:     MountPoint: "/run/cos/active",
Jul 22 15:24:19 localhost elemental[1687]:     LoopDevice: "",
Jul 22 15:24:19 localhost elemental[1687]:   },
Jul 22 15:24:19 localhost elemental[1687]:   Recovery: v1.Image{
Jul 22 15:24:19 localhost elemental[1687]:     File: "/run/cos/recovery/cOS/recovery.img",
Jul 22 15:24:19 localhost elemental[1687]:     Label: "COS_SYSTEM",
Jul 22 15:24:19 localhost elemental[1687]:     Size: 0,
Jul 22 15:24:19 localhost elemental[1687]:     FS: "ext2",
Jul 22 15:24:19 localhost elemental[1687]:     Source: &v1.ImageSource{},
Jul 22 15:24:19 localhost elemental[1687]:     MountPoint: "",
Jul 22 15:24:19 localhost elemental[1687]:     LoopDevice: "",
Jul 22 15:24:19 localhost elemental[1687]:   },
Jul 22 15:24:19 localhost elemental[1687]:   Passive: v1.Image{
Jul 22 15:24:19 localhost elemental[1687]:     File: "/run/cos/state/cOS/passive.img",
Jul 22 15:24:19 localhost elemental[1687]:     Label: "COS_PASSIVE",
Jul 22 15:24:19 localhost elemental[1687]:     Size: 0,
Jul 22 15:24:19 localhost elemental[1687]:     FS: "ext2",
Jul 22 15:24:19 localhost elemental[1687]:     Source: &v1.ImageSource{},
Jul 22 15:24:19 localhost elemental[1687]:     MountPoint: "",
Jul 22 15:24:19 localhost elemental[1687]:     LoopDevice: "",
Jul 22 15:24:19 localhost elemental[1687]:   },
Jul 22 15:24:19 localhost elemental[1687]:   GrubConf: "/etc/cos/grub.cfg",
Jul 22 15:24:19 localhost elemental[1687]: }
Jul 22 15:24:19 localhost elemental[1687]: INFO[2022-07-22T15:24:19Z] Install called
Jul 22 15:24:19 localhost elemental[1687]: INFO[2022-07-22T15:24:19Z] Running before-install hook
Jul 22 15:24:19 localhost elemental[1687]: DEBU[2022-07-22T15:24:19Z] Running cmd: 'blkdeactivate --lvmoptions retry,wholevg --dmoptions force,retry --errors'
Jul 22 15:24:19 localhost elemental[1687]: DEBU[2022-07-22T15:24:19Z] blkdeactivate command output: Deactivating block devices:
Jul 22 15:24:19 localhost elemental[1687]: INFO[2022-07-22T15:24:19Z] Partitioning device...
Jul 22 15:24:19 localhost elemental[1687]: DEBU[2022-07-22T15:24:19Z] Running cmd: 'parted --script --machine -- /dev/vda unit s mklabel gpt'
Jul 22 15:24:29 localhost elemental[1687]: ERRO[2022-07-22T15:24:29Z] Failed creating new partition table: Error: Partition(s) 1, 2, 3, 4, 5 on /dev/vda have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use.  As a result, the old partition(s) will remain in use.  You should reboot now before making further changes.
Jul 22 15:24:29 localhost elemental[1687]: Error: 1 error occurred:
Jul 22 15:24:29 localhost elemental[1687]:         * exit status 1
Jul 22 15:24:29 localhost elemental[1680]: time="2022-07-22T15:24:29Z" level=fatal msg="failed calling elemental client: exit status 1"
Jul 22 15:24:29 localhost elemental[1558]: ERRO[2022-07-22T15:24:29Z] : failed to run systemd-cat -t elemental elemental-operator register --debug /run/initramfs/live/: exit status 1
Jul 22 15:24:29 localhost elemental[1558]: ERRO[2022-07-22T15:24:29Z] 1 error occurred:
Jul 22 15:24:29 localhost elemental[1558]:         * failed to run systemd-cat -t elemental elemental-operator register --debug /run/initramfs/live/: exit status 1
Jul 22 15:24:29 localhost elemental[1558]:  
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Stage 'network.after'. Defined stages: 1. Errors: true
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Executing /oem/install-hook.yaml
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Done executing stage 'network.after'
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Running stage: network.before
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Executing BOOT_IMAGE=(cd)/boot/kernel.xz cdroot root=live:CDLABEL=COS_LIVE rd.live.dir=/ rd.live.squashimg=rootfs.squashfs console=tty1 console=ttyS0 rd.cos.disable
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Done executing stage 'network.before'
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Running stage: network
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Executing BOOT_IMAGE=(cd)/boot/kernel.xz cdroot root=live:CDLABEL=COS_LIVE rd.live.dir=/ rd.live.squashimg=rootfs.squashfs console=tty1 console=ttyS0 rd.cos.disable
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Done executing stage 'network'
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Running stage: network.after
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Executing BOOT_IMAGE=(cd)/boot/kernel.xz cdroot root=live:CDLABEL=COS_LIVE rd.live.dir=/ rd.live.squashimg=rootfs.squashfs console=tty1 console=ttyS0 rd.cos.disable
Jul 22 15:24:29 localhost elemental[1558]: INFO[2022-07-22T15:24:29Z] Done executing stage 'network.after'

IMHO this also raises something we should probably consider: what is the expected behavior if the installation fails? Certainly we need to understand and fix the former issue; however, installation failures will certainly happen once this goes up to scale. So I was wondering if some sort of reboot-and-retry mechanism is desirable for the registration and provisioning process.

kkaempf commented 2 years ago

A systemctl restart elemental-operator helped in all such cases I encountered. Going back to a systemd service (with restart-on-failure, if that is even possible 🤔) would be one option; see the sketch below.
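
A minimal sketch of such a unit, assuming the binary is installed as /usr/sbin/elemental-operator; paths and ordering here are illustrative, and note that systemd rejects Restart= values other than "no" for Type=oneshot services, so a retrying unit would need Type=exec or Type=simple:

[Unit]
Description=Elemental Operator Registration
Wants=network-online.target
After=network-online.target cos-setup-network.service

[Service]
# Restart= is rejected for Type=oneshot units, hence Type=exec here
Type=exec
ExecStart=/usr/sbin/elemental-operator register /run/initramfs/live/
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target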

davidcassany commented 2 years ago

I am pretty much convinced the issue is in the datasource plugin, which is executed in the network stage. More explicitly, I believe the problem is related to the cdrom provider, which attempts to read all devices and leaves the files open. In fact, right after booting and before elemental-operator kicks in, a call to lsof | grep vda results in something like:

elemental 1558                            root    9r      BLK              253,1   0t1048576      16294 /dev/vda1
elemental 1558                            root   10r      BLK              253,2  0t67108864      16295 /dev/vda2
elemental 1558                            root   11r      BLK              253,3 0x200000000      16296 /dev/vda3
elemental 1558                            root   12r      BLK              253,4 0x3c0000000      16297 /dev/vda4
elemental 1558                            root   13r      BLK              253,5 0x1bbd00000      16300 /dev/vda5
elemental 1558 1559 elemental             root    9r      BLK              253,1   0t1048576      16294 /dev/vda1
elemental 1558 1559 elemental             root   10r      BLK              253,2  0t67108864      16295 /dev/vda2
elemental 1558 1559 elemental             root   11r      BLK              253,3 0x200000000      16296 /dev/vda3
elemental 1558 1559 elemental             root   12r      BLK              253,4 0x3c0000000      16297 /dev/vda4
elemental 1558 1559 elemental             root   13r      BLK              253,5 0x1bbd00000      16300 /dev/vda5
elemental 1558 1560 elemental             root    9r      BLK              253,1   0t1048576      16294 /dev/vda1
elemental 1558 1560 elemental             root   10r      BLK              253,2  0t67108864      16295 /dev/vda2
elemental 1558 1560 elemental             root   11r      BLK              253,3 0x200000000      16296 /dev/vda3
elemental 1558 1560 elemental             root   12r      BLK              253,4 0x3c0000000      16297 /dev/vda4
elemental 1558 1560 elemental             root   13r      BLK              253,5 0x1bbd00000      16300 /dev/vda5
elemental 1558 1561 elemental             root    9r      BLK              253,1   0t1048576      16294 /dev/vda1
elemental 1558 1561 elemental             root   10r      BLK              253,2  0t67108864      16295 /dev/vda2
elemental 1558 1561 elemental             root   11r      BLK              253,3 0x200000000      16296 /dev/vda3
elemental 1558 1561 elemental             root   12r      BLK              253,4 0x3c0000000      16297 /dev/vda4
elemental 1558 1561 elemental             root   13r      BLK              253,5 0x1bbd00000      16300 /dev/vda5
elemental 1558 1562 elemental             root    9r      BLK              253,1   0t1048576      16294 /dev/vda1
elemental 1558 1562 elemental             root   10r      BLK              253,2  0t67108864      16295 /dev/vda2
elemental 1558 1562 elemental             root   11r      BLK              253,3 0x200000000      16296 /dev/vda3
elemental 1558 1562 elemental             root   12r      BLK              253,4 0x3c0000000      16297 /dev/vda4
elemental 1558 1562 elemental             root   13r      BLK              253,5 0x1bbd00000      16300 /dev/vda5
elemental 1558 1563 elemental             root    9r      BLK              253,1   0t1048576      16294 /dev/vda1
elemental 1558 1563 elemental             root   10r      BLK              253,2  0t67108864      16295 /dev/vda2
elemental 1558 1563 elemental             root   11r      BLK              253,3 0x200000000      16296 /dev/vda3
elemental 1558 1563 elemental             root   12r      BLK              253,4 0x3c0000000      16297 /dev/vda4
elemental 1558 1563 elemental             root   13r      BLK              253,5 0x1bbd00000      16300 /dev/vda5
elemental 1558 1564 elemental             root    9r      BLK              253,1   0t1048576      16294 /dev/vda1
elemental 1558 1564 elemental             root   10r      BLK              253,2  0t67108864      16295 /dev/vda2
elemental 1558 1564 elemental             root   11r      BLK              253,3 0x200000000      16296 /dev/vda3
elemental 1558 1564 elemental             root   12r      BLK              253,4 0x3c0000000      16297 /dev/vda4
elemental 1558 1564 elemental             root   13r      BLK              253,5 0x1bbd00000      16300 /dev/vda5

I added davidcassany/linuxkit#5 as an attempt to solve it; however, I can't understand why all devices seem to be opened by multiple threads... 🤔 In any case, the traces above disappear as soon as the cdrom datasource is removed, so we are certain that the cdrom provider is responsible for opening all these devices.
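
For illustration only, not the actual linuxkit patch, a minimal Go sketch of the general fix direction: close each block device as soon as probing is done with it, so no lingering read-only descriptor keeps the kernel from re-reading the partition table. All names here are hypothetical.

package main

import (
	"fmt"
	"os"
)

// probeDevice is a hypothetical stand-in for a datasource probe. The key
// point is the deferred Close: the descriptor is released as soon as
// probing ends, instead of living on for the whole process lifetime.
func probeDevice(dev string) (bool, error) {
	f, err := os.Open(dev) // read-only open, like the cdrom provider does
	if err != nil {
		return false, err
	}
	defer f.Close() // release the device right after probing

	header := make([]byte, 512)
	if _, err := f.Read(header); err != nil {
		return false, err
	}
	// ... inspect header for a config datasource signature here ...
	return false, nil
}

func main() {
	found, err := probeDevice("/dev/vda1")
	fmt.Println(found, err)
}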

davidcassany commented 2 years ago

Closing again, since there is already a workaround in place in rancher/elemental#212. Note this is not solved, just worked around. A proper fix will likely require a dive into the linuxkit and go-diskfs libraries to properly release devices after probing them.