openshift / installer

Install an OpenShift 4.x cluster
https://try.openshift.com
Apache License 2.0

both bootstrap and master VM paused after creation #816

Closed gnufied closed 5 years ago

gnufied commented 5 years ago

Version

─▪ ./bin/openshift-install version
./bin/openshift-install v0.5.0-master-36-gb4f5ceb6bfde8d3dc0e29f708e0494488ea37ee0
Terraform v0.11.8

Your version of Terraform is out of date! The latest version is 0.11.10. You can update by downloading from www.terraform.io/downloads.html

Platform (aws|libvirt|openstack):

libvirt (On Fedora29 machine with nested VM).

What happened?

I believe I followed the instructions, and this setup has worked for me in the past, but as soon as the installer creates the bootstrap and master VMs, they are paused.

─▪ virsh -c "${OPENSHIFT_INSTALL_LIBVIRT_URI}" list
 Id   Name              State   
--------------------------------
 3    test2-bootstrap   paused  
 4    test2-master-0    paused  

I wondered whether it was caused by a lack of RAM, but the VM was given 20GB of RAM. More information from journalctl of the VM: https://gist.githubusercontent.com/gnufied/322c1254c9170144a1f993cdd8e6e8e2/raw/ec904e96689936aa6a67eaf94b45834ddd73aa9d/gistfile1.txt
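The `virsh list` check above can be scripted when it has to be repeated after every install attempt. The following is a minimal sketch, assuming the table layout shown in this report (an Id column, a Name column, then the state). The helper name `paused_domains` is hypothetical and not part of libvirt or the installer.

```python
from typing import List

# Sample copied from the `virsh list` output in this report.
SAMPLE = """\
 Id   Name              State
--------------------------------
 3    test2-bootstrap   paused
 4    test2-master-0    paused
"""


def paused_domains(virsh_list_output: str) -> List[str]:
    """Return the names of domains whose state column reads 'paused'."""
    paused = []
    for line in virsh_list_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric id (or "-" for inactive domains);
        # this skips the header row and the dashed separator.
        if len(fields) < 3 or not (fields[0].isdigit() or fields[0] == "-"):
            continue
        name, state = fields[1], " ".join(fields[2:])
        if state == "paused":
            paused.append(name)
    return paused


print(paused_domains(SAMPLE))
```

For a single domain, `virsh domstate <domain> --reason` also reports why libvirt considers it paused, which may narrow things down faster than the table alone.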

gnufied commented 5 years ago

Could it be related to https://github.com/openshift/installer/issues/708 ? I did a netinstall of the fedora29 server image and noticed that dnsmasq isn't running. Starting dnsmasq manually fails with a conflict on port 53.

Dec 06 17:19:36 openshift4-host.lan systemd[1]: Started DNS caching server..
Dec 06 17:19:36 openshift4-host.lan dnsmasq[24226]: dnsmasq: failed to create listening socket for port 53: Address already in use
Dec 06 17:19:36 openshift4-host.lan dnsmasq[24226]: failed to create listening socket for port 53: Address already in use
Dec 06 17:19:36 openshift4-host.lan dnsmasq[24226]: FAILED to start up
Dec 06 17:19:36 openshift4-host.lan systemd[1]: dnsmasq.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 06 17:19:36 openshift4-host.lan systemd[1]: dnsmasq.service: Failed with result 'exit-code'.
gnufied commented 5 years ago

So it looks like, because I configured dnsmasq for DNS in the NetworkManager settings, dnsmasq is being started by NetworkManager rather than by the standalone dnsmasq systemd service.

[root@openshift4-host ~]# lsof -i :53
COMMAND   PID    USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
dnsmasq  1342 dnsmasq    5u  IPv4  30672      0t0  UDP openshift4-host.lan:domain 
dnsmasq  1342 dnsmasq    6u  IPv4  30673      0t0  TCP openshift4-host.lan:domain (LISTEN)
dnsmasq  2504 dnsmasq    4u  IPv4  38729      0t0  UDP localhost:domain 
dnsmasq  2504 dnsmasq    5u  IPv4  38730      0t0  TCP localhost:domain (LISTEN)
dnsmasq 19581 dnsmasq    5u  IPv4 121156      0t0  UDP openshift4-host.lan:domain 
dnsmasq 19581 dnsmasq    6u  IPv4 121157      0t0  TCP openshift4-host.lan:domain (LISTEN)
[root@openshift4-host ~]# ps aux|grep dnsmasq

So port 53 is indeed taken by dnsmasq even though the dnsmasq systemd unit is not running, which explains the bind failure. The paused VMs must be caused by something else.
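To make the `lsof -i :53` output easier to read, the sketch below groups the listening addresses by PID. It is only an illustration and assumes the nine-column lsof layout shown above. The function name `listeners_by_pid` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set

# Sample copied from the `lsof -i :53` output in this report.
SAMPLE_LSOF = """\
COMMAND   PID    USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
dnsmasq  1342 dnsmasq    5u  IPv4  30672      0t0  UDP openshift4-host.lan:domain
dnsmasq  1342 dnsmasq    6u  IPv4  30673      0t0  TCP openshift4-host.lan:domain (LISTEN)
dnsmasq  2504 dnsmasq    4u  IPv4  38729      0t0  UDP localhost:domain
dnsmasq  2504 dnsmasq    5u  IPv4  38730      0t0  TCP localhost:domain (LISTEN)
dnsmasq 19581 dnsmasq    5u  IPv4 121156      0t0  UDP openshift4-host.lan:domain
dnsmasq 19581 dnsmasq    6u  IPv4 121157      0t0  TCP openshift4-host.lan:domain (LISTEN)
"""


def listeners_by_pid(lsof_output: str) -> Dict[int, Set[str]]:
    """Map each PID to the set of host names it has bound on the port."""
    owners = defaultdict(set)
    for line in lsof_output.splitlines():
        fields = line.split()
        # Skip the header and any short lines; data rows carry a numeric PID.
        if len(fields) < 9 or not fields[1].isdigit():
            continue
        # The NAME column is "host:port", optionally followed by "(LISTEN)".
        host = fields[8].rsplit(":", 1)[0]
        owners[int(fields[1])].add(host)
    return dict(owners)


for pid, hosts in sorted(listeners_by_pid(SAMPLE_LSOF).items()):
    print(pid, sorted(hosts))
```

On this sample it shows three separate dnsmasq processes, one bound to localhost and two bound to the host name. That would be consistent with one instance started by NetworkManager and others started per libvirt network, but the thread does not confirm which PID belongs to which.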

gnufied commented 5 years ago

Hmm, I booted a new VM based on the full fedora29-server image and hit the same problem. Immediately after the bootstrap and master VMs start, they get paused. I tried connecting to them via the console:

─▪ virsh -c "${OPENSHIFT_INSTALL_LIBVIRT_URI}" console --domain test1-bootstrap
Connected to domain test1-bootstrap
Escape character is ^]
gnufied commented 5 years ago

Hmm, I finally found the VM log in /var/log/libvirt/qemu/test1-bootstrap.log, which shows:

2018-12-06 23:52:05.167+0000: starting up libvirt version: 4.7.0, package: 1.fc29 (Fedora Project, 2018-09-04-10:29:06, ), qemu version: 3.0.0qemu-3.0.0-2.fc29, kernel: 4.18.16-300.fc29.x86_64, hostname: openshift-libvirt.lan
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=spice /usr/bin/qemu-kvm -name guest=test1-bootstrap,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-test1-bootstrap/master-key.aes -machine pc-i440fx-3.0,accel=kvm,usb=off,dump-guest-core=off -m 2048 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 5ba43ae2-c1ff-483e-bdbe-ea5b8c5a5bfb -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=26,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive file=/var/lib/libvirt/images/test1-bootstrap,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=33,id=hostnet0,vhost=on,vhostfd=35 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=3a:b9:7b:c3:28:a4,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charchannel0 -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -spice port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x7 -fw_cfg name=opt/com.coreos/config,file=/var/lib/libvirt/images/test1-bootstrap.ign -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
2018-12-06 23:52:05.167+0000: Domain id=1 is tainted: custom-argv
2018-12-06T23:52:05.225201Z qemu-system-x86_64: -chardev pty,id=charserial0: char device redirected to /dev/pts/3 (label charserial0)
2018-12-06T23:52:05.225464Z qemu-system-x86_64: -chardev pty,id=charchannel0: char device redirected to /dev/pts/4 (label charchannel0)
KVM: entry failed, hardware error 0x7
EAX=00000000 EBX=00000000 ECX=00000000 EDX=00000663
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000fff0 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 00009300
CS =f000 ffff0000 0000ffff 00009b00
SS =0000 00000000 0000ffff 00009300
DS =0000 00000000 0000ffff 00009300
FS =0000 00000000 0000ffff 00009300
GS =0000 00000000 0000ffff 00009300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT=     00000000 0000ffff
IDT=     00000000 0000ffff
CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=06 66 05 00 00 01 00 8e c1 26 66 a3 74 f7 66 5b 66 5e 66 c3 <ea> 5b e0 00 f0 30 36 2f 32 33 2f 39 39 00 fc 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
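
The key line in the log above is `KVM: entry failed, hardware error 0x7`, which is easy to miss under the long qemu command line. As a rough sketch, the following scans log text for that message and returns the error codes. The pattern is taken from this log only, and the helper name `kvm_entry_failures` is hypothetical.

```python
import re
from typing import List

# Matches the failure line as it appears in the qemu log in this report.
KVM_ENTRY_FAILED = re.compile(
    r"KVM: entry failed, hardware error (0x[0-9a-fA-F]+)"
)


def kvm_entry_failures(log_text: str) -> List[int]:
    """Return every hardware error code reported by a failed KVM entry."""
    return [int(m.group(1), 16) for m in KVM_ENTRY_FAILED.finditer(log_text)]


sample = (
    "2018-12-06 23:52:05.167+0000: Domain id=1 is tainted: custom-argv\n"
    "KVM: entry failed, hardware error 0x7\n"
    "EAX=00000000 EBX=00000000 ECX=00000000 EDX=00000663\n"
)
print(kvm_entry_failures(sample))  # [7]
```

To run it against a real log, read the file under /var/log/libvirt/qemu/ for the affected domain and pass its contents in. Reading that directory normally requires root.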
gnufied commented 5 years ago

So this may be caused by a combination of factors, including my use of nested virtualization on Fedora 29. A couple of weeks ago this worked on the same machine, but the Linux kernel has definitely changed since then, so I am not sure what broke it now. :(
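
Since nested virtualization is the suspect, one quick check is whether the KVM module on the outer hypervisor has nesting enabled. The sketch below reads the `nested` module parameter that `kvm_intel` and `kvm_amd` expose under /sys. It assumes one of those modules is loaded, and the function names are hypothetical. It does not diagnose the kernel regression itself.

```python
from pathlib import Path
from typing import Iterable, Optional

# Module parameter files exposed by the Intel and AMD KVM modules.
NESTED_PARAM_FILES = (
    Path("/sys/module/kvm_intel/parameters/nested"),
    Path("/sys/module/kvm_amd/parameters/nested"),
)


def parse_nested_flag(raw: str) -> bool:
    """Interpret the parameter value; kernels report either Y/N or 1/0."""
    return raw.strip().lower() in {"y", "1"}


def nested_virt_enabled(
    candidates: Iterable[Path] = NESTED_PARAM_FILES,
) -> Optional[bool]:
    """Return True/False from the first parameter file found, else None."""
    for path in candidates:
        if path.is_file():
            return parse_nested_flag(path.read_text())
    return None  # No KVM module parameter found on this machine.


print(nested_virt_enabled())
```

A result of `None` means neither module is loaded, which on the inner host would usually mean `/dev/kvm` is missing as well.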

gnufied commented 5 years ago

I guess this is not a bug in the installer, so I am going to close it. For documentation purposes (in case somebody else runs into the same bug), I have filed a bug against Fedora/Kernel: https://bugzilla.redhat.com/show_bug.cgi?id=1657296 .