ocp-power-automation / ocp4-upi-powervm

OpenShift on IBM PowerVM servers managed using PowerVC
Apache License 2.0

BUG: master and worker nodes fail to run Ignition after installation of CoreOS #216

Closed TGerisch closed 3 years ago

TGerisch commented 3 years ago

When running Ansible (ansible-playbook -e @vars-powervm.yaml playbooks/main.yaml), the playbook stops after approximately one hour, reporting that it could not connect to the master and worker nodes. I ran Ansible again with -vvvv to get verbose output, which is attached here. I started an HVC console on one of the missing master LPARs. It shows that CoreOS itself has been installed, but it now fails to fetch the Ignition file(s) from the helper:

[ 1784.746250] ignition[1248]: GET error: Get "https://api-int.bignumbers-01.saphana.example.com:22623/config/master": EOF
[ **] A start job is running for Ignition (fetch) (29min 46s / no limit)
[ 1789.746774] ignition[1248]: GET https://api-int.bignumbers-01.saphana.example.com:22623/config/master: attempt #361
[ 1789.748464] ignition[1248]: GET error: Get "https://api-int.bignumbers-01.saphana.example.com:22623/config/master": EOF
[ *** ] A start job is running for Ignition (fetch) (29min 51s / no limit)
[ 1794.749031] ignition[1248]: GET https://api-int.bignumbers-01.saphana.example.com:22623/config/master: attempt #362

ansible_vars-powervm.log

We tried to get further information by fetching the resource directly from the helper node: `[root@bignumbers-01 ocp4-upi-powervm-hmc]# curl -kv https://api-int.bignumbers-01.saphana.example.com:22623/config/master`
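For anyone who wants to repeat this check from a script rather than with curl, here is a minimal sketch in Python. It is only an illustration: the function name `probe_config` and the timeout value are my own assumptions, and it skips certificate verification in the same way `curl -k` does, so it is meant for diagnosis only.

```python
import http.client
import ssl
import urllib.error
import urllib.request


def probe_config(url, timeout=10.0):
    """Issue one GET against an Ignition config URL and return (ok, detail)."""
    # Like `curl -k`: do not verify the server certificate (diagnostics only).
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            body = resp.read()
            return True, f"HTTP {resp.status}, {len(body)} bytes"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return False, f"HTTP {exc.code}"
    except (OSError, http.client.HTTPException) as exc:
        # Covers refused connections, timeouts, TLS failures and early EOF.
        return False, f"{type(exc).__name__}: {exc}"
```

Calling `probe_config("https://api-int.bignumbers-01.saphana.example.com:22623/config/master")` on the helper returns a pair such as `(False, "...")` when the connection is cut, which makes it easy to compare the result from different network segments.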


yussufsh commented 3 years ago

Do the RHCOS nodes get their network addresses from the DHCP server? Could you attach the full boot log?

yussufsh commented 3 years ago

I see you are using the ocp4-upi-powervm-hmc repo. That one is an internal IBM project. Please connect with @cs-zhang

TGerisch commented 3 years ago

I found the root cause of this issue: we are using an unusual network configuration. We have configured two ibmveth VLANs. One is attached to a SEA and extends the lab network; the other (configured with a private IP range) is isolated. We created an LPAR with two ibmveth adapters, one configured for each VLAN, and we use this machine as a router. Unfortunately, when TCP packets are routed through ibmveth adapters, the RX and TX checksums are broken, so TCP connections cannot be established between the two networks. This can be mitigated by running the "ethtool -K <interface> rx off tx off" command for each affected adapter. It switches off RX/TX checksum offloading, and TCP then works fine. However, we did not find a way to automate this when we install CoreOS. So we are able to set up the bootstrap machine, but the bootstrap then fails to fetch the data it needs, such as Docker files. The other nodes (masters and workers) therefore receive empty data from the bootstrap node:

[ 1784.746250] ignition[1248]: GET error: Get "https://api-int.bignumbers-01.saphana.example.com:22623/config/master": EOF
[ **] A start job is running for Ignition (fetch) (29min 46s / no limit)
[ 1789.746774] ignition[1248]: GET https://api-int.bignumbers-01.saphana.example.com:22623/config/master: attempt #361
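On a machine where we do have a shell (for example the router LPAR), the mitigation can be scripted. The following is only a rough sketch: the function names and the interface names in the usage note are hypothetical, and it does not solve the open problem of applying the setting during the CoreOS installation itself.

```python
import subprocess


def checksum_off_command(interface):
    """Build the ethtool argv that disables RX/TX checksum offload."""
    return ["ethtool", "-K", interface, "rx", "off", "tx", "off"]


def disable_checksum_offload(interfaces, dry_run=True):
    """Return one ethtool command per interface; run them unless dry_run."""
    commands = [checksum_off_command(name) for name in interfaces]
    if not dry_run:
        for argv in commands:
            # Requires root; raises CalledProcessError if ethtool fails.
            subprocess.run(argv, check=True)
    return commands
```

With `dry_run=True` (the default) the function only returns the commands, so `disable_checksum_offload(["env2", "env3"])` can be used to review what would be executed before running it for real with `dry_run=False`. The names `env2` and `env3` are placeholders for the two ibmveth adapters.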

Closing this as a bug. Maybe there is a way to set up CoreOS with such special device settings, but I did not find any description of how to do this.