Logs with the internal hostname replaced by the string "HOSTNAME". Let me know within the next ~3 days if you would like to see the actual system.
- dev-scripts-make.log
- 01_install_requirements-2019-04-03-094035.log
- 02_configure_host-2019-04-03-094759.log
- 03_ocp_repo_sync-2019-04-03-094937.log
- 04_setup_ironic-2019-04-03-095417.log
- 04_setup_ironic-2019-04-03-095418.log
- 05_build_ocp_installer-2019-04-03-101309.log
- 06_create_cluster-2019-04-03-101407.log
Can you confirm whether you had https://github.com/openshift-metalkube/dev-scripts/pull/254 applied, please?
We were definitely seeing OOMKilled prior to that, but we probably need to look at the AWS flavors if the increase from #254 wasn't enough.
Also note you can re-test this with more memory by modifying tripleo-quickstart-config/metalkube-nodes.yml - it's a tough compromise, because some folks want to do minimal testing on memory-constrained hosts, and some want to do more realistic tests on a box with plenty of spare resources.
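For example, bumping the per-node resources in that file might look roughly like the snippet below. This is written from memory, so the exact key names and current defaults may differ from what is in the repo; treat it as an illustrative sketch rather than a verbatim excerpt.

```yaml
# Illustrative excerpt of tripleo-quickstart-config/metalkube-nodes.yml
# (key names and defaults are assumed, not copied from the repo).
# Memory is in MiB; larger values help the masters schedule all pods,
# smaller values keep memory-constrained hosts usable.
flavors:
  master:
    memory: 16384   # e.g. double a minimal 8192 if the host has headroom
    vcpu: 6
    disk: 60
```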
$ git log | grep d16dcdeca841b9ce291156c67a1ba47a9b2b8c98
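(Grepping the full `git log` output works if the commit is there, but a couple of more direct ways to check whether that commit is reachable from the current branch are shown below, using the same hash.)

```sh
# Exits 0 and prints "present" only if the commit is an ancestor of HEAD
git merge-base --is-ancestor d16dcdeca841b9ce291156c67a1ba47a9b2b8c98 HEAD && echo present

# List any local branches that already contain the commit
git branch --contains d16dcdeca841b9ce291156c67a1ba47a9b2b8c98
```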
Now while `make` is running...
@hardys The problem I had was that even the host itself was swapping, which means that on a host with 46GB RAM I cannot increase the VMs' memory.
Ack, thanks. Looking at the openshift/installer code, it looks like the default instance type for AWS is m4/m5 xlarge, which means 16GB and 4 vCPUs.
Regarding the swapping, can you perhaps check if there are any other workloads consuming resources on the host? I've been testing on a 32G host and it's not swapping; my assumption was that KSM was doing a good job of sharing duplicate pages between the VMs.
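(If it helps to compare notes, these are the host-side numbers I would look at for swap activity and for how much KSM is actually sharing; standard procfs/sysfs paths, nothing dev-scripts specific.)

```sh
# Overall memory and swap usage on the virt host
free -h
# si/so columns show ongoing swap-in/swap-out activity
vmstat 5 3

# KSM status: "1" means the ksmd thread is running
cat /sys/kernel/mm/ksm/run
# How many pages are currently shared/deduplicated between the VMs
cat /sys/kernel/mm/ksm/pages_shared /sys/kernel/mm/ksm/pages_sharing
```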
That was a clean, up-to-date CentOS 7 installation - nothing else was running on the host. The only thing I had created was:
oc --config ocp/auth/kubeconfig new-app https://github.com/OpenShiftDemos/os-sample-python.git
I'll take a closer look at the memory consumption once `make` finishes.
OK, `make` is still running this loop:
```
level=debug msg="Still waiting for the Kubernetes API: Get https://api.ostest.test.metalkube.org:6443/version?timeout=32s: dial tcp 192.168.111.5:6443: connect: connection refused"
level=debug msg="Still waiting for the Kubernetes API: Get https://api.ostest.test.metalkube.org:6443/version?timeout=32s: dial tcp 192.168.111.5:6443: connect: connection refused"
level=debug msg="Still waiting for the Kubernetes API: Get https://api.ostest.test.metalkube.org:6443/version?timeout=32s: dial tcp 192.168.111.5:6443: connect: connection refused"
```
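(While it loops, it can be worth poking at the same endpoint and at the VMs from outside the installer; a rough sketch, reusing the hostname and VIP from the log above:)

```sh
# Is anything answering on the API VIP yet?
curl -ks https://api.ostest.test.metalkube.org:6443/version || echo "API not up yet"

# Are the bootstrap/master VMs actually up on the virt host?
sudo virsh list --all
```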
This is an issue that is now solved. I suggest you retry from scratch, and feel free to increase the memory and virtual CPUs for the master nodes so that all pods can run.
After a new installation I have:
```
[kni@hp-dl360gen8-01 dev-scripts]$ oc --config ocp/auth/kubeconfig get pods --all-namespaces | wc -l
3447
[kni@hp-dl360gen8-01 dev-scripts]$ oc --config ocp/auth/kubeconfig get pods --all-namespaces | grep -v Running | wc -l
3339
```
I assume that is not a healthy state? How can I debug that, please?
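(For anyone hitting the same thing later, a generic starting point might look like the commands below; the namespace and pod name are placeholders to fill in from the first command's output.)

```sh
# Show the pods that are not simply Running (same filter as above, plus node placement)
oc --config ocp/auth/kubeconfig get pods --all-namespaces -o wide | grep -v Running

# Then drill into one of them: recent events and logs from the crashed container
oc --config ocp/auth/kubeconfig -n <namespace> describe pod <pod-name>
oc --config ocp/auth/kubeconfig -n <namespace> logs <pod-name> --previous
```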
There's not enough information here to propose any fix, so I'm going to close this. If you're still having issues please either raise another issue with more details, or jump onto slack and we can talk about the steps to debug, thanks!
After an installation that passed, there are too many pods in too many different states (I particularly dislike "Terminating" and "OOMKilled" :)).