openshift-metal3 / dev-scripts

Scripts to automate development/test setup for OpenShift integration with https://github.com/metal3-io/ (Apache License 2.0)

ironic containers running under the metal3-baremetal-operator pod are crash-looping #703

Closed. mcornea closed this issue 5 years ago.

mcornea commented 5 years ago

Describe the bug: The ironic containers running under the metal3-baremetal-operator pod are crash-looping. After running make:

[cloud-user@rhhi-node-worker-0 ~]$ oc --config dev-scripts/ocp/auth/kubeconfig -n openshift-machine-api status
In project openshift-machine-api on server https://api.rhhi-virt-cluster.qe.lab.redhat.com:6443

svc/cluster-autoscaler-operator - 172.30.113.127 ports 443->8443, 8080->metrics
  deployment/cluster-autoscaler-operator deploys registry.svc.ci.openshift.org/ocp/4.2-2019-07-22-025130@sha256:b95392771f7b3e00d7c9560469a312c4933d097e6b3fe320e7961d688885d6ca
    deployment #1 running for 24 minutes - 1 pod

svc/machine-api-operator - 172.30.178.199:8080 -> metrics
  deployment/machine-api-operator deploys registry.svc.ci.openshift.org/ocp/4.2-2019-07-22-025130@sha256:f2b371781c1320e161a6bc15b9fbbc9f24f3a94c34a45809c2766966f4cc74f0
    deployment #1 running for 40 minutes - 1 pod (warning: 10 restarts)

deployment/machine-api-controllers deploys registry.svc.ci.openshift.org/ocp/4.2-2019-07-22-025130@sha256:0283c4a29d70e44848ecdddffe7b366c2dac25e6a3a1e57719b1c7c5f5ec8021,registry.svc.ci.openshift.org/ocp/4.2-2019-07-22-025130@sha256:0283c4a29d70e44848ecdddffe7b366c2dac25e6a3a1e57719b1c7c5f5ec8021,registry.svc.ci.openshift.org/ocp/4.2-2019-07-22-025130@sha256:f2b371781c1320e161a6bc15b9fbbc9f24f3a94c34a45809c2766966f4cc74f0
  deployment #1 running for 39 minutes - 1 pod

deployment/metal3-baremetal-operator deploys quay.io/metal3-io/baremetal-operator:master,quay.io/metal3-io/ironic:master,quay.io/metal3-io/ironic:master,quay.io/metal3-io/ironic:master,quay.io/metal3-io/ironic:master,quay.io/metal3-io/ironic:master,quay.io/metal3-io/ironic-inspector:master,quay.io/metal3-io/static-ip-manager:latest
  deployment #1 running for 15 minutes - 0/1 pods (warning: 5 restarts)

Errors:
  * container "ironic-api" in pod/metal3-baremetal-operator-74fdb86688-qw6rg is crash-looping
  * container "ironic-conductor" in pod/metal3-baremetal-operator-74fdb86688-qw6rg is crash-looping
  * container "ironic-dnsmasq" in pod/metal3-baremetal-operator-74fdb86688-qw6rg is crash-looping
  * container "ironic-httpd" in pod/metal3-baremetal-operator-74fdb86688-qw6rg is crash-looping

4 errors, 1 warning, 4 infos identified, use 'oc status --suggest' to see details.
[cloud-user@rhhi-node-worker-0 ~]$ 
[cloud-user@rhhi-node-worker-0 ~]$ oc --config dev-scripts/ocp/auth/kubeconfig -n openshift-machine-api get pods
NAME                                          READY   STATUS             RESTARTS   AGE
cluster-autoscaler-operator-cf969ffc4-btz5b   1/1     Running            0          25m
machine-api-controllers-85fc67ff6c-hjg79      3/3     Running            0          40m
machine-api-operator-577f585945-vrg6m         1/1     Running            10         41m
metal3-baremetal-operator-74fdb86688-qw6rg    4/8     CrashLoopBackOff   24         16m

To Reproduce: Deploy a 3-node cluster with the following config:

[cloud-user@rhhi-node-worker-0 ~]$ cat dev-scripts/config_cloud-user.sh 
#!/bin/bash

# Get a valid pull secret (json string) from
# You can get this secret from https://cloud.openshift.com/clusters/install#pull-secret
set +x
NODES_PLATFORM=baremetal
INT_IF=eth1
PRO_IF=eth0
CLUSTER_PRO_IF=ens3
EXT_IF=
ROOT_DISK=/dev/sda
NODES_FILE=/home/cloud-user/instackenv.json
MANAGE_BR_BRIDGE=n
NUM_WORKERS=0
CLUSTER_NAME=rhhi-virt-cluster
BASE_DOMAIN=qe.lab.redhat.com
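
For context, a sketch of the reproduction flow (the config_$(whoami).sh naming and the bare make target follow dev-scripts' usual conventions and are assumptions, not taken verbatim from this report):

# Reproduction sketch, run as the deploying user on the provisioning host
git clone https://github.com/openshift-metal3/dev-scripts
cd dev-scripts
vi config_$(whoami).sh         # paste the config shown above
make                           # runs the full deployment
oc --config ocp/auth/kubeconfig -n openshift-machine-api get pods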

Expected behavior: the deployment completes successfully. Observed behavior: the ironic containers running under the baremetal-operator pod are crash-looping.

Additional context

[cloud-user@rhhi-node-worker-0 ~]$ oc --config dev-scripts/ocp/auth/kubeconfig -n openshift-machine-api logs metal3-baremetal-operator-74fdb86688-qw6rg -c ironic-api
iptables v1.4.21: can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
Perhaps iptables or your kernel needs to be upgraded.
[cloud-user@rhhi-node-worker-0 ~]$ oc --config dev-scripts/ocp/auth/kubeconfig -n openshift-machine-api logs metal3-baremetal-operator-74fdb86688-qw6rg -c ironic-conductor
iptables v1.4.21: can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
Perhaps iptables or your kernel needs to be upgraded.
[cloud-user@rhhi-node-worker-0 ~]$ oc --config dev-scripts/ocp/auth/kubeconfig -n openshift-machine-api logs metal3-baremetal-operator-74fdb86688-qw6rg -c ironic-dnsmasq
iptables v1.4.21: can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
Perhaps iptables or your kernel needs to be upgraded.
[cloud-user@rhhi-node-worker-0 ~]$ oc --config dev-scripts/ocp/auth/kubeconfig -n openshift-machine-api logs metal3-baremetal-operator-74fdb86688-qw6rg -c ironic-httpd
iptables v1.4.21: can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
Perhaps iptables or your kernel needs to be upgraded.
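
The identical error from all four containers points at the node's iptables/kernel-module state rather than at Ironic itself. A quick host-side check (generic commands, nothing assumed beyond standard tooling):

# On the affected master, e.g. via `oc debug node/<master>` (then chroot /host) or ssh core@<master>
lsmod | grep -E 'ip_tables|iptable_filter'    # is the filter-table module loaded?
getenforce                                    # SELinux mode on the node
dmesg | grep -i 'avc.*denied'                 # module_request denials from the containers?
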
mcornea commented 5 years ago

I managed to get the ironic containers running on the master node after:

[root@rhhi-node-master-0 core]# modprobe ip_tables
[root@rhhi-node-master-0 core]# lsmod | grep iptable
[root@rhhi-node-master-0 core]# setenforce 0
[root@rhhi-node-master-0 core]# lsmod | grep iptable
iptable_filter         16384  1
ip_tables              28672  1 iptable_filter

avc denials:

[root@rhhi-node-master-0 core]# dmesg | grep denied
[ 3505.474392] audit: type=1400 audit(1564448204.432:5): avc:  denied  { module_request } for  pid=98308 comm="iptables" kmod="iptable_filter" scontext=system_u:system_r:container_t:s0:c105,c202 tcontext=system_u:system_r:kernel_t:s0 tclass=system permissive=0
[ 3505.481231] audit: type=1400 audit(1564448204.433:6): avc:  denied  { module_request } for  pid=98308 comm="iptables" kmod="iptable_filter" scontext=system_u:system_r:container_t:s0:c105,c202 tcontext=system_u:system_r:kernel_t:s0 tclass=system permissive=0
[ 3505.486755] audit: type=1400 audit(1564448204.435:7): avc:  denied  { module_request } for  pid=98309 comm="iptables" kmod="iptable_filter" scontext=system_u:system_r:container_t:s0:c105,c202 tcontext=system_u:system_r:kernel_t:s0 tclass=system permissive=0
[ 3505.491788] audit: type=1400 audit(1564448204.435:8): avc:  denied  { module_request } for  pid=98309 comm="iptables" kmod="iptable_filter" scontext=system_u:system_r:container_t:s0:c105,c202 tcontext=system_u:system_r:kernel_t:s0 tclass=system permissive=0
[ 3587.168118] audit: type=1400 audit(1564448286.124:9): avc:  denied  { module_request } for  pid=104037 comm="iptables" kmod="iptable_filter" scontext=system_u:system_r:container_t:s0:c105,c202 tcontext=system_u:system_r:kernel_t:s0 tclass=system permissive=0
[ 3587.176576] audit: type=1400 audit(1564448286.124:10): avc:  denied  { module_request } for  pid=104037 comm="iptables" kmod="iptable_filter" scontext=system_u:system_r:container_t:s0:c105,c202 tcontext=system_u:system_r:kernel_t:s0 tclass=system permissive=0
[ 3587.184230] audit: type=1400 audit(1564448286.127:11): avc:  denied  { module_request } for  pid=104038 comm="iptables" kmod="iptable_filter" scontext=system_u:system_r:container_t:s0:c105,c202 tcontext=system_u:system_r:kernel_t:s0 tclass=system permissive=0
[ 3587.191919] audit: type=1400 audit(1564448286.127:12): avc:  denied  { module_request } for  pid=104038 comm="iptables" kmod="iptable_filter" scontext=system_u:system_r:container_t:s0:c105,c202 tcontext=system_u:system_r:kernel_t:s0 tclass=system permissive=0
[ 3589.014374] audit: type=1400 audit(1564448287.971:13): avc:  denied  { module_request } for  pid=104166 comm="iptables" kmod="iptable_filter" scontext=system_u:system_r:container_t:s0:c105,c202 tcontext=system_u:system_r:kernel_t:s0 tclass=system permissive=0
[ 3589.024676] audit: type=1400 audit(1564448287.973:14): avc:  denied  { module_request } for  pid=104166 comm="iptables" kmod="iptable_filter" scontext=system_u:system_r:container_t:s0:c105,c202 tcontext=system_u:system_r:kernel_t:s0 tclass=system permissive=0
[ 3589.033340] audit: type=1400 audit(1564448287.979:15): avc:  denied  { module_request } for  pid=104167 comm="iptables" kmod="iptable_filter" scontext=system_u:system_r:container_t:s0:c105,c202 tcontext=system_u:system_r:kernel_t:s0 tclass=system permissive=0
[ 3589.042370] audit: type=1400 audit(1564448287.979:16): avc:  denied  { module_request } for  pid=104167 comm="iptables" kmod="iptable_filter" scontext=system_u:system_r:container_t:s0:c105,c202 tcontext=system_u:system_r:kernel_t:s0 tclass=system permissive=0
[ 3600.827937] audit: type=1400 audit(1564448299.783:20): avc:  denied  { module_request } for  pid=105038 comm="iptables" kmod="iptable_filter" scontext=system_u:system_r:container_t:s0:c105,c202 tcontext=system_u:system_r:kernel_t:s0 tclass=system permissive=1
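
Two less drastic alternatives to setenforce 0 might work here, though neither is verified in this report: pre-loading the exact module the containers try to auto-load (the denials above are module_request for iptable_filter), or building a local SELinux policy module from those denials.

# Untested alternatives to running the node permissive:

# 1) Pre-load iptable_filter on the host so iptables inside the containers
#    never triggers a module_request
modprobe iptable_filter
echo iptable_filter > /etc/modules-load.d/iptable_filter.conf   # persist; on RHCOS a MachineConfig is the cleaner route

# 2) Generate a targeted policy module from the recorded denials
dmesg | grep denied | audit2allow -M ironic_iptables
semodule -i ironic_iptables.pp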
russellb commented 5 years ago

I would prefer that we remove all iptables calls from all of the containers. I don't think they are actually necessary for Ironic running within the cluster, but we need to verify that.

e-minguez commented 5 years ago

I'm having the same issue in a baremetal deployment (and worked around it with the same modprobe / disable-SELinux approach). Dan Walsh disapproves of this workaround :dagger:

dhellmann commented 5 years ago

I would prefer that we remove all iptables calls from all of the containers. I don't think they are actually necessary for Ironic running within the cluster, but we need to verify that.

Do we need those for the podman-run containers on the provisioning host? Should we add a switch to enable/disable them instead of just removing them?
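
A switch like that could be as small as an environment-variable guard around the iptables calls in the container entrypoints. A hypothetical sketch (IRONIC_MANAGE_FIREWALL and the specific rule are illustrative, not existing knobs in these images):

# Hypothetical entrypoint fragment; defaults to the current behaviour of managing iptables
if [ "${IRONIC_MANAGE_FIREWALL:-true}" = "true" ]; then
    # e.g. open the Ironic API port when running via podman on the provisioning host
    iptables -I INPUT -p tcp --dport 6385 -j ACCEPT
fi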

russellb commented 5 years ago

Soon none of them will be running on the provisioning host, once Ironic is moved into the bootstrap VM.

yprokule commented 5 years ago

Same as https://github.com/metal3-io/ironic-image/issues/82 ?

dhellmann commented 5 years ago

Is this still an issue?

mcornea commented 5 years ago

Is this still an issue?

Nope, closed.