openshift-metal3 / dev-scripts

Scripts to automate development/test setup for openshift integration with https://github.com/metal3-io/
Apache License 2.0

ipa-downloader container fails on bootstrap VM (trying to reach the external network default router on port 80) and deployment fails with: Error: could not contact API: timeout reached #741

Closed mcornea closed 5 years ago

mcornea commented 5 years ago

Describe the bug

[root@localhost core]# podman logs -f ipa-downloader
+ SNAP=current-tripleo-rdo
+ FILENAME=ironic-python-agent
+ FILENAME_EXT=.tar
+ FFILENAME=ironic-python-agent.tar
++ mktemp -d
+ TMPDIR=/tmp/tmp.MRNyx3ap0U
+ mkdir -p /shared/html/images
+ cd /shared/html/images
+ ls -l
total 0
+ '[' -n http://192.168.123.1/images -a '!' -e ironic-python-agent.tar.headers ']'
+ curl --fail -O http://192.168.123.1/images/ironic-python-agent.tar.headers
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed connect to 192.168.123.1:80; Connection r

192.168.123.1 is the default router for the external network

To Reproduce

I believe this should be reproducible on any baremetal environment.

Expected/observed behavior

I expect the ipa-downloader container to exit successfully. Instead, curl fails to connect and the container exits with an error.

Additional context

[cloud-user@rhhi-node-worker-0 dev-scripts]$ cat config_cloud-user.sh 
#!/bin/bash

# Get a valid pull secret (json string) from
# You can get this secret from https://cloud.openshift.com/clusters/install#pull-secret
set +x
set -x

# Uncomment to build a copy of ironic or inspector locally
#export IRONIC_INSPECTOR_IMAGE=https://github.com/metal3-io/ironic-inspector
#export IRONIC_IMAGE=https://github.com/metal3-io/ironic

# SSH key used to ssh into deployed hosts.  This must be the contents of the
# variable, not the filename. The contents of ~/.ssh/id_rsa.pub are used by
# default.
#export SSH_PUB_KEY=$(cat ~/.ssh/id_rsa.pub)

# Configure custom ntp servers if needed
#export NTP_SERVERS="00.my.internal.ntp.server.com;01.other.ntp.server.com"
BOOTSTRAP_SSH_READY=2500
NODES_PLATFORM=baremetal
INT_IF=eth1
PRO_IF=eth0
CLUSTER_PRO_IF=ens3
ROOT_DISK=/dev/sda
NODES_FILE=/home/cloud-user/instackenv.json
MANAGE_BR_BRIDGE=n
NUM_WORKERS=0
CLUSTER_NAME=rhhi-virt-cluster
BASE_DOMAIN=qe.lab.redhat.com
DNS_VIP=192.168.123.6
EXTERNAL_SUBNET=192.168.123.0/24

mcornea commented 5 years ago

It looks like this is caused by the CACHEURL env var which gets set incorrectly:

[root@localhost core]# podman inspect ipa-downloader | grep CACHEURL
                "CACHEURL=http://192.168.123.1/images"

## 192.168.123.1 is the router address so curl fails
[root@localhost core]# curl http://192.168.123.1/images/ironic-python-agent.tar.headers
curl: (7) Failed to connect to 192.168.123.1 port 80: Connection refused

##  if we use the address of the provision host instead then curl succeeds
[root@localhost core]# curl http://192.168.123.136/images/ironic-python-agent.tar.headers -I
HTTP/1.1 200 OK
Date: Fri, 16 Aug 2019 21:00:39 GMT
Server: Apache/2.4.6 (CentOS)
Last-Modified: Fri, 16 Aug 2019 19:55:32 GMT
ETag: "10b-590415f7cadc7"
Accept-Ranges: bytes
Content-Length: 267
Content-Type: application/x-tar

mcornea commented 5 years ago

CACHEURL is set in https://github.com/openshift/installer/blob/master/data/data/bootstrap/baremetal/files/usr/local/bin/startironic.sh.template#L73-L76

and the if conditional where curl fails is in https://github.com/metal3-io/ironic-ipa-downloader/blob/master/get-resource.sh#L19-L21
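
For readers without the linked sources at hand, here is a minimal sketch of how a cache URL of this shape could be derived from the default route. It is not the actual content of startironic.sh.template: the function names `default_gateway` and `cache_url_for` and the sample routing table are assumptions made for illustration. Only the resulting URL form (`http://<gateway>/images`) is taken from the podman output above.

```shell
#!/bin/bash
# Hypothetical helper: read `ip route`-style output on stdin and print
# the gateway of the first default route.
default_gateway() {
    awk '$1 == "default" && $2 == "via" { print $3; exit }'
}

# Hypothetical helper: build a cache URL of the form seen in CACHEURL.
cache_url_for() {
    printf 'http://%s/images\n' "$1"
}

# Canned routing table (an assumption) standing in for a live `ip route` call.
sample_routes='default via 192.168.123.1 dev ens3 proto dhcp metric 100
192.168.123.0/24 dev ens3 proto kernel scope link src 192.168.123.50 metric 100'

gw=$(printf '%s\n' "$sample_routes" | default_gateway)
cache_url_for "$gw"   # prints http://192.168.123.1/images
```

With a routing table like the sample, the guess lands on the router (192.168.123.1) rather than the provisioning host (192.168.123.136), which matches the failure shown above.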

stbenjam commented 5 years ago

https://github.com/metal3-io/ironic-ipa-downloader/pull/4 should fix the problem of the cache miss being fatal.

Even on baremetal, the provisioning host might have a cache. I wonder if we can't have better logic for its location.
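
To illustrate what "non-fatal cache miss" means here, a rough sketch follows. It is not the code from the linked pull request: `try_cache` and the `FETCH` hook are hypothetical names, and the sketch assumes the downloader script runs with `set -e`, so that an unguarded failing `curl --fail` would abort it.

```shell
#!/bin/bash
# FETCH is a hypothetical hook so the logic can be exercised without a
# network; by default it is the same kind of curl call seen in the log.
FETCH=${FETCH:-curl --fail --silent --show-error -O}

# Try to fetch FILE from the cache at CACHEURL.
# Returns 0 if the file is already present or was fetched,
# 1 if there is no cache or the cache could not be reached.
try_cache() {
    local cacheurl=$1 file=$2
    [ -n "$cacheurl" ] || return 1
    [ -e "$file" ] && return 0
    # Wrapping the call in `if` keeps a failure from aborting a
    # script that runs under `set -e`.
    if $FETCH "$cacheurl/$file"; then
        return 0
    fi
    echo "cache miss at $cacheurl, falling back to upstream" >&2
    return 1
}
```

A caller could then write `try_cache "$CACHEURL" ironic-python-agent.tar.headers || true` and continue with the normal upstream download when the cache is unreachable.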

russellb commented 5 years ago

OpenShift port: https://github.com/openshift/ironic-ipa-downloader/pull/10

russellb commented 5 years ago

Please reopen if you still see a problem once you test with a version that includes this commit.

hardys commented 5 years ago

Even on baremetal, the provisioning host might have a cache. I wonder if we can't have better logic for its location.

My intention in using the default route was that it should work both in the dev-scripts case and in the baremetal case where there's a cache on the provisioning host, i.e. in both cases the default route probably points to a bridge on the provisioning host where the bootstrap VM is running.

That doesn't seem to work in this case, probably because of MANAGE_BR_BRIDGE=n?

Open to suggestions on how we can improve the location-guessing in startironic.sh. I was trying to avoid yet another install-config option, but if necessary we can add one.
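
One possible shape for such location-guessing, offered only as a sketch and not as anything agreed in this thread: honour an explicit override if one is given, otherwise probe a list of candidate hosts and use the first one that actually serves the cache. The names `pick_cache_url` and `PROBE` are hypothetical, and the override argument stands in for whatever install-config option might be added.

```shell
#!/bin/bash
# PROBE is a hypothetical hook; by default it issues a HEAD request with
# a short timeout and discards the response.
PROBE=${PROBE:-curl --fail --silent --head --max-time 5 --output /dev/null}

# pick_cache_url OVERRIDE HOST...
# Prints the cache URL to use, or returns 1 if none is reachable.
pick_cache_url() {
    local override=$1
    shift
    # An explicit setting always wins over guessing.
    if [ -n "$override" ]; then
        printf '%s\n' "$override"
        return 0
    fi
    local host url
    for host in "$@"; do
        url="http://$host/images"
        if $PROBE "$url/ironic-python-agent.tar.headers"; then
            printf '%s\n' "$url"
            return 0
        fi
    done
    return 1
}
```

In the environment from this issue, `pick_cache_url "" 192.168.123.1 192.168.123.136` would skip the router, which refuses connections on port 80, and settle on the provisioning host. How the candidate list itself is obtained is left open here.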