openshift / installer

Install an OpenShift 4.x cluster
https://try.openshift.com
Apache License 2.0

Openshift 4.7.4 IPI failed to create cluster: failed to apply Terraform: failed to complete the change #4826

Closed carlos-farias closed 2 years ago

carlos-farias commented 3 years ago

Version

$ openshift-install version
./openshift-install 4.7.4
built from commit 7d4efe10b441e9cb3dda33f81c62fd0eaeb3d6e6
release image quay.io/openshift-release-dev/ocp-release@sha256:999a6a4bd731075e389ae601b373194c6cb2c7b4dadd1ad06ef607e86476b129

Platform:

vSphere 7, ESXi 7

IPI

What happened?

Error: level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"

time="2021-04-07T11:57:38-04:00" level=debug msg="vsphere_folder.folder[0]: Creating..."
time="2021-04-07T11:57:39-04:00" level=debug msg="vsphere_folder.folder[0]: Creation complete after 0s [id=group-v34]"
time="2021-04-07T11:57:39-04:00" level=debug msg="vsphereprivate_import_ova.import: Creating..."
time="2021-04-07T11:57:39-04:00" level=error
time="2021-04-07T11:57:39-04:00" level=error msg="Error: rpc error: code = Unavailable desc = transport is closing"
time="2021-04-07T11:57:39-04:00" level=error
time="2021-04-07T11:57:39-04:00" level=error
time="2021-04-07T11:57:39-04:00" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"

See the troubleshooting documentation for ideas about what information to collect. For example, if the installer fails to create resources, attach the relevant portions of your .openshift_install.log.

What you expected to happen?

Create an OpenShift cluster using IPI.

How to reproduce it (as minimally and precisely as possible)?

I followed all the instructions from https://docs.openshift.com/container-platform/4.7/installing/installing_vsphere/installing-vsphere-installer-provisioned.html

[root@Multiporpose ~]# ./openshift-install create cluster

Anything else we need to know?



staebler commented 3 years ago

This is a problem contacting the ESXi host to import the OVA. See if https://github.com/openshift/installer/issues/4761 helps you. The problem in that one ended up being proxy configuration. If you can run the installer with the environment variable TF_LOG=debug, that may provide some more information.
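A minimal sketch of the suggested debugging setup. Note that TF_LOG must be *exported*, not merely assigned, so that the Terraform process the installer spawns actually inherits it; the installer invocation itself is shown as a comment since it depends on your environment:

```shell
# TF_LOG must be exported so the embedded Terraform (a child process)
# inherits it; a plain shell assignment is not enough.
export TF_LOG=debug

# Hypothetical invocation, using the binary and flag from this thread:
#   ./openshift-install create cluster --log-level=debug

# Demonstrate that a child process sees the exported variable:
sh -c 'echo "child sees TF_LOG=$TF_LOG"'
```

With the variable exported, the extra `[DEBUG]`/`[TRACE]` Terraform entries should land in `.openshift_install.log`.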

carlos-farias commented 3 years ago

I tried running the installer with the environment variable TF_LOG=debug, but the error doesn't show any more information about the issue.

INFO Obtaining RHCOS image file from 'https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.7/47.83.202102090044-0/x86_64/rhcos-47.83.202102090044-0-vmware.x86_64.ova?sha256=13d92692b8eed717ff8d0d113a24add339a65ef1f12eceeb99dabcd922cc86d1'
INFO The file was found in cache: /root/.cache/openshift-installer/image_cache/3b90b8f621548d33b166787e8d70207d. Reusing...
INFO Creating infrastructure resources...
ERROR
ERROR Error: rpc error: code = Unavailable desc = transport is closing
ERROR
ERROR
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change
[root@Multiporpose intento0]# echo $TF_LOG
debug
[root@Multiporpose intento0]#
carlos-farias commented 3 years ago

I tried

[root@Multiporpose openshift]# openshift-install create cluster --log-level=debug

But I am getting the same information.

DEBUG Initializing modules...
DEBUG - bootstrap in ../../tmp/openshift-install-285837255/bootstrap
DEBUG - master in ../../tmp/openshift-install-285837255/master
DEBUG
DEBUG Initializing the backend...
DEBUG
DEBUG Initializing provider plugins...
DEBUG
DEBUG Terraform has been successfully initialized!
DEBUG
DEBUG You may now begin working with Terraform. Try running "terraform plan" to see
DEBUG any changes that are required for your infrastructure. All Terraform commands
DEBUG should now work.
DEBUG
DEBUG If you ever set or change modules or backend configuration for Terraform,
DEBUG rerun this command to reinitialize your working directory. If you forget, other
DEBUG commands will detect it and remind you to do so if necessary.
DEBUG data.vsphere_datacenter.datacenter: Refreshing state...
DEBUG data.vsphere_compute_cluster.cluster: Refreshing state...
DEBUG data.vsphere_datastore.datastore: Refreshing state...
DEBUG data.vsphere_network.network: Refreshing state...
DEBUG vsphere_tag_category.category: Creating...
DEBUG vsphere_tag_category.category: Creation complete after 0s [id=urn:vmomi:InventoryServiceCategory:80945e53-9df1-4382-8c72-2a6145b4cc97:GLOBAL]
DEBUG vsphere_tag.tag: Creating...
DEBUG vsphere_tag.tag: Creation complete after 0s [id=urn:vmomi:InventoryServiceTag:e030f3c8-ead7-4e21-9db5-0e12dc2fb065:GLOBAL]
DEBUG vsphere_folder.folder[0]: Creating...
DEBUG vsphere_folder.folder[0]: Creation complete after 0s [id=group-v36]
DEBUG vsphereprivate_import_ova.import: Creating...
ERROR
ERROR Error: rpc error: code = Unavailable desc = transport is closing
ERROR
ERROR
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change
carlos-farias commented 3 years ago

I was checking the https://github.com/openshift/installer/issues/4761 issue, but I don't use a proxy; all the machines on the network have direct access to the internet.

staebler commented 3 years ago

Do you mind trying again with TF_LOG=trace?

carlos-farias commented 3 years ago

Matthew,

I tried, but there is no new information. This version of the installer doesn't ask for SSH keys, and the documentation indicates that step is optional. Reading other posts with a similar message, Error: rpc error: code = Unavailable desc = transport is closing could be a network error or a handshake failure. I will try the optional step, but if the installer doesn't ask for SSH keys, how do I force it to use the keys?

Here is the message with TF_LOG=trace

INFO Obtaining RHCOS image file from 'https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.7/47.83.202102090044-0/x86_64/rhcos-47.83.202102090044-0-vmware.x86_64.ova?sha256=13d92692b8eed717ff8d0d113a24add339a65ef1f12eceeb99dabcd922cc86d1'
INFO The file was found in cache: /root/.cache/openshift-installer/image_cache/3b90b8f621548d33b166787e8d70207d. Reusing...
INFO Creating infrastructure resources...
ERROR
ERROR Error: rpc error: code = Unavailable desc = transport is closing
ERROR
ERROR
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change
[root@Multiporpose intento_0]# echo $TF_LOG
trace
[root@Multiporpose intento_0]#

Thanks for your kind support.

staebler commented 3 years ago

Are you seeing any TRACE log messages in the .openshift_install.log file?

I will try the optional step, but if the installer doesn't ask for SSH keys, how do I force it to use the keys?

SSH keys are not relevant here. Nothing is using SSH at this point.

carlos-farias commented 3 years ago

Hi,

The final lines of openshift-install.log

time="2021-04-07T21:49:25-04:00" level=debug msg="vsphere_tag_category.category: Creation complete after 1s [id=urn:vmomi:InventoryServiceCategory:bbd72f27-6291-445e-a5a0-37f01164a78e:GLOBAL]"
time="2021-04-07T21:49:25-04:00" level=debug msg="vsphere_tag.tag: Creating..."
time="2021-04-07T21:49:25-04:00" level=debug msg="vsphere_tag.tag: Creation complete after 0s [id=urn:vmomi:InventoryServiceTag:b0846a65-1727-494a-affc-fd23f77fea55:GLOBAL]"
time="2021-04-07T21:49:25-04:00" level=debug msg="vsphere_folder.folder[0]: Creating..."
time="2021-04-07T21:49:26-04:00" level=debug msg="vsphere_folder.folder[0]: Creation complete after 0s [id=group-v38]"
time="2021-04-07T21:49:26-04:00" level=debug msg="vsphereprivate_import_ova.import: Creating..."
time="2021-04-07T21:49:26-04:00" level=error
time="2021-04-07T21:49:26-04:00" level=error msg="Error: rpc error: code = Unavailable desc = transport is closing"
time="2021-04-07T21:49:26-04:00" level=error
time="2021-04-07T21:49:26-04:00" level=error
time="2021-04-07T21:49:26-04:00" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"

It seems there are no TRACE entries in the log. I tried with this command too:

[root@Multiporpose intento_0]# openshift-install create cluster --log-level trace
[root@Multiporpose intento_0]# openshift-install
Creates OpenShift clusters

Usage:
  openshift-install [command]

Available Commands:
  completion  Outputs shell completions for the openshift-install command
  create      Create part of an OpenShift cluster
  destroy     Destroy part of an OpenShift cluster
  explain     List the fields for supported InstallConfig versions
  gather      Gather debugging data for a given installation failure
  graph       Outputs the internal dependency graph for installer
  help        Help about any command
  migrate     Do a migration
  version     Print version information
  wait-for    Wait for install-time events

Flags:
      --dir string         assets directory (default ".")
  -h, --help               help for openshift-install
      --log-level string   log level (e.g. "debug | info | warn | error") (default "info")

Use "openshift-install [command] --help" for more information about a command.
[root@Multiporpose intento_0]#
staebler commented 3 years ago

Hmm, the TF_LOG setting is not getting picked up. I would expect log entries similar to the following in your log file. Notice the [DEBUG] parts, which are coming from the terraform log.

time="2021-04-07T14:09:24Z" level=debug msg="2021-04-07T14:09:24.143Z [DEBUG] plugin.terraform-provider-vsphere: 2021/04/07 14:09:24 [DEBUG] Reading tags for object \"group-v248247\""
time="2021-04-07T14:09:24Z" level=debug msg="2021-04-07T14:09:24.194Z [DEBUG] plugin.terraform-provider-vsphere: 2021/04/07 14:09:24 [DEBUG] Tags for object \"group-v248247\": urn:vmomi:InventoryServiceTag:75841746-3120-479d-8d64-6f8f2247af50:GLOBAL"
time="2021-04-07T14:09:24Z" level=debug msg="vsphere_folder.folder[0]: Creation complete after 0s [id=group-v248247]"
time="2021-04-07T14:09:24Z" level=debug msg="vsphereprivate_import_ova.import: Creating..."
time="2021-04-07T14:09:24Z" level=debug msg="2021/04/07 14:09:24 [DEBUG] vsphereprivate_import_ova.import: applying the planned Create change"
time="2021-04-07T14:09:24Z" level=debug msg="2021-04-07T14:09:24.204Z [DEBUG] plugin.terraform-provider-vsphereprivate: 2021/04/07 14:09:24 [DEBUG] /tmp/.cache/openshift-installer/image_cache/3b90b8f621548d33b166787e8d70207d: Beginning import ova create"
carlos-farias commented 3 years ago

I can't get a trace yet, attached is the full log.

[root@Multiporpose intento_2]# echo $TF_LOG
trace
[root@Multiporpose intento_2]# openshift-install create cluster --log-level trace

openshift_install.log

carlos-farias commented 3 years ago

Hello, is there another way to force DEBUG or TRACE logging? Or some manual way to run Terraform to check where the issue is?

staebler commented 3 years ago

Hello, is there another way to force DEBUG or TRACE logging? Or some manual way to run Terraform to check where the issue is?

I have not encountered the TF_LOG environment variable not working, so I don't have any immediate advice on that. I have it on my TODO list to look into why the env var may not be working.

carlos-farias commented 3 years ago

Hi,

We found the issue with our installation using IPI.

Since IPI did not work, we started a UPI installation using the [OCP4 Helper node](https://github.com/RedHatOfficial/ocp4-helpernode) and then [ocp4-vsphere-upi-automation](https://github.com/RedHatOfficial/ocp4-vsphere-upi-automation#run-installation-playbook).

We had errors there too, and we realized the network name had been changed after the installation on vSphere: the SDK returned the old network name, but when it tried to deploy the network, that name was not found. The Python scripts showed the error, so we changed the network name in the config files.

To solve the issue with IPI, we created an install-config.yaml and manually changed the vSphere network name.
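For context, a hypothetical install-config.yaml fragment showing where the network name lives; field names follow the documented vSphere platform schema for 4.7, and all values (vCenter hostname, datacenter, datastore, port group) are example placeholders. The `network` value must match the port group name as currently shown in vSphere, not a stale pre-rename value:

```shell
# Write an example install-config fragment and confirm the network field.
# All values below are illustrative placeholders.
cat > /tmp/install-config-example.yaml <<'EOF'
platform:
  vsphere:
    vCenter: vcenter.example.com
    datacenter: dc1
    defaultDatastore: datastore1
    network: VM-Network
EOF
grep 'network:' /tmp/install-config-example.yaml
```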

The bootstrap and master0-2 nodes were deployed and started. The bootstrap node shows a login session, but the master nodes seem to have failed to get the ignition files, so that is a new issue to solve.


I have another question: why doesn't IPI let you use the same IP for both the API and Ingress load balancers?

Regards

staebler commented 3 years ago

I have another question: why doesn't IPI let you use the same IP for both the API and Ingress load balancers?

The backends of the API and the Ingress are different, so they need to be separate load balancers and, as such, separate IPs.
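A hypothetical fragment illustrating the point: the API VIP fronts the control-plane machines while the Ingress VIP fronts the router pods, so the installer expects two distinct addresses. The field names match the 4.7 vSphere platform schema; the addresses are examples only:

```shell
# Example only: two distinct VIPs, one per load balancer role.
cat <<'EOF'
platform:
  vsphere:
    apiVIP: 192.168.110.10      # fronts the Kubernetes API (control plane)
    ingressVIP: 192.168.110.11  # fronts the Ingress routers (workers)
EOF
```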

duritong commented 3 years ago

Once https://github.com/openshift/installer/pull/4906 is merged, the installer will check the validity of the network before starting Terraform. That should at least address the initial issue.

openshift-bot commented 3 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 3 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot commented 2 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci[bot] commented 2 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/installer/issues/4826#issuecomment-933849019):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting `/reopen`. Mark the issue as fresh by commenting `/remove-lifecycle rotten`. Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.