Closed brianotte closed 3 years ago
NOTE: The images download between 3.9 seconds and 19 seconds when using curl and wget from the host that runs openshift-install.
Additional NOTE: Red Hat OpenShift 4.6.1 (using the Red Hat openshift-install binary) installs successfully from the same host that runs the OKD openshift-install -- with the same install-config.yaml
Only the OKD openshift-install receives error and fails.
Attempted again with openshift-install version 4.6.0-0.okd-2020-11-05-091140
Here is the related error:
time="2020-11-05T08:56:23-05:00" level=error
time="2020-11-05T08:56:23-05:00" level=error msg="Error: Error creating Blob \"rhcossywr4.vhd\" (Container \"vhd\" / Account \"clustersywr4\"): Error opening source file for upload \"https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/32.20201004.3.0/x86_64/fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz\": open https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/32.20201004.3.0/x86_64/fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz: no such file or directory"
time="2020-11-05T08:56:23-05:00" level=error
time="2020-11-05T08:56:23-05:00" level=error msg=" on ../../tmp/openshift-install-367165449/main.tf line 181, in resource \"azurerm_storage_blob\" \"rhcos_image\":"
time="2020-11-05T08:56:23-05:00" level=error msg=" 181: resource \"azurerm_storage_blob\" \"rhcos_image\" {"
time="2020-11-05T08:56:23-05:00" level=error
time="2020-11-05T08:56:23-05:00" level=error
time="2020-11-05T08:56:23-05:00" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"
time="2020-11-05T08:57:52-05:00" level=debug msg="OpenShift Installer 4.6.0-0.okd-2020-11-05-091140"
time="2020-11-05T08:57:52-05:00" level=debug msg="Built from commit c3816ca357a337d3396345d8c69b19fbde219884"
Immediately afterwards, wget downloaded the file in 13 seconds and curl downloaded it in 4 seconds.
I copied the main.tf referenced in the error above. The azurerm_storage_blob stanza produced by the 4.5.15 install is identical to the one produced by 4.6.0-0.okd-2020-11-05-091140.
Here is that stanza:
resource "azurerm_storage_blob" "rhcos_image" {
name = "rhcos${random_string.storage_suffix.result}.vhd"
storage_account_name = azurerm_storage_account.cluster.name
storage_container_name = azurerm_storage_container.vhd.name
type = "Page"
source = var.azure_image_url
metadata = map("source", var.azure_image_url)
}
Within the main.tf file -- the azurerm_image stanza differs between 4.5.15 and 4.6.0-0.okd-2020-11-05-091140 (and I believe all OKD 4.6.1 versions):
OKD 4.5.15 azurerm_image stanza -- NOTE: resource_group_name:
resource "azurerm_image" "cluster" {
name = var.cluster_id
resource_group_name = azurerm_resource_group.main.name
location = var.azure_region
os_disk {
os_type = "Linux"
os_state = "Generalized"
blob_uri = azurerm_storage_blob.rhcos_image.url
}
}
OKD 4.6.0-0.okd-2020-11-05-091140 azurerm_image stanza -- NOTE: resource_group_name:
resource "azurerm_image" "cluster" {
name = var.cluster_id
resource_group_name = data.azurerm_resource_group.main.name
location = var.azure_region
os_disk {
os_type = "Linux"
os_state = "Generalized"
blob_uri = azurerm_storage_blob.rhcos_image.url
}
}
It only seems to fail creating the azurerm_storage_blob.rhcos_image with 4.6.x+
When creating an OKD 4.5.15 cluster, the .openshift_install.log contains the following entries related to azurerm_storage_blob.rhcos_image:
[zaphod@beeblebrox bin]# ./oc version
Client Version: 4.5.0-0.okd-2020-10-15-235428
[zaphod@beeblebrox okd]# grep azurerm_storage_blob.rhcos_image .openshift_install.log
time="2020-11-05T08:25:16-05:00" level=debug msg="azurerm_storage_blob.rhcos_image: Creating..."
time="2020-11-05T08:25:26-05:00" level=debug msg="azurerm_storage_blob.rhcos_image: Still creating... [10s elapsed]"
time="2020-11-05T08:25:36-05:00" level=debug msg="azurerm_storage_blob.rhcos_image: Still creating... [20s elapsed]"
time="2020-11-05T08:25:46-05:00" level=debug msg="azurerm_storage_blob.rhcos_image: Still creating... [30s elapsed]"
time="2020-11-05T08:25:56-05:00" level=debug msg="azurerm_storage_blob.rhcos_image: Still creating... [40s elapsed]"
time="2020-11-05T08:26:06-05:00" level=debug msg="azurerm_storage_blob.rhcos_image: Still creating... [50s elapsed]"
time="2020-11-05T08:26:16-05:00" level=debug msg="azurerm_storage_blob.rhcos_image: Still creating... [1m0s elapsed]"
time="2020-11-05T08:26:22-05:00" level=debug msg="azurerm_storage_blob.rhcos_image: Creation complete after 1m6s [id=https://cluster21wgz.blob.core.windows.net/vhd/rhcos21wgz.vhd]"
DEBUG module.dns.azureprivatedns_zone_virtual_network_link.network: Creation complete after 32s [id=/subscriptions/{REDACTED}/resourceGroups/{REDACTED}-mcdb7-rg/providers/Microsoft.Network/privateDnsZones/{REDACTED}/virtualNetworkLinks/{REDACTED}-network-link]
ERROR
ERROR Error: Error creating Blob "rhcosw77f0.vhd" (Container "vhd" / Account "clusterw77f0"): Error opening source file for upload "https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/32.20201004.3.0/x86_64/fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz": open https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/32.20201004.3.0/x86_64/fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz: no such file or directory
ERROR
ERROR on ../../tmp/openshift-install-739975449/main.tf line 181, in resource "azurerm_storage_blob" "rhcos_image":
ERROR 181: resource "azurerm_storage_blob" "rhcos_image" {
ERROR
ERROR
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change
RETURN_CODE is: 1
Install attempt dated 2020/11/10 -- oc version Client Version: 4.6.0-0.okd-2020-11-10-041548
Is there any diagnostic I can try or additional information that is needed?
This issue's error seems nearly identical to the following issues -- related to version 4.5:
However, I tried again with latest 4.6 openshift-install binary and still receive error.
Was the 4.5 fix also included in the 4.6 openshift-install?
Install attempt 2020/11/16:
./openshift-install version
./openshift-install 4.6.0-0.okd-2020-11-15-130950
built from commit c3816ca357a337d3396345d8c69b19fbde219884
release image registry.svc.ci.openshift.org/origin/release@sha256:0d543ce69abca236c5929fef3d081ddf9926d2e98c7b4029dd04d511183c1e11
Error received:
DEBUG module.dns.azureprivatedns_zone_virtual_network_link.network: Creation complete after 31s [id=/subscriptions/${REDACTED}/resourceGroups/MY_CLUSTER-pjwb2-rg/providers/Microsoft.Network/privateDnsZones/MY_CLUSTER.HAPPY_DOMAIN/virtualNetworkLinks/MY_CLUSTER-pjwb2-network-link]
ERROR
ERROR Error: Error creating Blob "rhcosrarda.vhd" (Container "vhd" / Account "clusterrarda"): Error opening source file for upload "https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/32.20201004.3.0/x86_64/fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz": open https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/32.20201004.3.0/x86_64/fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz: no such file or directory
ERROR
ERROR on ../../tmp/openshift-install-279147249/main.tf line 181, in resource "azurerm_storage_blob" "rhcos_image":
ERROR 181: resource "azurerm_storage_blob" "rhcos_image" {
ERROR
ERROR
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change
RETURN_CODE is: 1
However, the download is accessible via wget:
[zaphod@beeblebrox TEMP]# wget https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/32.20201004.3.0/x86_64/fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz
--2020-11-16 08:23:23-- https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/32.20201004.3.0/x86_64/fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz
Resolving ${REDACTED}(${REDACTED})... ${REDACTED}
Connecting to ${REDACTED} (${REDACTED})|${REDACTED}|:8080... connected.
Proxy request sent, awaiting response... 200 OK
Length: 552118248 (527M) [application/x-xz]
Saving to: ‘fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz’
100%[================================================================================================================================================>] 552,118,248 35.0MB/s in 15s
2020-11-16 08:23:38 (35.7 MB/s) - ‘fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz’ saved [552118248/552118248]
[zaphod@beeblebrox TEMP]# ls -ltr
total 539720
-rw-r--r--. 1 zaphod zaphod 552118248 Oct 19 14:56 fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz
@jomeier might be able to help
Attempted to install with this openshift-install version: 4.7.0-0.okd-2020-11-18-131704
DEBUG module.dns.azureprivatedns_zone_virtual_network_link.network: Still creating... [10s elapsed]
DEBUG module.dns.azureprivatedns_zone_virtual_network_link.network: Still creating... [20s elapsed]
DEBUG module.dns.azureprivatedns_zone_virtual_network_link.network: Still creating... [30s elapsed]
DEBUG module.dns.azureprivatedns_zone_virtual_network_link.network: Creation complete after 31s [id=/subscriptions/${REDACTED}/resourceGroups/${REDACTED}-s8gs5-rg/providers/Microsoft.Network/privateDnsZones/${REDACTED}.${REDACTED}/virtualNetworkLinks/${REDACTED}-s8gs5-network-link]
ERROR
ERROR Error: Error creating Blob "rhcosph7tx.vhd" (Container "vhd" / Account "clusterph7tx"): Error opening source file for upload "https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/32.20200923.3.0/x86_64/fedora-coreos-32.20200923.3.0-azure.x86_64.vhd.xz": open https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/32.20200923.3.0/x86_64/fedora-coreos-32.20200923.3.0-azure.x86_64.vhd.xz: no such file or directory
ERROR
ERROR on ../../tmp/openshift-install-337770606/main.tf line 181, in resource "azurerm_storage_blob" "rhcos_image":
ERROR 181: resource "azurerm_storage_blob" "rhcos_image" {
ERROR
ERROR
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change
RETURN_CODE is: 1
Significant finding -- older version of openshift-install receives the following:
### azurerm_storage_blob.rhcos_image: Creation complete after 1m41s
BACKGROUND: I have been trying numerous openshift-install versions to determine when the installer started failing when installing OKD into an Azure private network -- even though a Red Hat OpenShift cluster installs successfully (using the same install-config.yaml).
The openshift-install version that reports "azurerm_storage_blob.rhcos_image: Creation complete after 1m41s":
./openshift-install version
./openshift-install 4.5.0-0.okd-2020-09-18-202631
built from commit 63200c80c431b8dbaa06c0cc13282d819bd7e5f8
release image quay.io/openshift/okd@sha256:5fd1fe9707a9a4f53c8ccafad0cf44824a3a0b51e197f3fbc98d0884a9ddcf4f
DEBUG module.dns.azureprivatedns_zone_virtual_network_link.network: Still creating... [30s elapsed]
DEBUG module.dns.azureprivatedns_zone_virtual_network_link.network: Creation complete after 32s [id=/subscriptions/${REDACTED}/resourceGroups/${REDACTED}-8mdk2-rg/providers/Microsoft.Network/privateDnsZones/${REDACTED}.${REDACTED}/virtualNetworkLinks/${REDACTED}-8mdk2-network-link]
DEBUG azurerm_storage_blob.rhcos_image: Still creating... [50s elapsed]
DEBUG azurerm_storage_blob.rhcos_image: Still creating... [1m0s elapsed]
DEBUG azurerm_storage_blob.rhcos_image: Still creating... [1m10s elapsed]
DEBUG azurerm_storage_blob.rhcos_image: Still creating... [1m20s elapsed]
DEBUG azurerm_storage_blob.rhcos_image: Still creating... [1m30s elapsed]
DEBUG azurerm_storage_blob.rhcos_image: Still creating... [1m40s elapsed]
DEBUG azurerm_storage_blob.rhcos_image: Creation complete after 1m41s [id=https://clusterfgg2i.blob.core.windows.net/vhd/rhcosfgg2i.vhd]
I'm not entirely sure about it, but maybe this revert change will help: https://github.com/openshift/installer/pull/4395
@vrutkovs or @LorbusChris -- can this issue be assigned -- I am anxious to provide more information towards resolution.
Hi @brianotte and @lorbuschris,
Because FCOS was not available on the Azure Marketplace, at least in the past, these steps were necessary in the installer on the fcos branch:
Because the FCOS image must be uploaded from the local file system, in main.tf the resource azurerm_storage_blob must contain the field "source" instead of "source_uri".
Look here: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_blob
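The wording of the failure ("open https://...: no such file or directory") is consistent with the URL being handed to a local file open, which is what a "source" field pointing at a URL would cause. A minimal Go sketch of that effect, assuming a Unix-like host; isRemoteURL and openAsLocalFile are hypothetical helpers for illustration and are not part of the installer or the azurerm provider:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"net/url"
	"os"
)

// isRemoteURL reports whether s looks like an http(s) URL rather than a local path.
// Hypothetical helper, not taken from the installer or the provider.
func isRemoteURL(s string) bool {
	u, err := url.Parse(s)
	return err == nil && (u.Scheme == "http" || u.Scheme == "https") && u.Host != ""
}

// openAsLocalFile treats source as a path on the local file system,
// the way an upload-from-disk code path would.
func openAsLocalFile(source string) error {
	f, err := os.Open(source)
	if err != nil {
		return err
	}
	return f.Close()
}

func main() {
	source := "https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/32.20201004.3.0/x86_64/fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz"

	err := openAsLocalFile(source)
	fmt.Println("looks like a remote URL:", isRemoteURL(source))
	fmt.Println("error from local open:  ", err)
	fmt.Println("is a not-exist error:   ", errors.Is(err, fs.ErrNotExist))
}
```

The network is never touched here, which would explain why curl and wget succeed from the same host while the blob upload fails.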
The code in
https://github.com/openshift/installer/blob/fcos/pkg/tfvars/azure/azure.go
was overwritten by a force push again. That has now happened for the third time :-)
Here are the interesting code snippets that seem to be missing in the current fcos branch of the installer:
import (
...
"github.com/openshift/installer/pkg/tfvars/internal/cache"
...
)
// TFVars generates Azure-specific Terraform variables launching the cluster.
func TFVars(sources TFVarsSources) ([]byte, error) {
...
cachedImage, err := cache.DownloadImageFile(sources.ImageURL)
if err != nil {
return nil, errors.Wrap(err, "failed to use cached Azure image")
}
cfg := &config{
...
ImageURL: cachedImage,
...
}
https://github.com/jomeier/installer/blob/fcos/pkg/tfvars/azure/azure.go
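The snippet above swaps the remote URL for a locally cached file before Terraform sees it. A rough, standard-library-only sketch of that idea follows. It is not the installer's cache package: downloadToCache and cachePath are hypothetical names, the SHA-256 file naming is an assumption, and unpacking the .xz archive (which the real installer also does) is left out.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// cachePath derives a stable local file name from the image URL.
// The SHA-256 naming scheme is an assumption made for this sketch.
func cachePath(cacheDir, imageURL string) string {
	sum := sha256.Sum256([]byte(imageURL))
	return filepath.Join(cacheDir, hex.EncodeToString(sum[:]))
}

// downloadToCache fetches imageURL into cacheDir unless it is already there
// and returns the local path that can be used as a file source.
func downloadToCache(cacheDir, imageURL string) (string, error) {
	dst := cachePath(cacheDir, imageURL)
	if _, err := os.Stat(dst); err == nil {
		return dst, nil // already cached
	}
	if err := os.MkdirAll(cacheDir, 0o755); err != nil {
		return "", err
	}
	resp, err := http.Get(imageURL) // honours HTTP(S)_PROXY from the environment
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s fetching %s", resp.Status, imageURL)
	}
	// Download to a temporary file first so a partial download never looks cached.
	tmp, err := os.CreateTemp(cacheDir, "download-*")
	if err != nil {
		return "", err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly once the file has been renamed
	if _, err := io.Copy(tmp, resp.Body); err != nil {
		tmp.Close()
		return "", err
	}
	if err := tmp.Close(); err != nil {
		return "", err
	}
	if err := os.Rename(tmp.Name(), dst); err != nil {
		return "", err
	}
	return dst, nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: cacheimage <image-url>")
		os.Exit(2)
	}
	base, err := os.UserCacheDir()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	path, err := downloadToCache(filepath.Join(base, "image-cache-sketch"), os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(path)
}
```

The returned path, rather than the URL, is what would then be passed on as the image source.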
What is the path for getting this to work?
You mean the strategy or the path on the local file system ?
If I can download the file locally -- where should it exist to get this to work? Or is a patch to openshift-install required? Or is there any other configuration that would get this to work?
I guess I understand it is broken -- but in 4.5.0-0.okd-2020-09-18-202631 it uploaded just fine -- but in 4.6+ it fails.
How can we get to a point where it is working again?
Give me a second ...
@brianotte
The installer will be compiled. You can find the binary in ./bin/openshift-install
Give me a note if it works. Can't try it on my own.
Will try as indicated -- to determine functionality. Thank you.
Sure. If you tell me, that it works, I will create a PR.
Compiled. Now working to test install.
./openshift-install version
./openshift-install unreleased-master-3675-ga603c1d9c696a327e6d3f013211c3ebbf4070cfc
built from commit a603c1d9c696a327e6d3f013211c3ebbf4070cfc
release image registry.svc.ci.openshift.org/origin/release:4.6
This is new -- and now progressing past this part...
DEBUG Generating Terraform Variables...
INFO Obtaining RHCOS image file from 'https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/32.20201004.3.0/x86_64/fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz'
DEBUG Unpacking file into "/zaphod/.cache/openshift-installer/image_cache/17fc72fc3c94492707e2104b8dd50d22"...
DEBUG decompressing the image archive as xz
Will continue to monitor and relay status.
I inspected the Azure console and in the cluster's resource group I see that the bootstrap and the masters exist. The following message for the 3 masters has scrolled:
Creation complete after 11m39s
The install progresses farther -- but it hangs on these lines:
DEBUG module.bootstrap.azurerm_linux_virtual_machine.bootstrap: Still creating... [33m50s elapsed]
DEBUG module.bootstrap.azurerm_linux_virtual_machine.bootstrap: Still creating... [34m0s elapsed]
Even though the masters are reported as the following: Creation complete after 11m39s
I suspect that this fix is good -- but that there is another issue related to installing into Azure Private network. I will enable trace logging to determine if there is more information I can capture.
Please let me know if you have what you need for this fix -- or if there is more data I can provide.
The image downloads o.k. -- and I believe it uploads, since the bootstrap and masters are created. All three masters report as completed. But the bootstrap is never reported as up -- so the cluster install into the Azure private network does not succeed (we now get farther before failing, which leaves great hope, but not yet success).
Please let me know how you want to proceed.
Here are steps taken so far:
Then some time later the 3 masters are marked online:
time="2020-11-19T15:10:39-05:00" level=debug msg="module.master.azurerm_linux_virtual_machine.master[2]: Creation complete after 10m39s [id=/subscriptions/${REDACTED}/resourceGroups/${REDACTED}-dx5qg-rg/providers/Microsoft.Compute/virtualMachines/${REDACTED}-dx5qg-master-2]"
But the bootstrap never gets past this point:
time="2020-11-19T15:15:34-05:00" level=debug msg="2020/11/19 15:15:34 [TRACE] dag/walk: vertex \"meta.count-boundary (EachMode fixup)\" is waiting for \"module.bootstrap.azurerm_linux_virtual_machine.bootstrap\""
time="2020-11-19T15:15:34-05:00" level=debug msg="2020/11/19 15:15:34 [TRACE] dag/walk: vertex \"provider.azurerm (close)\" is waiting for \"module.bootstrap.azurerm_linux_virtual_machine.bootstrap\""
time="2020-11-19T15:15:35-05:00" level=debug msg="2020/11/19 15:15:35 [TRACE] dag/walk: vertex \"root\" is waiting for \"meta.count-boundary (EachMode fixup)\""
time="2020-11-19T15:15:39-05:00" level=debug msg="2020/11/19 15:15:39 [TRACE] dag/walk: vertex \"provider.azurerm (close)\" is waiting for \"module.bootstrap.azurerm_linux_virtual_machine.bootstrap\""
time="2020-11-19T15:15:39-05:00" level=debug msg="2020/11/19 15:15:39 [TRACE] dag/walk: vertex \"meta.count-boundary (EachMode fixup)\" is waiting for \"module.bootstrap.azurerm_linux_virtual_machine.bootstrap\""
time="2020-11-19T15:15:40-05:00" level=debug msg="module.bootstrap.azurerm_linux_virtual_machine.bootstrap: Still creating... [15m40s elapsed]"
time="2020-11-19T15:15:40-05:00" level=debug msg="2020/11/19 15:15:40 [TRACE] dag/walk: vertex \"root\" is waiting for \"meta.count-boundary (EachMode fixup)\""
time="2020-11-19T15:15:44-05:00" level=debug msg="2020/11/19 15:15:44 [TRACE] dag/walk: vertex \"meta.count-boundary (EachMode fixup)\" is waiting for \"module.bootstrap.azurerm_linux_virtual_machine.bootstrap\""
time="2020-11-19T15:15:44-05:00" level=debug msg="2020/11/19 15:15:44 [TRACE] dag/walk: vertex \"provider.azurerm (close)\" is waiting for \"module.bootstrap.azurerm_linux_virtual_machine.bootstrap\""
time="2020-11-19T15:15:45-05:00" level=debug msg="2020/11/19 15:15:45 [TRACE] dag/walk: vertex \"root\" is waiting for \"meta.count-boundary (EachMode fixup)\""
I would say that the image upload succeeded. Will create a PR.
Thank you good sir.
You're welcome.
NOTE: Once the above pull request is published -- please let me know as I have this set in the install-config.yaml
publish: Internal
Yet the openshift-install binary is attempting to reach out to a public IP when attempting to determine if the bootstrap machine is available:
netstat -an | grep 443 | grep tcp | grep ESTABLISHED
tcp 0 0 ${AZURE_IP_BOX}:39675 ${PUBLIC_IP}:443 ESTABLISHED
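To check whether a remote endpoint from netstat output like the above falls inside private address space, a small Go sketch along these lines could be used. isPrivateEndpoint is a hypothetical helper, 10.0.0.4 is a made-up example address, and net.IP.IsPrivate needs Go 1.17 or newer:

```go
package main

import (
	"fmt"
	"net"
)

// isPrivateEndpoint reports whether the host part of an "ip:port" endpoint is a
// private address (RFC 1918 for IPv4, RFC 4193 for IPv6).
// Hypothetical helper for illustration only.
func isPrivateEndpoint(hostPort string) (bool, error) {
	host, _, err := net.SplitHostPort(hostPort)
	if err != nil {
		return false, err
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return false, fmt.Errorf("not an IP address: %q", host)
	}
	return ip.IsPrivate(), nil
}

func main() {
	// The first endpoint is an invented private address; the second is one of the
	// addresses wget resolved for builds.coreos.fedoraproject.org in this thread.
	for _, endpoint := range []string{"10.0.0.4:39675", "99.86.230.128:443"} {
		private, err := isPrivateEndpoint(endpoint)
		fmt.Println(endpoint, "private:", private, "error:", err)
	}
}
```

With publish: Internal, one would expect the bootstrap check to target a private endpoint, although a connection routed through a proxy can show up differently.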
Describe the bug
When installing into an Azure private network space, openshift-install fails with the following error:
time="2020-11-04T14:47:56-05:00" level=debug msg="2020/11/04 14:47:56 [DEBUG] azurerm_storage_blob.rhcos_image: apply errored, but we're indicating that via the Error pointer rather than returning it: Error creating Blob \"rhcosk1vi6.vhd\" (Container \"vhd\" / Account \"clusterk1vi6\"): Error opening source file for upload \"https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/32.20201004.3.0/x86_64/fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz\":open https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/32.20201004.3.0/x86_64/fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz: no such file or directory"
I have tested getting that image with curl and wget (so direct connect connectivity works):
...from the system that is running openshift-install:
TESTING curl
[zaphod@beeblebrox scripts]# curl -o fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/32.20201004.3.0/x86_64/fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 526M 100 526M 0 0 27.1M 0 0:00:19 0:00:19 --:--:-- 23.0M
TESTING wget
[zaphod@beeblebrox scripts]# wget https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/32.20201004.3.0/x86_64/fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz
--2020-11-04 16:06:14-- https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/32.20201004.3.0/x86_64/fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz
Resolving builds.coreos.fedoraproject.org (builds.coreos.fedoraproject.org)... 99.86.230.128, 99.86.230.75, 99.86.230.86, ...
Connecting to builds.coreos.fedoraproject.org (builds.coreos.fedoraproject.org)|99.86.230.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 552118248 (527M) [application/x-xz]
Saving to: ‘fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz.1’
100%[=======================================================================================================================================================================>] 552,118,248 139MB/s in 3.9s
2020-11-04 16:06:18 (134 MB/s) - ‘fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz.1’ saved [552118248/552118248]
CONFIRMATION FILES EXIST FROM curl AND wget
[zaphod@beeblebrox scripts]# ls -al | grep xz
-rw-r--r--. 1 root root 552118248 Nov 4 16:06 fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz
-rw-r--r--. 1 root root 552118248 Oct 19 14:56 fedora-coreos-32.20201004.3.0-azure.x86_64.vhd.xz.1
Version
IPI install method attempted.
./openshift-install version
./openshift-install 4.6.0-0.okd-2020-11-04-104427
built from commit c3816ca357a337d3396345d8c69b19fbde219884
release image registry.svc.ci.openshift.org/origin/release@sha256:6f55c950a7b66e85481f6cdd64e835965b6019e292f621c2558923e9bba126a7
How reproducible
100% reproducible.
Log bundle
Did not get oc adm must-gather as cluster did not install.
Here is the install-config.yaml used: