It can't pull the image:
Back-off pulling image "openshift/origin-web-console:51e2775"
...
Error from server (BadRequest): container "webconsole" in pod "webconsole-56d6b94669-rtp6w" is waiting to start: trying and failing to pull image
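For anyone triaging this, a quick way to see what is actually failing on the node (a sketch; the namespace, pod name, and tag are taken from the error above, adjust for your cluster):

# Show the image pull events recorded on the pod
oc -n openshift-web-console describe pod webconsole-56d6b94669-rtp6w

# Try pulling the tag directly on the node to see the registry's answer
docker pull openshift/origin-web-console:51e2775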
@stevekuznetsov How are the images built for these jobs?
/assign @stevekuznetsov
this is consistently breaking our nightly test job.
/cc @wozniakjan
breaking the jenkins plugin PR test jobs as well
@spadgett these jobs should be extending the same base job configuration as our other conformance jobs that run extended tests.
(they extend parent: 'common/test_cases/origin_installed_release.yml')
/cc @jwforres @jupierce
We started pulling the console image by SHA, not sure where or how... but the build AMI that the jobs were based off of was not building the full ecosystem. I updated that AMI job to build everything in https://github.com/openshift/aos-cd-jobs/pull/1280 and kicked off a build here: https://ci.openshift.redhat.com/jenkins/job/ami_build_origin_int_rhel_build/2525/
The jobs should be functional after that AMI is ready.
thanks @stevekuznetsov !
Our jobs are passing again. @gabemontero if you're still having issues you think are related to this, please reopen it, but I think the main issue is resolved.
thanks again @stevekuznetsov
Yeah several plugins have had tests pass now ... so we are good with this one.
I was seeing one oddity with the sync plugin, but it seems unrelated, and I'm not sure yet if it is a flake or persistent.
Happened again on Saturday, April 7th.
TASK [openshift_web_console : Verify that the web console is running] **********
task path: /usr/share/ansible/openshift-ansible/roles/openshift_web_console/tasks/start.yml:2
FAILED - RETRYING: Verify that the web console is running (60 retries left).
...
FAILED - RETRYING: Verify that the web console is running (1 retries left).
fatal: [localhost]: FAILED! => {
...
curl: (6) Could not resolve host: webconsole.openshift-web-console.svc; Name or service not
...
TASK [openshift_web_console : Report console errors] ***************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_web_console/tasks/start.yml:51
fatal: [localhost]: FAILED! => {
    "changed": false,
    "generated_timestamp": "2018-04-13 11:19:11.922200",
    "msg": "Console install failed."
}
/cc @stevekuznetsov
This is a different issue. The deployment is successful and the images can be pulled, but for whatever reason, we aren't able to curl the service (even though it exists).
Could not resolve host: webconsole.openshift-web-console.svc; Name or service not known
@knobunc Any idea why curling the service from the master would fail when there are pods running and ready?
Best I can tell, this is a networking issue accessing the service. @knobunc who can help me debug?
/assign @knobunc
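In case it helps whoever picks this up, here's roughly how I'd narrow down where resolution breaks from the master (a sketch; the resource names come from the errors above):

# Does the service exist and does it have endpoints?
oc get svc,endpoints -n openshift-web-console

# What resolver is the master actually using?
cat /etc/resolv.conf

# Repeat the lookup curl is doing, with and without the cluster suffix
nslookup webconsole.openshift-web-console.svc
dig webconsole.openshift-web-console.svc.cluster.local +short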
Having the same problem, and Google brought me here. Using the openshift-ansible installer, everything works until the web console part, which keels over:
FAILED - RETRYING: Verify that the web console is running (1 retries left).
fatal: [myhost]: FAILED! => {"attempts": 60, "changed": false, "cmd": ["curl", "-k", "https://webconsole.openshift-web-console.svc/healthz"], "delta": "0:00:05.518865", "end": "2018-04-23 16:53:53.693624", "msg": "non-zero return code", "rc": 6, "start": "2018-04-23 16:53:48.174759", "stderr": " % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:04 --:--:-- 0curl: (6) Could not resolve host: webconsole.openshift-web-console.svc; Name or service not known", "stderr_lines": [" % Total % Received % Xferd Average Speed Time Time Time Current", " Dload Upload Total Spent Left Speed", "", " 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0", " 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0", " 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0", " 0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0", " 0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0", " 0 0 0 0 0 0 0 0 --:--:-- 0:00:04 --:--:-- 0curl: (6) Could not resolve host: webconsole.openshift-web-console.svc; Name or service not known"], "stdout": "", "stdout_lines": []}
When I visit my OpenShift web console, I get the SSL connection but the web console itself never loads.
I have the same error as @wozniakjan when installing OpenShift with the openshift-ansible installer, specifically the 3.9 release:
TASK [openshift_web_console : Verify that the web console is running] ******************************************************************************************************
Wednesday 25 April 2018 16:15:07 +0200 (0:00:00.041) 0:40:48.394 *******
FAILED - RETRYING: Verify that the web console is running (60 retries left).
.......
FAILED - RETRYING: Verify that the web console is running (1 retries left).
fatal: [ xx.xxxxx.xxxx.com]: FAILED! => {"attempts": 60, "changed": false, "cmd": ["curl", "-k", "https://webconsole.openshift-web-console.svc/healthz"], "delta": "0:00:00.066483", "end": "2018-04-25 16:25:36.610975", "msg": "non-zero return code", "rc": 6, "start": "2018-04-25 16:25:36.544492", "stderr": " % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (6) Could not resolve host: webconsole.openshift-web-console.svc; Nom ou service inconnu", "stderr_lines": [" % Total % Received % Xferd Average Speed Time Time Time Current", " Dload Upload Total Spent Left Speed", "", " 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (6) Could not resolve host: webconsole.openshift-web-console.svc; Nom ou service inconnu"], "stdout": "", "stdout_lines": []}
...ignoring
.......
........
TASK [openshift_web_console : Report console errors] ***********************************************************************************************************************
Wednesday 25 April 2018 16:25:55 +0200 (0:00:00.325) 0:51:36.899 *******
FAILED! => {"changed": false, "msg": "Console install failed."}
And on the 3.8 release, I had approximately the same error on the same task:
0curl: (7) Failed connect to webconsole.openshift-web-console.svc:443; Connexion refusée"]
=> but here the connection is refused
And then:
TASK [openshift_web_console : Report console errors] ***********************************************************************************************************************
Wednesday 25 April 2018 19:26:19 +0200 (0:00:00.283) 0:48:06.670 *******
fatal: [xxx.xxxxxxx.com]: FAILED! => {"changed": false, "msg": "Console install failed."}
OK, I've tried this 10 different ways and I can consistently duplicate the problem. Fire up Terraform and do this:
main.tf:
provider "aws" {
version = "~> 1.11.0"
region = "us-east-1"
}
variable "terraform_source_url" {
default = "https://my.private.repo.at.work"
}
terraform {
backend "s3" {
bucket = "terraform-state-storage"
encrypt = true
key = "openshift-dev/terraform.tfstate"
region = "us-east-1"
}
}
resource "aws_s3_bucket_object" "terraform_source_url" {
bucket = "terraform-state-storage"
content = "${var.terraform_source_url}"
key = "openshift-dev/terraform_source_url.txt"
server_side_encryption = "AES256"
tags {
terraform_source_url = "${var.terraform_source_url}"
}
}
openshift.tf - make the appropriate substitutions for your VPC:
module "openshift" {
source = "git::https://github.com/tibers/terraform-aws-openshift.git"
public_subnet_ids = ["subnet-something"]
private_subnet_ids = ["subnet-somethingelse"]
vpc_id = "vpc-something"
admin_ssh_key = "your id_rsa goes here"
management_net = "your.cidr.goes.here/24"
public_domain = "your.domain.goes.here"
// in house CentOS 7 base image
app_ami = "ami-something"
infra_ami = "ami-something"
master_ami = "ami-something"
provisioner_ami = "ami-something"
// Kosher CentOS 6 - https://wiki.centos.org/Cloud/AWS
// app_ami = "ami-e3fdd999"
// infra_ami = "ami-e3fdd999"
// master_ami = "ami-e3fdd999"
// provisioner_ami = "ami-e3fdd999"
// instance types
provisioner_instance_type = "m5.xlarge"
master_instance_type = "m5.2xlarge"
infra_instance_type = "m5.xlarge"
app_instance_type = "m5.xlarge"
// sizing
app_node_count = "1"
infra_node_count = "1"
master_node_count = "1"
// names
// names must begin with openshift* so the filter works
provisioner_name = "openshift_provisioner"
master_name = "openshift_master"
infra_name = "openshift_infra"
app_name = "openshift_app"
// spot price
provisioner_spot_price = "1.00"
master_spot_price = "1.00"
infra_spot_price = "1.00"
app_spot_price = "1.00"
}
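With those two files in place, this is roughly how I drive it (a sketch; assumes your AWS credentials are already exported in the environment):

terraform init
terraform plan -out=openshift.plan
terraform apply openshift.plan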
Shell into the provisioner box and run ansible-playbook -i /var/provisioner /openshift-ansible/playbooks/deploy_cluster.yml
or check /var/provisioner/provisioner.log.
Unless I am doing something seriously goofy, this all worked in 3.6, but is broken in 3.9.
@otmanel31 It looks like your error is happening because the node is out of disk space:
0/1 nodes are available: 1 NodeNotReady, 1 NodeOutOfDisk.
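If it helps, a quick way to confirm that on the host (a sketch; the node name is a placeholder and the paths assume a default 3.9 layout):

# Look at the node conditions the scheduler is complaining about
oc describe node <your-node>

# Check disk usage where docker and origin keep their data
df -h /var/lib/docker /var/lib/origin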
OK, thank you @spadgett ... I tried it on a virtual machine that meets the memory requirements and it works. But the error persists on the test server (a physical machine) ... :)
Changing back to P1 for now since the original problem breaking the nightly tests was fixed.
Hey all,
Are there any updates on this, or is there any possible workaround? Does anyone know what the root cause is? Any help will be appreciated.
@junsaw Are you seeing the exact same error as above? (Could not resolve host: webconsole.openshift-web-console.svc; Name or service not known)
You could set openshift_web_console_install=false if it's blocking you, which skips the console install. The web console playbook can be run at a later time.
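Roughly, for anyone who wants to unblock the install this way (a sketch; playbook paths assume a stock openshift-ansible checkout):

# In your inventory, under [OSEv3:vars], add:
#   openshift_web_console_install=false

# Run the cluster install without the console
ansible-playbook -i hosts playbooks/deploy_cluster.yml

# Later, install just the web console
ansible-playbook -i hosts -e openshift_web_console_install=true playbooks/openshift-web-console/config.yml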
ping @knobunc for help debugging this
I think the fix is just to check pod readiness instead of trying to curl the service from the master. The readiness probe will already check the health endpoint, so I don't see a real benefit to using curl.
https://github.com/openshift/openshift-ansible/pull/8274 changes how we verify the console is installed, which will workaround problems resolving the service hostname
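For reference, the check the PR moves to is pod readiness rather than curl; from the command line the equivalent looks something like this (a sketch of the approach, not the exact task from the PR; the deployment name is inferred from the pod names above and the label selector is an assumption):

# Wait for the console deployment to become available
oc -n openshift-web-console rollout status deployment webconsole

# Or read the Ready condition off the pods directly
oc -n openshift-web-console get pods -l app=openshift-web-console \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'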
@spadgett that allowed me to bypass the web console installer.
Thanks Once Again.
@junsaw Thanks for verifying the fix. Can you confirm that the web console is actually working for your cluster?
Is this going to receive a change to the source to move away from the curl check? I'm seeing this problem frequently.
Yeah, we've switched to checking pod readiness instead in master. We're seeing this often enough that I think we should backport the change to 3.9.
@sdodson sound OK?
Yeah that sounds good to me.
Several of our extended test jobs failed to install/stand up due to the console being unavailable:
https://ci.openshift.redhat.com/jenkins/job/test_branch_origin_extended_builds/418/console https://ci.openshift.redhat.com/jenkins/job/test_branch_origin_extended_image_ecosystem/431/console
@spadgett @stevekuznetsov