metal3-io / baremetal-operator

Bare metal host provisioning integration for Kubernetes
Apache License 2.0

All Metal3 Centos E2E main tests fail with #1685 #1785

Closed tuminoid closed 3 months ago

tuminoid commented 3 months ago

What steps did you take and what happened: https://github.com/metal3-io/baremetal-operator/pull/1685 was merged, and since then all Metal3 Centos-based e2e tests on the main branch have failed. If the PR is reverted, they pass.

What did you expect to happen: Centos e2e succeeds.

Anything else you would like to add: Ubuntu variants pass (given that #1780 is merged to fix one issue), so this is isolated to Centos.

Environment: Dev-env / CI. The e2e integration, e2e feature, e2e ephemeral, and bml e2e periodics all fail, as do all PR jobs that run centos-e2e-integration-main.

See https://jenkins.nordix.org/view/Metal3%20Periodic/job/metal3-periodic-centos-e2e-integration-test-main/87/ or any other periodic centos main job.

/kind bug

tuminoid commented 3 months ago

/triage accepted

/cc @dtantsur @elfosardo @MahnoorAsghar @mboukhalfa @Rozzii @kashifest FYI

tuminoid commented 3 months ago

The notable difference in the BMO logs is:

{"level":"info","ts":1718358988.3083067,"logger":"provisioner.ironic","msg":"error caught while checking endpoint, will retry","host":"metal3~node-0","endpoint":"https://172.22.0.2:6385/v1/","error":"Expected HTTP response code [200 300] when accessing [GET https://172.22.0.2:6385/v1/], but got 503 instead: <!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n<html><head>\n<title>503 Service Unavailable</title>\n</head><body>\n<h1>Service Unavailable</h1>\n<p>The server is temporarily unable to service your\nrequest due to maintenance downtime or capacity\nproblems. Please try again later.</p>\n</body></html>"}
{"level":"info","ts":1718358988.3096807,"logger":"controllers.BareMetalHost","msg":"provisioner is not ready","baremetalhost":{"name":"node-0","namespace":"metal3"},"RequeueAfter:":30}
{"level":"info","ts":1718358988.3113363,"logger":"provisioner.ironic","msg":"error caught while checking endpoint, will retry","host":"metal3~node-1","endpoint":"https://172.22.0.2:6385/v1/","error":"Expected HTTP response code [200 300] when accessing [GET https://172.22.0.2:6385/v1/], but got 503 instead: <!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n<html><head>\n<title>503 Service Unavailable</title>\n</head><body>\n<h1>Service Unavailable</h1>\n<p>The server is temporarily unable to service your\nrequest due to maintenance downtime or capacity\nproblems. Please try again later.</p>\n</body></html>"}

This occurs on main only, not with the patch reverted. The code path looks like it's going to retry, but it never recovers and only spams "provisioner is not ready". In the runs with the patch reverted, "provisioner is not ready" is logged for a while and then the host moves on to the next provisioning state.
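To compare a failing run against a reverted one, it can help to count how often each message appears in the BMO log. The following is only a minimal sketch, assuming the log is one JSON object per line as in the excerpt above; `summarize_bmo_log` is a hypothetical helper name, and the sample lines are shortened copies of the entries quoted in this comment (the long `error` field is dropped).

```python
import json
from collections import Counter


def summarize_bmo_log(lines):
    """Count log entries per (logger, msg); lines that are not JSON objects are skipped."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line
        if not isinstance(entry, dict):
            continue
        counts[(entry.get("logger", ""), entry.get("msg", ""))] += 1
    return counts


# Shortened versions of the log lines quoted above (assumption: same field names).
sample = [
    '{"level":"info","logger":"provisioner.ironic","msg":"error caught while checking endpoint, will retry","host":"metal3~node-0"}',
    '{"level":"info","logger":"controllers.BareMetalHost","msg":"provisioner is not ready","baremetalhost":{"name":"node-0","namespace":"metal3"},"RequeueAfter:":30}',
    '{"level":"info","logger":"provisioner.ironic","msg":"error caught while checking endpoint, will retry","host":"metal3~node-1"}',
]

for (logger, msg), n in summarize_bmo_log(sample).most_common():
    print(f"{n:3d}  {logger}: {msg}")
```

Running the same summary on a log from a reverted build should show the "provisioner is not ready" count levelling off once Ironic answers, whereas on main it would presumably keep growing.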

Rozzii commented 3 months ago

I hope this will fix it, or at least move us closer: https://github.com/metal3-io/baremetal-operator/pull/1786