openshift-metal3 / dev-scripts

Scripts to automate development/test setup for openshift integration with https://github.com/metal3-io/

METAL-897: Use nmcli instead of legacy network scripts #1631

Closed elfosardo closed 5 months ago
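For context on what this PR replaces: the legacy network-scripts package configures interfaces through ifcfg files under `/etc/sysconfig/network-scripts` and is deprecated on recent RHEL/CentOS releases, while `nmcli` drives NetworkManager directly. A minimal sketch of the kind of substitution involved (the bridge name and addressing below are illustrative, not necessarily the exact dev-scripts values):

```bash
# Before: write an ifcfg file (e.g. ifcfg-ostestbm) and rely on the
# deprecated network-scripts service to bring the interface up.

# After: create and activate the same bridge directly via NetworkManager.
sudo nmcli connection add type bridge ifname ostestbm con-name ostestbm \
    ipv4.method manual ipv4.addresses 192.168.111.1/24
sudo nmcli connection up ostestbm
```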

elfosardo commented 5 months ago

/retest

ofcir failure

elfosardo commented 5 months ago

/retest

elfosardo commented 5 months ago

/retest

elfosardo commented 5 months ago

/retest

elfosardo commented 5 months ago

/retest

galaxy error, interesting!

elfosardo commented 5 months ago

/retest

elfosardo commented 5 months ago

/retest

elfosardo commented 5 months ago

/retest

elfosardo commented 5 months ago

/retest

elfosardo commented 5 months ago

/retest

more ansible galaxy errors, scary

elfosardo commented 5 months ago

/retest

mkowalski commented 5 months ago

Hey, the only thing that concerns me (but it may be completely invalid): how am I supposed to "upgrade" after this commit merges? Should I run `make clean` (or `make realclean`) before pulling from master and only use it afterwards, or does it not matter?

I feel that without a clean before `git pull` I may have unwanted stuff left in my `/etc`, but honestly I'm not sure how this is handled.

What I am trying to say is: maybe in `host_cleanup.sh` we should keep (as a non-failing step) `sudo rm -f /etc/sysconfig/network-scripts/ifcfg-[...]` to handle systems that used the old dev-scripts in the past?

elfosardo commented 5 months ago

> Hey, the only thing that concerns me (but it may be completely invalid): how am I supposed to "upgrade" after this commit merges? Should I run `make clean` (or `make realclean`) before pulling from master and only use it afterwards, or does it not matter?
>
> I feel that without a clean before `git pull` I may have unwanted stuff left in my `/etc`, but honestly I'm not sure how this is handled.
>
> What I am trying to say is: maybe in `host_cleanup.sh` we should keep (as a non-failing step) `sudo rm -f /etc/sysconfig/network-scripts/ifcfg-[...]` to handle systems that used the old dev-scripts in the past?

@mkowalski that sounds like a good idea; I'll update the PR.
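A minimal sketch of what such a backwards-compatible cleanup step could look like (the interface names below are illustrative; the actual list in the PR may differ):

```bash
# host_cleanup.sh sketch: remove ifcfg files that older, pre-nmcli versions
# of dev-scripts may have left behind. `rm -f` exits 0 even when the files
# are already gone, so the step never fails on hosts set up with nmcli.
for ifname in ostestbm ostestpr; do  # illustrative interface names
    sudo rm -f "/etc/sysconfig/network-scripts/ifcfg-${ifname}"
done
```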

elfosardo commented 5 months ago

/retest

elfosardo commented 5 months ago

/retest

CI is not really ok at the moment

elfosardo commented 5 months ago

/retest

mkowalski commented 5 months ago

/lgtm

Whenever CI passes, good to go.

elfosardo commented 5 months ago

/retest

elfosardo commented 5 months ago

/retest

ansible galaxy issue

elfosardo commented 5 months ago

/retest

elfosardo commented 5 months ago

/retest

elfosardo commented 5 months ago

/retest

elfosardo commented 5 months ago

/retest

elfosardo commented 5 months ago

/retest

elfosardo commented 5 months ago

/retest

failure is not related to this change

elfosardo commented 5 months ago

/retest

elfosardo commented 5 months ago

/retest

wow, CI is so foobar at the moment

mkowalski commented 5 months ago

I am not sure if this error is really important here. I looked, and the cluster deploys, but something somewhere fails afterwards:

```
INFO[2024-02-16T15:01:57Z] Step e2e-metal-ipi-bm-baremetalds-devscripts-setup succeeded after 1h17m5s.
INFO[2024-02-16T15:01:57Z] Step phase pre succeeded after 1h19m20s.
INFO[2024-02-16T15:01:57Z] Running multi-stage phase test
INFO[2024-02-16T15:01:57Z] Running step e2e-metal-ipi-bm-baremetalds-e2e-test.
INFO[2024-02-16T16:17:19Z] Logs for container test in pod e2e-metal-ipi-bm-baremetalds-e2e-test:
INFO[2024-02-16T16:17:19Z] time="2024-02-16T16:11:40Z" level=info msg="processed event" event="{{ } {foo-crd.17b463c4bc2ffd43  e2e-horizontal-pod-autoscaling-6430  ed339db1-94bc-4127-9842-38a3ebf7f32d 258089 0 2024-02-16 16:10:55 +0000 UTC <nil> <nil> map[] map[monitor.openshift.io/observed-recreation-count: monitor.openshift.io/observed-update-count:1] [] [] [{kube-controller-manager Update v1 2024-02-16 16:11:40 +0000 UTC FieldsV1 {\"f:count\":{},\"f:firstTimestamp\":{},\"f:involvedObject\":{},\"f:lastTimestamp\":{},\"f:message\":{},\"f:reason\":{},\"f:reportingComponent\":{},\"f:source\":{\"f:component\":{}},\"f:type\":{}} }]} {HorizontalPodAutoscaler e2e-horizontal-pod-autoscaling-6430 foo-crd a2e65cc7-43f1-4f19-a3bf-7a965e1ceb46 autoscaling/v2 257669 } FailedGetResourceMetric failed to get cpu utilization: did not receive metrics for targeted pods (pods might be unready) {horizontal-pod-autoscaler } 2024-02-16 16:10:55 +0000 UTC 2024-02-16 16:11:40 +0000 UTC 4 Warning 0001-01-01 00:00:00 +0000 UTC nil  nil horizontal-pod-autoscaler }"
```

[...]

```
Cleaning up.
found errors fetching in-cluster data: [failed to list files in disruption event folder on node host2.cluster5.ocpci.eng.rdu2.redhat.com: the server could not find the requested resource failed to list files in disruption event folder on node host3.cluster5.ocpci.eng.rdu2.redhat.com: the server could not find the requested resource failed to list files in disruption event folder on node host4.cluster5.ocpci.eng.rdu2.redhat.com: the server could not find the requested resource failed to list files in disruption event folder on node host5.cluster5.ocpci.eng.rdu2.redhat.com: the server could not find the requested resource failed to list files in disruption event folder on node host6.cluster5.ocpci.eng.rdu2.redhat.com: the server could not find the requested resource]
```

[...]

```
Failing tests:
[sig-cli] oc adm node-logs [Suite:openshift/conformance/parallel]
environment: line 123:   320 Killed                  openshift-tests run "${TEST_SUITE}" ${TEST_ARGS:-} --provider "${TEST_PROVIDER:-}" -o "${ARTIFACT_DIR}/e2e.log" --junit-dir "${ARTIFACT_DIR}/junit"
++ date +%s
+ echo 1708100239
{"component":"entrypoint","error":"wrapped process failed: exit status 137","file":"k8s.io/test-infra/prow/entrypoint/run.go:84","func":"k8s.io/test-infra/prow/entrypoint.Options.internalRun","level":"error","msg":"Error executing test process","severity":"error","time":"2024-02-16T16:17:19Z"}
error: failed to execute wrapped command: exit status 137
INFO[2024-02-16T16:17:19Z] Step e2e-metal-ipi-bm-baremetalds-e2e-test failed after 1h15m22s.
```

I can't see how this change would suddenly make the cluster fail conformance (if it really failed) while not breaking the installation.
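One hint in the log above: exit status 137 is 128 + 9, meaning openshift-tests was killed with SIGKILL (often the OOM killer or a harness timeout), which matches the `Killed` line rather than a genuine conformance failure. A quick shell check of that mapping:

```bash
# A process terminated by signal N exits with status 128+N; SIGKILL is 9.
sleep 60 &
kill -9 $!
wait $!
echo $?   # prints 137
```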

elfosardo commented 5 months ago

@mkowalski thank you for checking! It's weird that the error is showing up now, as CI was passing 100% last week, so I don't think the issue is due to this change. I'm going to retest once more and see.

elfosardo commented 5 months ago

/retest

elfosardo commented 5 months ago

/retest

yet another unrelated failure

elfosardo commented 5 months ago

/retest

elfosardo commented 5 months ago

/retest

elfosardo commented 5 months ago

/retest

elfosardo commented 5 months ago

/retest

derekhiggins commented 5 months ago

/approve

Tested on CS9 with both IPv4 and IPv6.
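For anyone reproducing that verification, a rough sketch of the kind of post-deploy check involved (not necessarily the exact commands used):

```bash
# The dev-scripts interfaces should now show up as NetworkManager connections...
nmcli -g NAME,TYPE,DEVICE connection show --active

# ...and no legacy ifcfg files should remain after cleanup.
ls /etc/sysconfig/network-scripts/ifcfg-* 2>/dev/null \
    || echo "no legacy ifcfg files found"
```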

openshift-ci[bot] commented 5 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekhiggins

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openshift-metal3/dev-scripts/blob/master/OWNERS)~~ [derekhiggins]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.