sheevy opened 1 year ago
It looks like MPICH is failing, but the others are passing.
For me it was the Intel test that failed locally. I've opened this PR to ask whether anyone has hints about what could be happening.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: terrytangyuan
The full list of commands accepted by this bot can be found here.
The pull request process is described here.
From:
https://github.com/kubeflow/mpi-operator/actions/runs/5346361756/jobs/9693279621?pr=573 (1.25)
== BEGIN pi-launcher-qlgqn pod logs ==
Couldn't resolve pi-launcher... Retrying
Couldn't resolve pi-launcher... Retrying
Couldn't resolve pi-launcher... Retrying
Couldn't resolve pi-launcher... Retrying
Couldn't resolve pi-launcher... Retrying
Couldn't resolve pi-launcher... Retrying
Couldn't resolve pi-launcher... Retrying
Couldn't resolve pi-launcher... Retrying
Couldn't resolve pi-launcher... Retrying
Resolved pi-launcher
Resolved pi-worker-0.pi-worker.e2e-gs9st.svc
Resolved pi-worker-1.pi-worker.e2e-gs9st.svc
Warning: Permanently added '[pi-worker-0.pi-worker.e2e-gs9st.svc]:2222' (ED25519) to the list of known hosts.
Warning: Permanently added '[pi-worker-1.pi-worker.e2e-gs9st.svc]:2222' (ED25519) to the list of known hosts.
Workers: 2
Rank 1 on host pi-worker-1
Rank 0 on host pi-worker-0
pi is approximately 3.1410376000000002
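The repeated "Couldn't resolve ... Retrying" lines suggest the launcher entrypoint polls DNS until the headless-service hostnames become resolvable. As a rough illustration only (the function name, retry count, and sleep interval below are assumptions, not the actual mpi-operator entrypoint), such a loop might look like:

```shell
#!/bin/sh
# Hypothetical sketch of a resolve-and-retry loop like the one that
# appears to produce the log lines above. Uses getent, which queries
# the Name Service Switch (and so covers cluster DNS in a pod).
resolve_with_retry() {
  host="$1"
  max="${2:-10}"   # assumed retry limit, not from the real script
  i=0
  until getent hosts "$host" > /dev/null 2>&1; do
    i=$((i + 1))
    if [ "$i" -ge "$max" ]; then
      echo "Couldn't resolve $host... giving up" >&2
      return 1
    fi
    echo "Couldn't resolve $host... Retrying"
    sleep 1
  done
  echo "Resolved $host"
}

resolve_with_retry localhost
```

If the retry limit or the surrounding e2e deadline is tight, a slow DNS warm-up could make the test time out even though the MPI job itself eventually succeeds.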
It kind of looks like the job actually succeeded? The launcher eventually resolved all hosts, both workers ran, and pi was computed. Maybe the e2e timeouts are just too tight?
New changes are detected. LGTM label has been removed.
Still failing... can you investigate locally?
@sheevy Do you have any progress? We need to move this forward to update the OpenMPI version. ref: #588
Bullseye -> Bookworm