kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0
419 stars 210 forks

Update version of debian for Docker images #573

Open sheevy opened 1 year ago

sheevy commented 1 year ago

Bullseye -> Bookworm

alculquicondor commented 1 year ago

It looks like MPICH is failing, but others are passing.

sheevy commented 1 year ago

For me, it was the Intel build that failed locally. I opened this PR to ask whether anyone has hints about what could be happening.

google-oss-prow[bot] commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/kubeflow/mpi-operator/blob/master/OWNERS)~~ [terrytangyuan]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

alculquicondor commented 1 year ago

From:

https://github.com/kubeflow/mpi-operator/actions/runs/5346361756/jobs/9693279621?pr=573 (1.25)

== BEGIN pi-launcher-qlgqn pod logs ==
Couldn't resolve pi-launcher... Retrying
Couldn't resolve pi-launcher... Retrying
Couldn't resolve pi-launcher... Retrying
Couldn't resolve pi-launcher... Retrying
Couldn't resolve pi-launcher... Retrying
Couldn't resolve pi-launcher... Retrying
Couldn't resolve pi-launcher... Retrying
Couldn't resolve pi-launcher... Retrying
Couldn't resolve pi-launcher... Retrying
Resolved pi-launcher
Resolved pi-worker-0.pi-worker.e2e-gs9st.svc
Resolved pi-worker-1.pi-worker.e2e-gs9st.svc
Warning: Permanently added '[pi-worker-0.pi-worker.e2e-gs9st.svc]:2222' (ED25519) to the list of known hosts.
Warning: Permanently added '[pi-worker-1.pi-worker.e2e-gs9st.svc]:2222' (ED25519) to the list of known hosts.
Workers: 2
Rank 1 on host pi-worker-1
Rank 0 on host pi-worker-0
pi is approximately 3.1410376000000002

It kind of looks like the job succeeded?

No idea what could be happening. Maybe the timeouts are just too tight?
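
To make the timeout question concrete, here is a minimal sketch of the kind of retry-until-resolved loop the launcher log suggests. It is not the actual mpi-operator entrypoint: the function name `wait_for_host`, its parameters, and the default values are all assumptions for illustration. It only shows how a too-short deadline could end the wait before DNS for the launcher or workers becomes available.

```python
import socket
import time
from typing import Callable


def wait_for_host(
    host: str,
    timeout_s: float = 30.0,      # hypothetical default, not taken from the operator
    interval_s: float = 1.0,      # hypothetical pause between attempts
    resolve: Callable[[str], str] = socket.gethostbyname,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry DNS resolution until it succeeds or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            resolve(host)
            print(f"Resolved {host}")
            return True
        except OSError:
            # socket.gaierror (name resolution failure) is a subclass of OSError.
            if time.monotonic() >= deadline:
                return False
            print(f"Couldn't resolve {host}... Retrying")
            sleep(interval_s)
```

With a generous `timeout_s`, nine failed lookups followed by a success (as in the log above) still return `True`; with a deadline shorter than the time DNS takes to propagate, the same sequence returns `False` even though the job itself would have run fine.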

google-oss-prow[bot] commented 1 year ago

New changes are detected. LGTM label has been removed.

alculquicondor commented 1 year ago

Still failing... can you investigate locally?

tenzen-y commented 10 months ago

@sheevy Have you made any progress? We need to move this forward so that we can update the OpenMPI version. ref: #588