kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0
419 stars 210 forks source link

MPICH support #562

Closed sheevy closed 1 year ago

sheevy commented 1 year ago

This is meant to be a continuation of https://github.com/kubeflow/mpi-operator/pull/478 The original PR was stale, and master moved a lot, so it was easier to just create a new PR. I hope that is ok.

These are meant to be the same changes as in the https://github.com/kubeflow/mpi-operator/pull/478, but rebased on top of current master. The main problem with previous PR was the fact that SlotsPerWorker used enviroment variable to control number of slots, but unfortunately such variable does not exist in case for MPICH. Suggested solution was to add number of slots per worker to hostfile. This PR does not implement this, because it was already done in https://github.com/kubeflow/mpi-operator/pull/523

I hope that's correct understanding.

tenzen-y commented 1 year ago

Thank you for creating this PR! I will review this tomorrow.

tenzen-y commented 1 year ago

@terrytangyuan Can you approve CI?

terrytangyuan commented 1 year ago

Looks failed

tenzen-y commented 1 year ago

@sheevy Can you address the error in CI?

sheevy commented 1 year ago

Do you have any hints where to look? I couldn't see anything in the logs which would point in the right direction

tenzen-y commented 1 year ago

Do you have any hints where to look? I couldn't see anything in the logs which would point in the right direction

CI says we need to regenerate manifests. Did you run make generate on your local?

sheevy commented 1 year ago

I have not. Thanks for the hint, I will check that tomorrow.

tenzen-y commented 1 year ago

Also, you can reproduce error with make verify-generate on your local.

https://github.com/kubeflow/mpi-operator/blob/fda0532ba1a25e48b68316759c2dec438cfc8d13/Makefile#L89-L91

tenzen-y commented 1 year ago

Also, can you update the PR title with MPICH support?

alculquicondor commented 1 year ago

Thanks @tenzen-y. I will check on Monday.

sheevy commented 1 year ago

I think I've implemented all the comments and suggestions. I'm happy for it to get re-tested. Please let me know if you have any further feedback.

sheevy commented 1 year ago

Hey @terrytangyuan, can you approve another run, fixes for all suggestions are in?

terrytangyuan commented 1 year ago

Sure

google-oss-prow[bot] commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/kubeflow/mpi-operator/blob/master/OWNERS)~~ [alculquicondor] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment