kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.58k stars 688 forks

Support training on Ascend NPU and the MindSpore deep-learning framework #1966

Open ShepherdCheung opened 9 months ago

ShepherdCheung commented 9 months ago

Training on Huawei Ascend NPUs requires environment variables that differ from those used for GPUs. For example, a TensorFlow job requires environment variables including CM_CHIEF_IP and CM_CHIEF_ADDR. For details, see the following link: https://www.hiascend.com/en/document/detail/en/CANNCommunityEdition/600alphaX/tfmoddevg/tfmigr2/tfmigr2_000116.html.

In addition, could you support the MindSpore framework?

tenzen-y commented 9 months ago

/kind feature

The training operator doesn't inject those envvars into pods, and I don't think it should, since we shouldn't lock ourselves in to specific vendors.

Maybe we can provide a common configuration via ConfigMap and CRDs to set custom envvars on each role's pods (Chief/Worker) for ASICs, cloud vendors, and so on.
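
To make the idea concrete, here is a rough sketch of what per-role envvar injection could look like when applied to a TFJob manifest on the client side. It is not an existing training-operator feature: `ROLE_ENV` and `inject_role_env` are hypothetical names, and the envvar values are placeholders. It only assumes the usual TFJob layout, where each role under `spec.tfReplicaSpecs` carries a pod template.

```python
import copy

# Hypothetical per-role envvars. The names come from this issue; the values
# are placeholders, not a working Ascend configuration.
ROLE_ENV = {
    "Chief": {"CM_CHIEF_IP": "10.0.0.1", "CM_CHIEF_ADDR": "placeholder-addr"},
    "Worker": {"CM_CHIEF_IP": "10.0.0.1"},
}


def inject_role_env(job, role_env, replica_key="tfReplicaSpecs"):
    """Return a copy of `job` with `role_env` merged into each role's containers.

    Envvars already set by the user in the pod template are left untouched.
    """
    patched = copy.deepcopy(job)
    replicas = patched.get("spec", {}).get(replica_key, {})
    for role, env_vars in role_env.items():
        replica = replicas.get(role)
        if replica is None:
            continue  # the job does not define this role
        pod_spec = replica.get("template", {}).get("spec", {})
        for container in pod_spec.get("containers", []):
            env = container.setdefault("env", [])
            existing = {item["name"] for item in env}
            for name, value in env_vars.items():
                if name not in existing:
                    env.append({"name": name, "value": str(value)})
    return patched


# Minimal TFJob manifest as a plain dict; the image name is a placeholder.
tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "ascend-example"},
    "spec": {
        "tfReplicaSpecs": {
            "Chief": {
                "replicas": 1,
                "template": {"spec": {"containers": [
                    {"name": "tensorflow", "image": "example/image:tag"},
                ]}},
            },
            "Worker": {
                "replicas": 2,
                "template": {"spec": {"containers": [
                    {
                        "name": "tensorflow",
                        "image": "example/image:tag",
                        # Already set by the user, so it must not be overridden.
                        "env": [{"name": "CM_CHIEF_IP", "value": "192.168.0.5"}],
                    },
                ]}},
            },
        }
    },
}

patched = inject_role_env(tfjob, ROLE_ENV)
```

In this sketch the per-role mapping plays the part of the ConfigMap, and values the user already set in the pod template take precedence, so vendor-specific settings stay out of the operator itself.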

tenzen-y commented 9 months ago

cc: @kubeflow/wg-training-leads @kuizhiqing

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tenzen-y commented 6 months ago

/lifecycle frozen