Open ShepherdCheung opened 9 months ago
/kind feature
The training operator doesn't inject those environment variables into pods, and I don't think it should, since we shouldn't lock in specific vendors.
Maybe we can provide a common configuration mechanism, via a ConfigMap and the CRDs, to set custom environment variables on each role's pods (Chief/Worker) for ASICs, cloud vendors, and so on.
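To make the idea concrete, here is a rough Python sketch of what per-role injection could look like. It is only an illustration under assumptions: the `ROLE_ENV` mapping stands in for data that might be read from a ConfigMap, the variable names `VENDOR_CHIEF_ADDR` and `VENDOR_DEVICE_TYPE` are hypothetical, and `inject_role_env` is not an existing training operator function.

```python
from copy import deepcopy

# Hypothetical per-role overrides, standing in for values loaded from a ConfigMap.
ROLE_ENV = {
    "Chief": {"VENDOR_CHIEF_ADDR": "10.0.0.1"},
    "Worker": {"VENDOR_CHIEF_ADDR": "10.0.0.1", "VENDOR_DEVICE_TYPE": "npu"},
}


def inject_role_env(pod_spec, role, role_env):
    """Return a copy of pod_spec with the role's env vars added to every container.

    Variables that a container already defines are left untouched, so values
    set explicitly by the user take precedence over the shared configuration.
    """
    patched = deepcopy(pod_spec)
    for container in patched.get("containers", []):
        env = container.setdefault("env", [])
        existing = {entry["name"] for entry in env}
        for name, value in role_env.get(role, {}).items():
            if name not in existing:
                env.append({"name": name, "value": value})
    return patched


# Example: a Worker pod spec that already sets one of the variables.
worker_spec = {
    "containers": [
        {"name": "tensorflow", "env": [{"name": "VENDOR_DEVICE_TYPE", "value": "gpu"}]}
    ]
}
patched_spec = inject_role_env(worker_spec, "Worker", ROLE_ENV)
```

Keeping the vendor-specific names in data rather than in operator code is the point of the sketch: the operator would only need to know about roles, not about any particular accelerator or cloud.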
cc: @kubeflow/wg-training-leads @kuizhiqing
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen
Training on Huawei Ascend NPUs needs environment variables that differ from those used on GPUs. For example, a TensorFlow job requires variables including CM_CHIEF_IP and CM_CHIEF_ADDR. For details, see the following link: https://www.hiascend.com/en/document/detail/en/CANNCommunityEdition/600alphaX/tfmoddevg/tfmigr2/tfmigr2_000116.html.
In addition, could you support the MindSpore framework?