kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Support ARM64 platform in TensorFlow examples #2119

Closed akhilsaivenkata closed 1 month ago

akhilsaivenkata commented 1 month ago

What this PR does / why we need it: Support ARM64 platform in TensorFlow examples

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged): Fixes #2112


coveralls commented 1 month ago

Pull Request Test Coverage Report for Build 9163985221

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/controller.v1/mpi/mpijob.go | 1 | 91.06% |

| Totals | |
| --- | --- |
| Change from base Build 9130601320 | -0.008% |
| Covered Lines | 4373 |
| Relevant Lines | 12362 |

💛 - Coveralls
akhilsaivenkata commented 1 month ago

@tenzen-y, the check is failing because the `libhdf5.so` library is missing from the build environment. Do we need to change the Dockerfile, or is there a workaround?

tenzen-y commented 1 month ago

> @tenzen-y, the check is failing because the `libhdf5.so` library is missing from the build environment. Do we need to change the Dockerfile, or is there a workaround?

Yes, feel free to address that issue. I suspect that bumping the TF version would resolve it.

akhilsaivenkata commented 1 month ago

> > @tenzen-y, the check is failing because the `libhdf5.so` library is missing from the build environment. Do we need to change the Dockerfile, or is there a workaround?
>
> Yes, feel free to address that issue. I suspect that bumping the TF version would resolve it.

Here we are using Python 3.9 as the base image, and we are running into an issue with the TensorFlow installation: https://github.com/kubeflow/training-operator/blob/master/examples/tensorflow/distribution_strategy/keras-API/Dockerfile

The remaining TensorFlow examples use a TensorFlow base image, which comes with all of its dependencies. Is there any reason for using Python as the base image in the case above?
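
To narrow down a failure like this, it can help to check inside the image which CPU architecture the build is running on and whether the HDF5 shared library is discoverable at all. The following is only a minimal diagnostic sketch, not part of this PR: the helper name `describe_build_env` is hypothetical, and it assumes that `ctypes.util.find_library` can locate libraries in the image (it relies on tools such as `ldconfig` being present on Linux).

```python
import ctypes.util
import platform


def describe_build_env(lib_name: str = "hdf5") -> dict:
    """Report the CPU architecture and whether a shared library is discoverable.

    Hypothetical helper for illustration; it only inspects the environment.
    """
    machine = platform.machine().lower()
    return {
        "machine": machine,
        # Linux usually reports "aarch64", macOS reports "arm64"
        "is_arm64": machine in ("aarch64", "arm64"),
        # find_library returns a library name or path as a string, or None if not found
        "library": ctypes.util.find_library(lib_name),
    }


if __name__ == "__main__":
    info = describe_build_env()
    print(info)
    if info["library"] is None:
        print("libhdf5 was not found by the dynamic library lookup in this environment")
```

Running this in both the Python-based image and a TensorFlow-based image would show whether the two environments actually differ in the availability of the library.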

akhilsaivenkata commented 1 month ago

Hi @tenzen-y, all checks are passing for this PR. Could you please review it and, if everything is in order, merge it? Thank you for your time and assistance.

google-oss-prow[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/kubeflow/training-operator/blob/master/OWNERS)~~ [tenzen-y]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.