kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.62k stars, 700 forks

Pin Gloo repository in JAX Dockerfile to a specific commit #2329

Closed: sandipanpanda closed this 1 week ago

sandipanpanda commented 1 week ago

Pin the Gloo repository to a specific commit in the JAX Dockerfile to prevent build failures caused by a recent bug introduced in the Gloo codebase. By locking the version of Gloo to a known working commit, we ensure that the JAX build remains stable and functional until the issue is resolved upstream.

The build fails while compiling `gloo/transport/tcp/buffer.cc` because the `__NR_gettid` constant is undefined. The change that triggers this error landed in Gloo after the pinned commit.
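For illustration, pinning could look roughly like the sketch below. It is not the actual change in this PR: `checkout_pinned` is a hypothetical helper, the repository URL in the comments is assumed to be the upstream Gloo repository, and `GLOO_COMMIT` is a placeholder for the known working commit SHA, which is not stated here.

```shell
#!/bin/sh
# Sketch only: check out a repository at one fixed commit so that later
# upstream changes cannot affect the build.
# checkout_pinned is a hypothetical helper, not code taken from this PR.
checkout_pinned() {
    repo_url="$1"   # assumed: https://github.com/facebookincubator/gloo.git
    commit="$2"     # full SHA of a known working commit (placeholder)
    dest="$3"       # directory to clone into

    git clone --quiet "$repo_url" "$dest" || return 1
    # Detach HEAD at the pinned commit; this fails if the SHA does not exist.
    git -C "$dest" checkout --quiet "$commit" || return 1
}

# Hypothetical usage inside a Dockerfile RUN step:
#   checkout_pinned https://github.com/facebookincubator/gloo.git "$GLOO_COMMIT" /gloo
```

Checking out a full SHA rather than a branch name means the build always sees the same source tree, at the cost of having to bump the SHA by hand once the upstream fix lands.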

Related Issue/Context:

Thanks to @andreyvelich for reporting the issue.

coveralls commented 1 week ago

Pull Request Test Coverage Report for Build 11873367192

Details


| Totals | Coverage Status |
| :-- | --: |
| Change from base Build 11758410179: | 0.0% |
| Covered Lines: | 77 |
| Relevant Lines: | 77 |

💛 - Coveralls
google-oss-prow[bot] commented 1 week ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/kubeflow/training-operator/blob/master/OWNERS)~~ [andreyvelich]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.