issues
kubeflow/training-operator
Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.62k stars
700 forks
Newest
Upgrade Kubernetes to v1.30.7
#2332
astefanutti
opened
51 minutes ago
2
How can I change the default MASTER_ADDR in Pytorchjob?
#2331
Jmengfei
opened
9 hours ago
0
Upgrade Kubernetes to v1.31.3
#2330
astefanutti
opened
18 hours ago
5
Pin Gloo repository in JAX Dockerfile to a specific commit
#2329
sandipanpanda
closed
4 days ago
2
KEP-2170: Add Torch Distributed Runtime
#2328
andreyvelich
opened
6 days ago
2
pytorchjob didn't create worker pod, seems to hang
#2327
Twilighter9527
opened
1 week ago
9
Upgrade kustomization files to Kustomize v5
#2326
oksanabaza
opened
1 week ago
1
Custom Volcano Queues not working with MPIJob V1
#2325
ameya-parab
closed
5 hours ago
4
KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK
#2324
andreyvelich
opened
1 week ago
5
KEP-2170: Add unit and E2E tests for model and dataset initializers
#2323
seanlaii
opened
1 week ago
2
KEP-2170: Add TrainJob conditions
#2322
tenzen-y
closed
1 week ago
7
KEP-2170: Design Trainer for the LLM Runtimes
#2321
andreyvelich
opened
2 weeks ago
3
Validate pytorchjob workers are configured when elasticpolicy is configured
#2320
tarat44
opened
2 weeks ago
4
Update TF examples to Keras V3
#2319
YosiElias
opened
2 weeks ago
1
KEP-2170: Support hundreds and thousands worker nodes for a single training Job
#2318
tenzen-y
opened
2 weeks ago
1
[fix] Resolve v2alpha API exceptions
#2317
varshaprasad96
closed
5 hours ago
4
KEP-2170: Implement Initializer builders in the JobSet plugin
#2316
andreyvelich
closed
2 weeks ago
3
commonize job name validation
#2315
akagami-harsh
opened
3 weeks ago
1
Kubeflow Training Operator Logo
#2314
andreyvelich
opened
3 weeks ago
14
KEP-2170: Adding CEL validations on TrainingRuntime/ClusterTrainingRuntime CRDs
#2313
akshaychitneni
opened
3 weeks ago
3
Update Dockerfile with python debian image in cmd/initializer_v2/dataset/Dockerfile
#2312
mani1soni
opened
3 weeks ago
5
Use Debian images for Python components in the Training Operator V2
#2311
andreyvelich
opened
3 weeks ago
5
KEP-2170: Generate Python SDK for Kubeflow Training V2
#2310
andreyvelich
closed
3 weeks ago
7
WIP: Use SSA in TrainJob Controller
#2309
varshaprasad96
opened
3 weeks ago
1
KEP-2170: Implement JobSet, PlainML, and Torch Plugins
#2308
andreyvelich
closed
3 weeks ago
6
KEP-2170: Adding validation webhook for v2 trainjob
#2307
akshaychitneni
opened
4 weeks ago
3
KEP-2170: Initialize runtimes before the manager starts
#2306
tenzen-y
closed
4 weeks ago
4
KEP-2170: Add unit and E2E tests for model and dataset initializers
#2305
andreyvelich
opened
4 weeks ago
3
KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings
#2304
tenzen-y
closed
4 weeks ago
5
KEP-2170: Create model and dataset initializers
#2303
andreyvelich
closed
3 weeks ago
7
Upgrade Go version to v1.23
#2302
tenzen-y
opened
4 weeks ago
3
Remove Prometheus Monitoring doc
#2301
sophie0730
closed
4 weeks ago
2
Pytorch job running with pod exception unable to recover after retry
#2300
shaoqingyang
opened
1 month ago
3
Upgrade 1.30
#2299
kannon92
opened
1 month ago
9
KEP-2170: Add the TrainJob state transition design
#2298
tenzen-y
closed
2 weeks ago
7
KEP-2170: Replace UPSERT operation for the objects with SSA PATCH
#2297
tenzen-y
opened
1 month ago
7
KEP-2170: Decouple JobSet from TrainJob
#2296
tenzen-y
closed
4 weeks ago
5
KEP-2170: Implement TrainJob Reconciler to manage objects
#2295
tenzen-y
closed
1 month ago
6
Upgrade Deepspeed demo dependencies
#2294
Syulin7
closed
1 month ago
3
Add strict validation on error messages in tests for v2 APIs
#2293
akshaychitneni
closed
4 weeks ago
3
Adapt the manifests to kustomize v5
#2292
tenzen-y
opened
1 month ago
4
Support Kubernetes v1.29 - v1.31 or v1.28 - v1.31
#2291
tenzen-y
opened
1 month ago
7
KEP-2170: Implement Job Pipeline Framework plugins
#2290
tenzen-y
opened
1 month ago
2
KEP-2170: Add manifests for Kubeflow Training V2
#2289
andreyvelich
closed
1 month ago
3
Bump transformers from 4.5.1 to 4.38.0 in /examples/pytorch/deepspeed-demo
#2288
dependabot[bot]
closed
1 month ago
6
Bump tqdm from 4.62.3 to 4.66.3 in /examples/pytorch/deepspeed-demo
#2287
dependabot[bot]
closed
1 month ago
5
FSDP Example for T5 Fine-Tuning and PyTorchJob
#2286
andreyvelich
closed
1 month ago
9
adding env vars
#2285
tarekabouzeid
opened
1 month ago
2
Add environment variables to containers
#2284
tarekabouzeid
opened
1 month ago
0
KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API
#2283
andreyvelich
closed
1 month ago
4