kubeflow/training-operator
Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.62k stars
701 forks
issues
"zero-trust" security / networking for training jobs
#2341
astefanutti
opened
4 hours ago
0
Pin accelerate package version in trainer
#2340
gavrissh
closed
25 minutes ago
3
Ensure code generation dependencies are downloaded
#2339
astefanutti
opened
9 hours ago
1
Add openapi-generator CLI option to skip SDK v2 test generation
#2338
astefanutti
closed
1 day ago
2
Refine the server-side apply installation args
#2337
tenzen-y
closed
1 day ago
3
Ignore cache exporting errors in the image building workflows
#2336
tenzen-y
closed
1 day ago
3
KEP-2170: Add AMD ROCm Torch Distributed Training Runtime
#2335
astefanutti
opened
3 days ago
1
mpi job bug
#2334
fyxemmmm
opened
4 days ago
0
[bug] When running pipeline code, the pod DAG always stays in status Init:StartError
#2333
Epochex
opened
6 days ago
0
Upgrade Kubernetes to v1.30.7
#2332
astefanutti
closed
2 days ago
12
How can I change the default MASTER_ADDR in PyTorchJob?
#2331
Jmengfei
opened
1 week ago
2
Upgrade Kubernetes to v1.31.3
#2330
astefanutti
opened
1 week ago
7
Pin Gloo repository in JAX Dockerfile to a specific commit
#2329
sandipanpanda
closed
1 week ago
2
KEP-2170: Add Torch Distributed Runtime
#2328
andreyvelich
closed
1 day ago
5
PyTorchJob didn't create worker pod, seems to hang
#2327
Twilighter9527
opened
2 weeks ago
9
Upgrade kustomization files to Kustomize v5
#2326
oksanabaza
closed
1 day ago
3
Custom Volcano Queues not working with MPIJob V1
#2325
ameya-parab
closed
1 week ago
4
KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK
#2324
andreyvelich
opened
2 weeks ago
5
KEP-2170: Add unit and E2E tests for model and dataset initializers
#2323
seanlaii
opened
2 weeks ago
4
KEP-2170: Add TrainJob conditions
#2322
tenzen-y
closed
2 weeks ago
7
KEP-2170: Design Trainer for the LLM Runtimes
#2321
andreyvelich
opened
3 weeks ago
3
Validate PyTorchJob workers are configured when elasticPolicy is configured
#2320
tarat44
opened
3 weeks ago
4
Update TF examples to Keras V3
#2319
YosiElias
opened
3 weeks ago
1
KEP-2170: Support hundreds and thousands of worker nodes for a single training Job
#2318
tenzen-y
opened
3 weeks ago
1
[fix] Resolve v2alpha API exceptions
#2317
varshaprasad96
closed
1 week ago
4
KEP-2170: Implement Initializer builders in the JobSet plugin
#2316
andreyvelich
closed
3 weeks ago
3
commonize job name validation
#2315
akagami-harsh
opened
1 month ago
1
Kubeflow Training Operator Logo
#2314
andreyvelich
opened
1 month ago
14
KEP-2170: Adding CEL validations on TrainingRuntime/ClusterTrainingRuntime CRDs
#2313
akshaychitneni
opened
1 month ago
3
Update Dockerfile with python debian image in cmd/initializer_v2/dataset/Dockerfile
#2312
mani1soni
opened
1 month ago
5
Use Debian images for Python components in the Training Operator V2
#2311
andreyvelich
opened
1 month ago
5
KEP-2170: Generate Python SDK for Kubeflow Training V2
#2310
andreyvelich
closed
1 month ago
7
WIP: Use SSA in TrainJob Controller
#2309
varshaprasad96
opened
1 month ago
1
KEP-2170: Implement JobSet, PlainML, and Torch Plugins
#2308
andreyvelich
closed
4 weeks ago
6
KEP-2170: Adding validation webhook for v2 trainjob
#2307
akshaychitneni
opened
1 month ago
3
KEP-2170: Initialize runtimes before the manager starts
#2306
tenzen-y
closed
1 month ago
4
KEP-2170: Add unit and E2E tests for model and dataset initializers
#2305
andreyvelich
opened
1 month ago
3
KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration tests
#2304
tenzen-y
closed
1 month ago
5
KEP-2170: Create model and dataset initializers
#2303
andreyvelich
closed
1 month ago
7
Upgrade Go version to v1.23
#2302
tenzen-y
opened
1 month ago
3
Remove Prometheus Monitoring doc
#2301
sophie0730
closed
1 month ago
2
Pytorch job running with pod exception unable to recover after retry
#2300
shaoqingyang
opened
1 month ago
3
Upgrade 1.30
#2299
kannon92
closed
1 week ago
10
KEP-2170: Add the TrainJob state transition design
#2298
tenzen-y
closed
3 weeks ago
7
KEP-2170: Replace UPSERT operation for the objects with SSA PATCH
#2297
tenzen-y
opened
1 month ago
7
KEP-2170: Decouple JobSet from TrainJob
#2296
tenzen-y
closed
1 month ago
5
KEP-2170: Implement TrainJob Reconciler to manage objects
#2295
tenzen-y
closed
1 month ago
6
Upgrade Deepspeed demo dependencies
#2294
Syulin7
closed
1 month ago
3
Add strict validation on error messages in tests for v2 APIs
#2293
akshaychitneni
closed
1 month ago
3
Adapt the manifests to kustomize v5
#2292
tenzen-y
closed
1 day ago
4