kubeflow/training-operator
Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.62k stars
700 forks
issues
Add environment variables to containers
#2284
tarekabouzeid
opened
1 month ago
0
KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API
#2283
andreyvelich
closed
1 month ago
4
How can I change the default MASTER_PORT in Pytorchjob?
#2282
certainly-cyber
closed
1 month ago
2
[v2alpha] Move GV related codebase
#2281
varshaprasad96
closed
1 month ago
2
KEP-2170: Migrate the container resource calculation mechanism to k/k library
#2280
tenzen-y
opened
1 month ago
7
Document the spec.managedBy field and its use for MultiKueue
#2279
mimowo
opened
1 month ago
2
Training Operator crashes when submitting PyTorchJob with elasticPolicy but without worker template defined
#2278
alenawang
opened
1 month ago
2
PET_NNODES env var for PyTorchJobs is incorrect when elasticPolicy is set
#2277
alenawang
opened
1 month ago
1
[SDK] Use torchrun to create PyTorchJob from function
#2276
andreyvelich
closed
1 month ago
3
[SDK] test: add unit test for get_job_logs method of the training_client
#2275
seanlaii
closed
1 month ago
5
Added test for create-pytorchjob.ipynb python notebook
#2274
saileshd1402
opened
1 month ago
3
KEP-2170: Generate clientset, openapi spec for the V2 APIs
#2273
varshaprasad96
closed
1 month ago
2
Engineering
#2272
TowneMi
closed
2 months ago
0
Integration tests
#2271
oksanabaza
closed
2 months ago
1
Update tf job examples to tf v2
#2270
YosiElias
closed
3 weeks ago
7
How to restart the training JOB when one training process fails in cluster environment to recover the training?
#2269
kevinsummer219
opened
2 months ago
4
[SDK] move env var to constants.py
#2268
varshaprasad96
closed
2 months ago
2
[SDK] test: add unit test for list_jobs method of the training_client
#2267
seanlaii
closed
1 month ago
5
Problem with "pytorch-dist-mnist-test:v1.0" image in example notebook "create-pytorchjob.ipynb"
#2266
saileshd1402
opened
2 months ago
1
[SDK] Minor fix in wait_for_job_conditions with job_kind python training API
#2265
saileshd1402
closed
2 months ago
2
Update JAX image to use image published by Kubeflow
#2264
sandipanpanda
closed
2 months ago
4
Add helm charts for training operator
#2263
ChenYi015
opened
2 months ago
9
[SDK] Allow customising base trainer and storage images in Train API
#2261
varshaprasad96
closed
2 months ago
4
KEP-2170: Adding CEL validations on v2 TrainJob CRD
#2260
akshaychitneni
closed
1 month ago
10
Training Operator ROADMAP 2024
#2259
andreyvelich
opened
2 months ago
5
Add Changelog for Training Operator v1.8.1
#2258
andreyvelich
closed
2 months ago
2
Bump Training Python SDK to 1.8.1 version
#2257
andreyvelich
closed
2 months ago
2
Release Training SDK 1.8.1
#2256
andreyvelich
closed
2 months ago
2
[SDK] Read namespace from the current context
#2255
andreyvelich
closed
2 months ago
2
Update Prometheus monitoring docs for Training Operator
#2254
andreyvelich
closed
1 month ago
6
[SDK] Training Client Conditions related unit tests
#2253
Bobbins228
closed
1 month ago
4
Update README and out-of-date docs
#2252
andreyvelich
closed
2 months ago
3
KEP-2170: Implement skeleton webhook servers
#2251
tenzen-y
closed
2 months ago
7
[SDK] Fix typo of "get_pvc_spec"
#2250
helenxie-bit
closed
2 months ago
3
Create Slurm runtime for model training using V2 APIs
#2249
andreyvelich
opened
2 months ago
1
KEP-2170: Implement runtime framework
#2248
tenzen-y
closed
1 month ago
9
[SDK] Issues with trying to use train API with TinyLlama LLM
#2247
varshaprasad96
closed
2 months ago
7
[Test] E2e Tests for Notebook Examples
#2246
Electronic-Waste
opened
2 months ago
9
KEP-2170: Create model exporter for checkpointing and training output
#2245
andreyvelich
opened
2 months ago
1
Release-1.8: Cherry-pick of #2243
#2244
tenzen-y
closed
2 months ago
3
[Bug] Finish CleanupJob early if the job is suspended.
#2243
mszadkow
closed
2 months ago
5
Cherry pick of #2180 #2230 into v1.8-branch
#2242
andreyvelich
closed
2 months ago
3
[Release] Training operator 1.8.1 release
#2241
tenzen-y
closed
2 months ago
8
KEP-2170: Update Training V2 APIs in the KEP
#2240
andreyvelich
closed
2 months ago
8
Broken preemption on TFJob with non default runPolicy.ttlSecondsAfterFinished
#2239
mszadkow
closed
2 months ago
13
Clean up Go modules
#2238
tenzen-y
closed
3 months ago
3
KEP-2170: Generate CRD manifests for v2 CustomResources
#2237
tenzen-y
closed
2 months ago
11
KEP-2170: Initial Implementations for v2 Manager
#2236
tenzen-y
closed
2 months ago
6
Add DeepSpeed Example with Pytorch Operator
#2235
Syulin7
closed
1 month ago
7
Change isort profile to black for full compatibility
#2234
Ygnas
closed
3 months ago
2