kubeflow/training-operator
Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.58k stars, 687 forks
issues
KEP-2170: Add TrainJob and TrainingRuntime APIs
#2223
andreyvelich
closed
1 month ago
15
KEP-2170: Bind repository into the build environment instead of filecopy
#2222
tenzen-y
closed
1 month ago
2
KEP-2170: Add directories for the V2 APIs
#2221
andreyvelich
closed
1 month ago
4
Regarding whether the tf-job-operator v1.0 metrics can expose specific failed pods
#2220
SecretSun
opened
1 month ago
2
KEP-2170: Implement validations for TrainingRuntime and ClusterTrainingRuntime
#2219
tenzen-y
opened
1 month ago
1
KEP-2170: Support the PodSpecOverrides API in TrainJob
#2218
andreyvelich
opened
1 month ago
0
KEP-2170: Create MPI Runtime
#2217
andreyvelich
opened
1 month ago
4
Design Kubeflow Python SDK for Training V2
#2216
andreyvelich
opened
1 month ago
0
KEP-2170: Generate OpenAPI spec for V2 APIs
#2215
andreyvelich
opened
1 month ago
0
KEP-2170: Update documentation for V2 APIs
#2214
andreyvelich
opened
1 month ago
0
KEP-2170: Add E2E tests for TrainJob
#2213
andreyvelich
opened
1 month ago
0
KEP-2170: Create LLM training runtime for Llama 2 7b
#2212
andreyvelich
opened
1 month ago
0
KEP-2170: Create PyTorch multi-node distributed training runtime
#2211
andreyvelich
opened
1 month ago
1
KEP-2170: Create dataset and model initializers
#2210
andreyvelich
opened
1 month ago
0
KEP-2170: Implement validations for TrainJob
#2209
andreyvelich
opened
1 month ago
3
KEP-2170: Create Kustomize manifests to deploy JobSet and TrainJob controllers
#2208
andreyvelich
opened
1 month ago
0
KEP-2170: Create controller for TrainJob
#2207
andreyvelich
opened
1 month ago
1
KEP-2170: Add APIs for TrainJob and TrainingRuntime
#2206
andreyvelich
closed
1 month ago
2
[SDK] test: add unit test for get_job method of the training_client
#2205
Bobbins228
closed
3 weeks ago
5
Back-off pulling image "alpine:3.10"
#2204
lizu18xz
opened
1 month ago
6
[Feature] Support managed by external controller
#2203
mszadkow
closed
1 week ago
19
[SDK] test: added unit tests for delete_job() method
#2202
Bobbins228
closed
1 month ago
5
KEP-2170: Add the apiGroup to the TrainingRuntimeRef
#2201
tenzen-y
closed
1 month ago
3
Why overwrite RestartPolicy in podTemplate
#2200
Bowser1704
closed
1 month ago
2
Add e2e test for train API
#2199
helenxie-bit
opened
1 month ago
16
KEP-2170: Make API specification more restricting
#2198
tenzen-y
closed
1 month ago
4
Add parameters to helper functions
#2197
helenxie-bit
closed
1 month ago
4
[SDK] Add UTs for `wait_for_job_conditions`
#2196
Electronic-Waste
closed
1 month ago
4
Enhance pre-commit hooks with flake8 linting
#2195
Ygnas
closed
1 month ago
6
Add JAX controller
#2194
sandipanpanda
closed
1 week ago
12
Add support for the `managedBy` field
#2193
mimowo
closed
1 week ago
17
[SDK] Unit tests for TrainingClient APIs - get_job_pod_names and update_job
#2192
YosiElias
closed
1 month ago
6
Add GitHub Actions Workflow for Python Code Formatting and Linting
#2191
Ygnas
closed
1 month ago
4
fix: incorrect initialize null replicaStatuses lead to update JobStat…
#2190
PeterChg
closed
2 months ago
4
fix: incorrect initialize null replicaStatuses lead to update JobStat…
#2189
PeterChg
closed
2 months ago
1
fix: incorrect initialize null replicaStatuses lead to update JobStat…
#2188
PeterChg
closed
2 months ago
1
Update the name of PVC in `train` API
#2187
helenxie-bit
closed
1 month ago
3
[SDK] Add e2e tests to fine-tune LLMs with `train` API
#2186
andreyvelich
opened
2 months ago
2
Training job restart enhancement
#2185
emeraldbay
opened
2 months ago
13
Implement pre-commit hooks
#2184
droctothorpe
closed
1 month ago
10
Consider container image rename of `kubeflow/storage-initializer`
#2183
tarilabs
opened
2 months ago
5
Support richer volcano scheduling
#2182
shaoqingyang
opened
2 months ago
2
Update trainer to ensure type consistency for `train_args` and `lora_config`
#2181
helenxie-bit
closed
1 month ago
12
Update `huggingface_hub` Version in the storage initializer to fix ImportError
#2180
helenxie-bit
closed
1 month ago
6
"ImportError" when running fine-tuning API
#2179
helenxie-bit
closed
1 month ago
0
Enable pre-commit for repo
#2178
droctothorpe
closed
1 month ago
4
Migrate to controller-runtime logger in mpi job controller
#2177
champon1020
opened
2 months ago
2
Encountered an error while running the example in the document train_api_hf_dataset
#2176
cho-chem
opened
2 months ago
6
[SDK] Add more unit tests for TrainingClient APIs - get_job_pods
#2175
YosiElias
closed
2 months ago
5
Unit test training client get job pods
#2174
YosiElias
closed
2 months ago
1