kubeflow
/
pytorch-operator
PyTorch on Kubernetes
Apache License 2.0
306
stars
143
forks
issues
add dependabot config script
#315
davidspek
opened
3 years ago
4
Please create v1.2-branch
#314
SatwikBhandiwad
closed
3 years ago
3
dist.init_process_group stuck
#313
ravenj73
opened
3 years ago
9
kubeflow pipelines sdk, distributed multi-node training with autoscaling
#312
rami3e
closed
3 years ago
4
Does pytorch-operator just simplify the use of nn.parallel.DistributedDataParallel on multiple multi-GPU nodes?
#311
lwj1980s
closed
3 years ago
2
can I use gpus on specific node to train
#310
lwj1980s
closed
3 years ago
5
Add @andreyvelich to approvers
#309
andreyvelich
closed
3 years ago
2
Reuse Common Scripts for Creating / Deleting EKS clusters
#308
PatrickXYS
closed
3 years ago
6
Do not trigger presubmit jobs for simple changes
#307
Jeffwan
opened
3 years ago
1
Add Jeffwan@ to OWNERS
#306
Jeffwan
closed
3 years ago
11
Move PyTorch Operator e2e tests to AWS Prow
#305
Jeffwan
closed
3 years ago
35
how can I run a pytorch job with all my Gpu resources
#304
lwj1980s
closed
3 years ago
4
Add test friendly manifests
#303
Jeffwan
closed
3 years ago
6
Make manifest test friendly
#302
Jeffwan
closed
3 years ago
2
Support manifest on Kubernetes 1.16+
#301
Jeffwan
closed
3 years ago
6
Updated the image name format for the gcr.io.
#300
wuchen03
opened
4 years ago
11
Activate Travis in PR check
#299
andreyvelich
opened
4 years ago
2
Change cluster version to 1.16 for e2e test
#298
andreyvelich
closed
4 years ago
2
Test webhook
#297
Jeffwan
closed
4 years ago
1
Support Torch Elastic in pytorch operator
#296
Jeffwan
opened
4 years ago
2
update pytorch-operator deployment manifests file
#295
myonlyzzy
closed
3 years ago
15
pytorch-operator pod CheckCRDExist failed
#294
myonlyzzy
closed
3 years ago
3
Fix Unit Tests
#293
andreyvelich
closed
3 years ago
24
[bug] Unit test is broken
#292
gaocegege
opened
4 years ago
4
'./pytorch_job_sendrecv.yaml' missing in pytorch-operator/examples/smoke-dist
#291
Lyken17
closed
3 years ago
6
Update README.md
#290
pingsutw
closed
4 years ago
2
Update CRD link
#289
pingsutw
closed
4 years ago
1
support cleanPodPolicy is Running, same as tf operator
#288
jiaqianjing
closed
4 years ago
10
how to create a local non-distributed training
#287
houz42
closed
4 years ago
7
chore: Update OWNERS
#286
gaocegege
closed
4 years ago
2
Adds notes and example annotation for pytorch job
#285
shawnzhu
closed
4 years ago
3
PyTorchJob CRD definition link is broken
#284
sakaia
closed
3 years ago
2
Do we need pod name and namespace in manifests?
#283
gaocegege
opened
4 years ago
2
Migrate code implementation to kubeflow/common fashion
#282
Jeffwan
opened
4 years ago
3
Where are the pytorch-crd and pytorch-operator YAML files?
#281
g-karthik
closed
4 years ago
8
Update swagger-codegen-cli URL
#280
jinchihe
closed
4 years ago
1
Why worker has init container wait for master ready?
#279
jiaqianjing
opened
4 years ago
3
How to run single-machine job?
#278
jiaqianjing
closed
4 years ago
10
fix Dockerfile-mpi download miniconda.sh
#277
jiaqianjing
closed
4 years ago
9
Fix minor OpenShift issues - resource requests, Dockerfile
#276
vpavlin
closed
4 years ago
2
OCI Runtime error for init-pytorch on AKS
#275
wangdian
closed
4 years ago
3
Update openapi-gen to not rely on vendor
#274
Jeffwan
closed
4 years ago
7
Cut release for pytorch operator
#273
Jeffwan
opened
4 years ago
4
Migrate pytorch-operator to go modules
#272
Jeffwan
closed
4 years ago
3
Distributed mnist is unexpectedly slow
#271
panchul
opened
4 years ago
7
[feature] Rethink distributed Pytorch backoff retry
#270
czheng94
opened
4 years ago
6
[examples/mnist]README.md instruction should be modified
#269
sakaia
opened
4 years ago
1
[examples/smoke_dist] pytorch_job_sendrecv.yaml does not exist in the directory
#268
sakaia
opened
4 years ago
1
PyTorch Operator recognizes kubernetes cluster like single machine?
#267
aheeruru
closed
4 years ago
2
Fix the link to run_e2e_workflow.py script
#266
terrytangyuan
closed
4 years ago
2