issues
kubeflow/pytorch-operator
PyTorch on Kubernetes
Apache License 2.0
307 stars
143 forks
unable to build image for ppc64le
#365
gajanankulkarni-18
opened
3 years ago
0
PytorchJob DDP training will stop if I delete a worker pod
#364
Shuai-Xie
opened
3 years ago
2
Running https://github.com/kubeflow/pytorch-operator/blob/master/sdk/python/test/test_e2e.py failed
#363
sxl1993
opened
3 years ago
1
Multi-gpu in a single pod
#362
wallarug
opened
3 years ago
2
Add notice before archiving
#361
terrytangyuan
closed
3 years ago
2
Service label mismatches selector, which results in inconsistency
#360
konnase
opened
3 years ago
3
The training hangs after reloading one of master/worker pods
#359
dmitsf
opened
3 years ago
5
Can not use volcano for Gang Scheduling
#358
bug-developer021
closed
3 years ago
0
Support setting the volcano queue name
#357
qiankunli
opened
3 years ago
2
Can I freeze pytorchjob training pods and migrate them to other nodes?
#356
Shuai-Xie
opened
3 years ago
9
Pytorch version may have an effect on the training reproduction
#355
Shuai-Xie
opened
3 years ago
4
Different DDP training results of PytorchJob and Bare Metal
#354
Shuai-Xie
opened
3 years ago
6
Can I use hostNetwork to run PytorchJob like on bare metal
#353
Shuai-Xie
closed
3 years ago
3
Can PytorchJob skip or cancel the init container?
#352
SeibertronSS
opened
3 years ago
2
volcano change the PodGroup CRD APIGroup to volcano.sh
#351
qiankunli
opened
3 years ago
1
How to use DDP in pytorch operator?
#350
SeibertronSS
closed
3 years ago
3
Why does the worker need an initContainer in pytorch-operator?
#349
zqz-net
closed
3 years ago
2
container "pytorch" is waiting to start: PodInitializing
#348
gogogwwb
opened
3 years ago
20
Upgrade to v1 CRDs
#347
mcristina422
opened
3 years ago
1
[feat] Support PyTorch 1.9
#346
gaocegege
opened
3 years ago
3
Fix: Change PTL to release version
#345
jagadeeshi2i
closed
3 years ago
2
PytorchJob replicas has different node affinity behaviors compared with Deployment
#344
Shuai-Xie
opened
3 years ago
4
Update the versions of common, tfjob and some other modules
#343
paipaoso
closed
3 years ago
7
What is the difference between master and worker?
#342
SeibertronSS
closed
3 years ago
6
Fix 'Invalid Pointer' error when PytorchJob is deleted
#341
alembiewski
closed
3 years ago
4
Feel confused about world_size
#340
ldd91
closed
3 years ago
0
`init-pytorch` init container image configurable
#339
apatil4
closed
3 years ago
4
Add job namespace to `pytorch_operator_jobs_*` counters
#338
alembiewski
closed
3 years ago
4
Bert example with Pytorch Lightning
#337
jagadeeshi2i
closed
3 years ago
2
Adding example config file
#336
johnugeorge
closed
3 years ago
4
Worker template should be configurable.
#335
MartinForReal
opened
3 years ago
1
PyTorch Lightning Example.
#334
tchaton
closed
3 years ago
0
'host not found' error occurs during PyTorch distributed learning
#333
JGoo1
opened
3 years ago
1
NCCL "Connection Refused" for Worker Pods
#332
twolffpiggott
opened
3 years ago
1
Is a multi-GPU-per-pod setup supported in PytorchJob?
#331
tingweiwu
opened
3 years ago
1
can I use PyTorchJobClient inside a pod of the cluster?
#330
omlomloml
opened
3 years ago
1
Worker gets connection timed out error in user namespace with sidecar.istio.io/inject=false
#329
tingweiwu
closed
3 years ago
1
Is there a simpler way to install pytorch-operator?
#328
tingweiwu
closed
3 years ago
2
Change mnist example to use FashionMNIST
#327
Jeffwan
closed
3 years ago
2
Temporarily disable mnist test case
#326
Jeffwan
closed
3 years ago
3
Mnist dataset server is down
#325
Jeffwan
opened
3 years ago
5
[DO NOT MERGE] Change to test CI
#324
yanniszark
closed
3 years ago
4
pytorch-operator: Consolidate manifests
#323
yanniszark
closed
3 years ago
7
pytorch-operator: Consolidate manifests
#322
yanniszark
closed
3 years ago
1
Operator has invalid memory address error on specific pytorchjob spec
#321
ca-scribner
opened
3 years ago
1
PyTorch Operator: Move manifests development upstream
#320
yanniszark
closed
3 years ago
4
Unable to spawn PyTorchJob due to alpine image dependency of pytorch-operator
#319
asahalyft
opened
3 years ago
4
PyTorch Operator: Move manifests development upstream
#318
yanniszark
closed
3 years ago
0
Is python sdk still being maintained?
#317
ca-scribner
opened
3 years ago
7
Migrate to new test-infra
#316
PatrickXYS
closed
3 years ago
36