pytorch/elastic issues
PyTorch elastic training
BSD 3-Clause "New" or "Revised" License
730 stars, 98 forks
[examples/imagenet/main.py] Why doesn't elastic code contain gpu sync to compute performance, e.g. all_reduce
#170
aitrics-chris
opened
2 years ago
0
RuntimeError: Expected all tensors to be on the same device, but found at least two devices
#169
aitrics-chris
closed
2 years ago
4
Please add more torch elastic training examples
#168
wanziyu
opened
2 years ago
0
(torchelastic) update README to point elastic CRD users to TorchX
#167
kiukchung
closed
2 years ago
1
BASE_IMG upgrade for Dockerfile after PyTorch1.10
#166
ghost
closed
2 years ago
3
rendezvous: _matches_machine_hostname doesn't resolve hostnames fully
#165
d4l3k
opened
2 years ago
2
Kubernetes: ttlSecondsAfterFinished not working in ElasticJob spec
#164
jovan-absci
opened
2 years ago
0
(torchx/specs) Remove RunConfig in favor of using Dict[str, CfgVal] directly
#163
kiukchung
closed
2 years ago
6
[feature request] Add CPU example
#162
gaocegege
opened
3 years ago
2
Remove unconfigured submodule
#161
jonathan-conder-sm
opened
3 years ago
3
Is petctl also deprecated?
#160
vadimkantorov
opened
3 years ago
0
[feature request] petctl to support pulling script directory from github repo by commit or tag
#159
vadimkantorov
opened
3 years ago
0
submodule path docs/src/pytorch-sphinx-theme not in .gitmodules
#158
jonathan-conder-sm
opened
3 years ago
0
Kubernetes CustomResourceDefinition Moving out of Beta
#157
5had3z
closed
3 years ago
4
[Blocked] feat(dockerfile): Use Torch 1.9 instead of nightly
#156
gaocegege
closed
3 years ago
4
Various improvements to `torch.distributed.launch` and `torch.distributed.run` (#60925)
#155
aivanou
closed
3 years ago
2
update docs to point to pytorch 1.9 and torchx for torchelastic and tsm (respectively)
#154
kiukchung
closed
3 years ago
2
Update index.md
#153
brianjo
closed
3 years ago
2
EtcdStore: AttributeError: can't set attribute
#152
vv-p
opened
3 years ago
1
Cannot reuse --rdzv_id between different elastic launch ?
#151
PKUFlyingPig
opened
3 years ago
0
Imagenet example fails during accuracy calculation (v0.2.2 on 1.8.1)
#150
assapin
closed
3 years ago
1
add support for jetter to Role (base_image) for mast launches
#149
kiukchung
closed
3 years ago
2
Move torchelastic docs *.rst (#56811)
#148
kiukchung
closed
3 years ago
2
Out of Date documentation
#147
Godricly
closed
3 years ago
4
Improve the implementation of `RendezvousParameters` and add its unit tests. (#54807)
#146
cbalioglu
closed
3 years ago
2
ModuleNotFoundError: No module named 'torch.distributed.elastic'
#145
GwangsooHong
closed
3 years ago
4
Fix python required version > 3.6 bug
#144
jenhaoyang
opened
3 years ago
8
Move torchelastic/events to torch/distributed/events
#143
aivanou
closed
3 years ago
2
Support PyTorch 1.8, TorchVision 0.9.0 and TorchAudio 0.8.0
#142
davidspek
closed
3 years ago
7
Move torchelastic/rendezvous to torch/distributed/rendezvous
#141
kiukchung
closed
3 years ago
2
Torch Elastic - How to make sure all nodes are in the same AZ?
#140
thecooltechguy
closed
3 years ago
2
[*.py] Rename "Arguments:" to "Args:"
#139
SamuelMarks
closed
3 years ago
3
Remove NCCL Blocking Wait from Imagenet Example
#138
osalpekar
closed
3 years ago
2
Minor fix in Issue Reporting Template
#137
osalpekar
closed
3 years ago
3
Enable NCCL_ASYNC_ERROR_HANDLING in Torchelastic
#136
osalpekar
closed
3 years ago
1
Pytorch Lightning with TorchElastic - One worker doesn't start
#135
tchaton
closed
3 years ago
3
Elastic agent doesn't detect worker failures in NCCL
#134
ruipeterpan
closed
3 years ago
4
Enable NCCL_ASYNC_ERROR_HANDLING in torchelastic
#133
osalpekar
closed
3 years ago
4
Fix circle CI breakage by depending on torch-1.8.0dev (nightly)
#132
kiukchung
closed
3 years ago
2
unbind scheduler from session and make session apis take a scheduler backend
#131
kiukchung
closed
4 years ago
4
How to programmatically determine if a training job has finished using `kubectl`?
#130
darthsuogles
closed
4 years ago
2
make torchelastic.distributed.launch args settable from env var with name PET_ARG
#129
kiukchung
closed
4 years ago
2
Add env support for the training script argument
#128
kuikuikuizzZ
closed
4 years ago
4
Add tsm docs to the docs page, added --dry-run flag to doc_push.sh, fix a few docstring typos
#127
kiukchung
closed
4 years ago
2
add ui url to the return value of session.status(app_id)
#126
kiukchung
closed
4 years ago
2
add replica_id macro
#125
kiukchung
closed
4 years ago
2
accept role as a command line argument
#124
yifuwang
closed
4 years ago
2
implement ElasticRole, role args macro substitution
#123
kiukchung
closed
4 years ago
2
move pytorch/elastic/test/** into pytorch/elastic/torchelastic/**/test/**
#122
kiukchung
closed
4 years ago
4
test pyenv
#121
kiukchung
closed
4 years ago
0