petuum/adaptdl
Resource-adaptive cluster scheduler for deep learning training.
https://adaptdl.readthedocs.io/
Apache License 2.0
422 stars
76 forks
issues
Calculating GNS for other optimizers
#143
HariSeldon11988
opened
3 weeks ago
0
ModuleNotFoundError: mitmproxy.proxy.config
#142
selinnilesy
opened
5 months ago
0
GPU throughput data of different models and different numbers
#141
kkzzzz0922
opened
10 months ago
0
Add adaptdl delete cmd
#140
hliangzhao
opened
1 year ago
0
hello_world submitted by adaptdl cannot go into `Running` state in Docker Desktop for Mac
#139
hliangzhao
closed
1 year ago
1
Submitting hello_world results in "ImagePullBackOff"
#138
Tweakzx
opened
1 year ago
3
Can't access TensorBoard when mnist_tensorboard.py is running
#137
xlcbingo1999
opened
1 year ago
0
Version of Pytorch and Cuda
#136
yuxiangwei0808
opened
1 year ago
0
KeyError occurs when auto-scaling happens in AdaptDL scheduler
#135
yuxiangwei0808
opened
2 years ago
0
Fix eksctl clusterconfig
#134
odp
closed
2 years ago
1
Problem when provisioning EKS cluster
#133
jason524w
closed
2 years ago
0
fix readthedocs config, part 2
#132
rmfan
closed
2 years ago
1
Fix readthedocs config
#131
rmfan
closed
2 years ago
0
Resolve new flake8 errors
#130
odp
closed
2 years ago
0
Bugfixes from Fairseq integration
#129
odp
closed
2 years ago
1
What does the _get_cluster_sizes function mean?
#128
tingshua-yts
opened
2 years ago
1
Improve documentation for adaptdl ray-aws
#127
rmfan
closed
2 years ago
1
Make `from_ray` True only for Tune scheduler
#126
odp
closed
2 years ago
1
Strange outputs when running dcgan example
#125
zxmeng98
opened
2 years ago
0
Problem when installing adaptdl scheduler
#124
gudiandian
closed
1 year ago
9
Stage1.5
#123
Xuezhi-Liang
opened
2 years ago
0
Progress in validation
#122
Rivendile
closed
2 years ago
0
A few problems when reproducing the benchmark
#121
gudiandian
closed
2 years ago
4
Large system overheads of AdaptDL
#120
gudiandian
closed
2 years ago
6
Fix adaptdl-ray release version
#119
odp
closed
2 years ago
1
Support apiextensions.k8s.io/v1 and admissionregistration.k8s.io/v1
#118
odp
closed
2 years ago
1
Integrating with PyTorch Lightning
#117
jaywonchung
opened
2 years ago
4
Handle default case of spec.preemptible
#116
odp
closed
2 years ago
0
Stage1
#115
Xuezhi-Liang
closed
2 years ago
1
Disable immediate allocation for NP jobs
#114
odp
closed
2 years ago
1
Use empty string for all-inclusive pod-label-selector
#113
odp
closed
2 years ago
0
hello_world can not run
#112
czq693497091
opened
2 years ago
7
Add adaptdl ray to index
#111
rmfan
closed
2 years ago
1
[Pollux, Reproducibility, Inquiry] Are dataset-fetching mechanisms broken?
#110
stet-stet
opened
2 years ago
3
Fix documentation
#109
rmfan
closed
2 years ago
0
Fix the ray links in documentation
#108
rmfan
closed
2 years ago
1
Use Ray 1.9 (internal) API changes
#107
odp
closed
2 years ago
1
Problems encountered during the installation of AdaptDL Helm Chart
#106
prz30
closed
2 years ago
2
Running the AdaptDL training process as something other than Process 1 causes checkpointing to fail.
#105
rmfan
opened
2 years ago
0
The meaning of progress
#104
gaow0007
closed
2 years ago
5
Upgrade pymoo to 0.5.0
#103
odp
closed
2 years ago
1
Add support to run an adaptdl job on a ray aws cluster
#102
rmfan
closed
2 years ago
2
Adaptive Tune Trial Scheduler
#101
odp
closed
2 years ago
3
"CUDA error: invalid resource handle" on Standalone Training
#100
HyeonchanKim
closed
2 years ago
1
Add support for iterable-style datasets in adaptdl.torch.AdaptiveDataLoader
#99
mylibrar
opened
2 years ago
0
Adaptive Batch Size for Single-GPU training
#98
gaow0007
closed
3 years ago
7
Confusion about Distributed Training
#97
gaow0007
closed
3 years ago
2
Benchmark Dataset for DeepSpeech2 in Pollux
#96
lynnliu030
closed
3 years ago
3
Adascale with Adam
#95
rmfan
closed
3 years ago
1
Print exceptions for torch hooks and callbacks
#94
aurickq
closed
3 years ago
1