intelligent-machine-learning/dlrover
DLRover: An Automatic Distributed Deep Learning System
Other
1.22k stars, 153 forks
issues
add logging for writing error
#1232
BalaBalaYi
closed
1 month ago
1
Update dlrover event action
#1231
samplise
closed
1 month ago
1
Fix fsdp dcp saver
#1230
BalaBalaYi
closed
1 month ago
0
Fix flash ckpt
#1229
BalaBalaYi
closed
1 month ago
1
Optimize diagnosis structure
#1228
BalaBalaYi
closed
1 month ago
1
【WIP】Temp solution for socket conflict.
#1227
BalaBalaYi
closed
1 month ago
1
fix unittest error: AttributeError: ElasticLaunchConfig object has no attribute tee
#1226
majieyue
closed
1 month ago
1
Why is model_optim_rng.pt saved in a separate directory?
#1225
zhaoyang-star
opened
1 month ago
7
Optimize test for ascend NPU.
#1224
BalaBalaYi
closed
1 month ago
1
Why is model_optim_rng.pt not saved when dlrover is enabled?
#1223
zhaoyang-star
closed
1 month ago
0
easydl/elasticjob-controller:master image pull error
#1222
xywangbuaa
opened
1 month ago
1
transformers version?
#1221
Alex-Ruan
closed
1 month ago
1
Optimize pending judgement: when all nodes pending
#1220
BalaBalaYi
closed
1 month ago
1
【WIP】add pod diagnosis feature
#1219
xiaochaoren
opened
2 months ago
1
remove xpu-timer
#1218
BalaBalaYi
closed
2 months ago
3
Fix scaler async execution.
#1217
BalaBalaYi
closed
2 months ago
1
Optimize logging in rdzv manager.
#1216
BalaBalaYi
closed
2 months ago
1
scale down allreduce pytorch job won't complete and reports error
#1215
cocodee
opened
2 months ago
1
Copy tensor to the shared memory without grad.
#1214
workingloong
closed
2 months ago
1
Add signal timeout for 'stop_workers'
#1213
BalaBalaYi
closed
2 months ago
2
Support action timeout processing.
#1212
BalaBalaYi
closed
2 months ago
1
Fix user-agent issue.
#1211
BalaBalaYi
closed
2 months ago
1
Resolve pending and insufficient nodes issue.
#1210
BalaBalaYi
closed
2 months ago
1
Keep conflict processing when env set
#1209
BalaBalaYi
closed
2 months ago
1
When performing multi-node, multi-GPU training with Megatron-LM, if the 'rank' is only passed in the startup script and not set in the environment variables, an exception may occur (storage type is disk)
#1208
lkq51
opened
2 months ago
2
Optimize node status from pod phase.
#1207
BalaBalaYi
closed
2 months ago
0
Support optional debug log level.
#1206
BalaBalaYi
closed
2 months ago
1
Enhancement of hccl port config resolution.
#1205
BalaBalaYi
closed
2 months ago
1
Increase heartbeat timeout
#1204
BalaBalaYi
closed
2 months ago
1
Expose important event
#1203
samplise
closed
1 month ago
1
Add default owners.
#1202
BalaBalaYi
closed
2 months ago
1
add more network check log
#1201
alpha-baby
closed
2 months ago
2
Fix heartbeat when there is node relaunched.
#1200
BalaBalaYi
closed
2 months ago
1
[Error] When using deepspeed to start a megatron training task, only rank 0 of the flash checkpoint saves the model
#1199
liangxuZhang
opened
2 months ago
1
Add try except for getting dead node.
#1198
BalaBalaYi
closed
2 months ago
0
fix time type issue
#1197
BalaBalaYi
closed
2 months ago
1
Add failure reporting for async ckpt saver.
#1196
BalaBalaYi
closed
2 months ago
2
What's the difference between MegatronCheckpointEngine and MegatronDistCheckpointEngine?
#1195
liangxuZhang
closed
2 months ago
0
optimize heartbeat logging
#1194
BalaBalaYi
closed
2 months ago
1
optimize elastic-run logging
#1193
BalaBalaYi
closed
2 months ago
1
Fix log file bugs
#1192
samplise
closed
2 months ago
0
Optimize hccl port detection
#1191
samplise
closed
2 months ago
1
Optimize failure node detection
#1190
samplise
closed
2 months ago
1
Fix heartbeat for concurrency.
#1189
BalaBalaYi
closed
2 months ago
2
Add std version output for agent.
#1188
BalaBalaYi
closed
2 months ago
1
Why can't the checkpoint be copied to shared memory asynchronously when using Flash Checkpoint?
#1187
Reflect0
closed
2 months ago
1
fix exception when plan is none
#1186
BalaBalaYi
closed
2 months ago
1
Skip restart training process on failure nodes
#1185
samplise
closed
2 months ago
1
Unify job manager's stop status field
#1184
BalaBalaYi
closed
2 months ago
1
Sync internal modification.
#1183
BalaBalaYi
closed
2 months ago
1