intelligent-machine-learning/dlrover
DLRover: An Automatic Distributed Deep Learning System
Other
1.22k stars
153 forks
issues
Compatible torch 2.4.x (0926)
#1282
BalaBalaYi
closed
21 hours ago
1
Upgrade version 0.3.8
#1281
BalaBalaYi
closed
2 days ago
1
upgrade version 0.3.8
#1280
BalaBalaYi
closed
2 days ago
1
Compatible with torch2.4
#1279
BalaBalaYi
closed
1 day ago
1
Enlarge heartbeat timeout
#1278
BalaBalaYi
closed
5 days ago
1
Worker get elastic run config from master
#1277
samplise
opened
5 days ago
3
Optimize logging
#1276
BalaBalaYi
closed
1 week ago
1
DLRover - Flyte integration
#1275
davidmirror-ops
opened
1 week ago
2
add RELEASES.md MAINTAINERS.md CONTRIBUTING.md CODE_OF_CONDUCT.md
#1274
majieyue
closed
1 week ago
1
Fix bug.
#1273
BalaBalaYi
closed
1 week ago
0
Filter error code log
#1272
samplise
closed
2 weeks ago
1
Skip empty error codes when check node failures
#1271
samplise
closed
2 weeks ago
1
Add ut for log collecting.
#1270
BalaBalaYi
closed
2 weeks ago
1
Filter non training logs
#1269
samplise
closed
2 weeks ago
1
Make diagnosis agent singleton
#1268
samplise
closed
1 week ago
1
Fix diagnosis configure bugs
#1267
samplise
closed
2 weeks ago
1
missing elastic_training_pb2
#1266
NiushanDong
opened
2 weeks ago
1
optimize check abnormal nodes
#1265
BalaBalaYi
closed
2 weeks ago
1
optimize heartbeat collect
#1264
BalaBalaYi
closed
2 weeks ago
1
Flash checkpoint does not support safetensors
#1263
Alex-Ruan
opened
2 weeks ago
0
optimize diagnose logging
#1262
BalaBalaYi
closed
2 weeks ago
1
fix socket error
#1261
BalaBalaYi
closed
2 weeks ago
1
Errors in dlrover after pip-installing the dlrover package
#1260
Desperadoze
opened
2 weeks ago
2
Fix duplicate pod relaunching for some cases(with internal k8s).
#1259
BalaBalaYi
closed
3 weeks ago
1
Optimize pending timeout using
#1258
BalaBalaYi
closed
3 weeks ago
1
Optimize and fix events expose
#1257
samplise
closed
3 weeks ago
1
Does deepspeed zero3 also save ckpt only in rank 0?
#1256
Alex-Ruan
closed
2 weeks ago
1
Skip pending timeout when timeout=0.
#1255
BalaBalaYi
closed
1 month ago
1
Fix several issues when using fsdp checkpointer.
#1254
BalaBalaYi
closed
1 month ago
1
Optimize network-check.
#1253
BalaBalaYi
closed
1 month ago
1
Optimize training ending
#1252
BalaBalaYi
closed
1 month ago
1
Fix path creation in fsdp dcp saver.
#1251
BalaBalaYi
closed
1 month ago
1
Use rdzv timeout as insufficient timeout
#1250
BalaBalaYi
closed
1 month ago
1
Fix rdzv updating in concurrency
#1249
BalaBalaYi
closed
1 month ago
1
Can you create a dlrover arm64 image for Ascend NPU?
#1248
xmarker
opened
1 month ago
1
Add alive pod stats variable.
#1247
BalaBalaYi
closed
1 month ago
1
Optimize logging and revert using random to create socket
#1246
BalaBalaYi
closed
1 month ago
1
Revert "【WIP】Temp solution for socket conflict."
#1245
BalaBalaYi
closed
1 month ago
1
Question: How does DLRover integrate with Llama Factory?
#1244
hetingyou
opened
1 month ago
1
What is the relationship between DLRover and Megatron? Can I integrate DLRover with Megatron to get fault-tolerance and monitoring capabilities? How can DLRover recover from GPU offline problems when TP and PP need to be reorganized?
#1243
dotsonliu
opened
1 month ago
1
Add validation for 'critical_worker_index'
#1242
BalaBalaYi
closed
1 month ago
1
Update dingding
#1241
BalaBalaYi
closed
1 month ago
1
update ding group
#1240
BalaBalaYi
closed
1 month ago
1
Fix type.
#1239
BalaBalaYi
closed
1 month ago
1
Skip 'should early stop' for non all reduce job.
#1238
BalaBalaYi
closed
1 month ago
1
Remove error code 128 from 'hardware-error'
#1237
BalaBalaYi
closed
1 month ago
1
Fix master client setup in ckpt saver.
#1236
BalaBalaYi
closed
1 month ago
1
Optimize checkpointing.
#1235
BalaBalaYi
closed
1 month ago
0
Refactor diagnose agent
#1234
samplise
closed
1 month ago
1
Error occurs in load_checkpoint when using megatron distributed flash-checkpoint for recovery
#1233
deepcoldfish
opened
1 month ago
0