intelligent-machine-learning/dlrover
DLRover: An Automatic Distributed Deep Learning System
Other
1.27k stars
167 forks
issues
Skip memory limitation for gpu type node relaunch operation.
#1341
BalaBalaYi
opened
2 hours ago
1
fix a bug in infer method
#1340
jlsong01
opened
1 day ago
1
Fix the issue that len(indices) and num_samples might not be equal
#1339
sunjq1
opened
2 days ago
0
Adapt DLRover to a new accelerator type and implement a script similar to Nvidia_gpu.py
#1338
lulu-0126
opened
2 days ago
1
client.connect(path) error when saving checkpoint
#1337
atomrun39
opened
2 days ago
1
Update build_proto.sh to use pip3 instead of pip
#1336
jinqinn
closed
57 minutes ago
1
WIP: handle GPU lost in resource monitor
#1335
samplise
opened
3 days ago
1
Fix diagnosis agent action consuming
#1334
BalaBalaYi
closed
3 days ago
1
make deploy: image pull failed
#1333
Ind1x1
closed
5 days ago
0
AttributeError: module 'collections' has no attribute 'Sequence'
#1332
linzhidao1010
opened
5 days ago
0
fix process leak in ascend npu
#1331
majieyue
closed
3 days ago
1
Delete pod when list pod already succeeded.
#1330
BalaBalaYi
closed
5 days ago
1
Skip heartbeat timeout for failed and exited worker
#1329
BalaBalaYi
closed
5 days ago
1
Function using/naming optimized based on job context.
#1328
BalaBalaYi
closed
1 week ago
1
Improve job context function using.
#1327
BalaBalaYi
closed
1 week ago
1
Fix known issue of job context using.
#1326
BalaBalaYi
closed
1 week ago
1
Fix proto install script.
#1325
BalaBalaYi
closed
1 week ago
1
some refinement on code comment
#1324
jlsong01
closed
1 week ago
1
Job exit when all nodecheck failed
#1323
majieyue
closed
1 week ago
1
feat: generate pyi files for protobuf definitions
#1322
Peefy
opened
1 week ago
0
Expose ckpt events
#1321
samplise
opened
1 week ago
1
fix: typo RayJobSubmitter in ray_job_submitter.py
#1320
Peefy
closed
1 week ago
1
Job context implementation
#1319
samplise
closed
1 week ago
1
Refactor diagnosis manager
#1318
samplise
closed
5 days ago
1
Fix empty node issue after master failover
#1317
BalaBalaYi
closed
1 week ago
1
Revert grpc envs setting.
#1316
BalaBalaYi
closed
1 week ago
1
Refactor node event report and report succeeded.
#1315
BalaBalaYi
closed
2 weeks ago
1
Can DLRover be applied to diffusion transformer training, and combined with DeepSpeed?
#1314
TomSuen
opened
2 weeks ago
1
fix a typo in user docs
#1313
taylor840326
closed
2 weeks ago
0
Fix: Use return instead of break to correctly exit loop on path exist…
#1312
jinqinn
closed
6 days ago
1
Add Balance Loss to MoE Example for Enhanced Expert Load Distribution (Issue #1300)
#1311
Mukku27
opened
3 weeks ago
2
The controller manager restarts frequently
#1310
sunjq1
opened
3 weeks ago
0
More logging on device using.
#1309
BalaBalaYi
closed
3 weeks ago
1
Set grpc env for optimization.
#1308
BalaBalaYi
closed
3 weeks ago
1
add multiple protobuf version support
#1307
majieyue
closed
2 weeks ago
3
Revert "Add grpc envs to improve stability."
#1306
BalaBalaYi
closed
3 weeks ago
1
Add grpc envs to improve stability.
#1305
BalaBalaYi
closed
3 weeks ago
1
update moe example
#1304
skydoorkai
closed
3 weeks ago
1
More pending fastfail strategy.
#1303
BalaBalaYi
closed
3 weeks ago
1
[WIP] Refactor diagnosis manager
#1302
samplise
closed
1 week ago
1
Fix/Enhance node management.
#1301
BalaBalaYi
closed
3 weeks ago
1
Add balance loss in atorch moe example
#1300
skydoorkai
opened
1 month ago
0
Compatible torch 2.4(2.3) storage writer.
#1299
BalaBalaYi
closed
1 month ago
1
How does dlrover make sure all the nodes in one job are in one switch
#1298
gangxie112
opened
1 month ago
1
Fix issue in ckpt saver.
#1297
BalaBalaYi
closed
1 month ago
1
Fix relaunch node's relaunch limit.
#1296
BalaBalaYi
closed
1 month ago
1
Catch exception when get config from brain
#1295
samplise
closed
1 month ago
1
yaml lint
#1294
BalaBalaYi
closed
1 month ago
1
add exception handler in _get_master_addr_port since the port might b…
#1293
majieyue
closed
4 weeks ago
1
add stale issue workflow
#1292
BalaBalaYi
closed
1 month ago
0