microsoft/DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0
33.6k
stars
3.94k
forks
issues
Newest
[BUG] Running llama2-7b step3 with tensor parallel and HE fails due to incompatible shapes
#5656
ShellyNR
opened
2 weeks ago
0
[BUG] Model gets stuck at trainer.train() and never starts training
#5655
limllzu
closed
2 weeks ago
0
Fix latest pytorch '_get_socket_with_port' import error
#5654
Yejing-Lai
closed
1 week ago
6
[BUG] oneapi/ccl.hpp: No such file or directory.
#5653
weiji14
opened
2 weeks ago
1
Fix hpZ with zero element
#5652
samadejacobs
closed
1 week ago
0
Update version.txt after 0.14.3 release
#5651
mrwyattii
closed
2 weeks ago
0
Unpin transformers version
#5650
loadams
opened
2 weeks ago
0
Install issue with setuptools 70
#5649
myBigbug
closed
2 weeks ago
2
RuntimeError: still have inflight params[BUG]
#5648
iszengxin
opened
2 weeks ago
1
Inference with the MoE based GPT model trained by ds_pretrain_gpt_345M_MoE128.sh [BUG]
#5647
haoranlll
opened
2 weeks ago
0
[BUG] File not found in autotuner cache in multi-node setting on SLURM
#5646
jubueche
opened
2 weeks ago
1
Why doesn't deepspeed stage 3 allow a batch size of 1 with multiple GPUs?
#5645
AceMcAwesome77
opened
2 weeks ago
0
[BUG] RuntimeError encountered when generating tokens from a Meta-Llama-3-8B-Instruct model initialized with 4-bit or 8-bit quantization
#5644
Atry
opened
2 weeks ago
2
Fix memory leak from _hp_mapping
#5643
chiragjn
closed
2 days ago
1
[BUG] 1 line logic issue: flipped sign/direction in `_partition_param_sec` of `partition_parameters.py`?
#5642
dukleryoni
closed
1 week ago
1
[BUG] tortoise_tts.py fails on deepspeed/pydantic error
#5641
tholonia
opened
2 weeks ago
1
Does deepspeed support aarch64?
#5640
khayamgondal
opened
2 weeks ago
6
[HELP] How to safely switch trainable parameters in ZeRO-3 stage?
#5639
Ledzy
closed
1 week ago
2
Install errors on Windows
#5638
xalteropsx
closed
2 weeks ago
5
DeepSpeed ZeRO-3 + QLoRA problem: params are not sharded before being loaded onto each GPU!
#5637
CHNRyan
opened
2 weeks ago
0
[BUG] 4-bit quantized models would repeatedly generate the same tokens when bf16.enabled is true
#5636
Atry
opened
2 weeks ago
1
Deepspeed stage 3 hanging after 1st validation sample
#5635
AceMcAwesome77
opened
2 weeks ago
0
[BUG] RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
#5634
fahadh4ilyas
opened
2 weeks ago
1
Monitor was always enabled causing performance degradation
#5633
deepcharm
closed
2 weeks ago
2
stage_1_and_2: optimize clip calculation to use clamp
#5632
nelyahu
closed
2 weeks ago
0
[BUG] is_zero_init_model is always False when I'm using zero_init!
#5631
CHNRyan
opened
3 weeks ago
4
[BUG] RuntimeError encountered when generating tokens from a DeepSpeedHybridEngine initialized with 4-bit quantization.
#5630
Atry
opened
3 weeks ago
2
Pin transformers version for MII tests
#5629
loadams
closed
3 weeks ago
0
Pin accelerate version to 0.30.1
#5628
loadams
closed
3 weeks ago
0
[BUG] 1: error: must run as root and 2: raise RuntimeError("Ninja is required to load C++ extensions")
#5627
YangBrooksHan
opened
3 weeks ago
0
reduce all-to-all communication volume when both expert and non-expert are tensor-parallel
#5626
taozhiwei
opened
3 weeks ago
15
Hybrid Offloading for ZeRO3
#5625
tohtana
opened
3 weeks ago
0
fix: quantization with DeepSpeed HE
#5624
Atry
opened
3 weeks ago
2
[BUG] RuntimeError: Error building extension 'fused_adam' Loading extension module fused_adam
#5623
JinQiangWang2021
opened
3 weeks ago
0
Updated hpu-gaudi2 tests content.
#5622
vshekhawat-hlab
closed
3 weeks ago
1
Test just updating HPU docker image
#5621
loadams
closed
3 weeks ago
1
[REQUEST] Moving a trainable model with an optimiser between GPU and CPU
#5620
kfertakis
opened
3 weeks ago
0
[BUG] Pipeline Dataloader Sampler: `shuffle=False`
#5619
Coobiw
opened
3 weeks ago
0
[BUG] ZeRO optimizer with MoE Expert Parallelism
#5618
Jack47
opened
3 weeks ago
1
[HELP] ZeRO3 partitions parameters only after fully loading them onto each GPU!
#5617
CHNRyan
closed
1 week ago
7
nv-ds-chat CI test failure
#5616
github-actions[bot]
opened
3 weeks ago
0
Reset Optimizer
#5615
ahorazahedi
closed
3 weeks ago
1
Add support for Phi-3 small to FastGen
#5614
adk9
opened
3 weeks ago
0
fixes in _partition_param_sec function
#5613
mmhab
closed
2 weeks ago
0
[INF] Enable torch compile for inference
#5612
oelayan7
opened
3 weeks ago
5
Add compile backend arg for test_set_compiler_fn
#5611
vshekhawat-hlab
closed
3 weeks ago
2
Upgrade HPU image to v1.16.2.
#5610
vshekhawat-hlab
opened
3 weeks ago
0
Fixed Windows inference build.
#5609
costin-eseanu
closed
4 days ago
0
Add an argument to enable the injection of missing state during the conversion of universal checkpoints
#5608
xylian86
closed
2 days ago
0
[REQUEST] Upstream modifications of PaRO
#5607
youshaox
opened
3 weeks ago
0