microsoft
/
DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0
33.56k
stars
3.93k
forks
issues
[REQUEST]win10 install fail build_win.bat
#5698
shark-xiake
opened
8 hours ago
1
Disable nvtx decorator to avoid graph break
#5697
tohtana
opened
1 day ago
0
Tensor(hidden states)missing across GPU in Pipeline Parallelism Training[BUG]
#5696
Youngluc
opened
1 day ago
0
I am not able to reproduce it. May be I am missing some details. Can you please give me the exact steps & set up you followed?
#5695
faraon-bot
closed
2 days ago
0
Update version.txt after 0.14.4 release
#5694
mrwyattii
closed
5 days ago
0
[BUG] Deepspeed does not seem to work using GPUDirect, should it?
#5693
m-harmonic
closed
1 hour ago
3
[BUG] Regression: 0.14.3 causes grad_norm to be zero
#5692
rosario-purple
opened
5 days ago
1
sequence parallel with communication overlap
#5691
inkcherry
opened
5 days ago
0
[ERROR] [launch.py:321:sigkill_handler] exits with return code = -11
#5690
shag1802
opened
5 days ago
0
Running out of CPU memory. Dataset is loaded for each created process
#5689
MikeMitsios
opened
5 days ago
0
ENV var added for recaching in INF Unit tests
#5688
raza-sikander
opened
6 days ago
1
inference unit test injectionPolicy split world_size to multiple tests
#5687
oelayan7
opened
6 days ago
2
Update BUFSIZE to come from autotuner's constants.py, not numpy
#5686
loadams
closed
6 days ago
0
[BUG] inference ValueError
#5685
zxrneu
opened
1 week ago
0
Switch from torch.cuda.amp.custom_fwd to torch.amp.custom_fwd(device=...)
#5684
loadams
opened
1 week ago
0
Fix memory leak
#5683
chiragjn
closed
1 week ago
0
[BUG] Logs full of FutureWarning when training with nightly PyTorch
#5682
rosario-purple
opened
1 week ago
1
Bug fix for the "Link bit16 and fp32 parameters in partition"
#5681
U-rara
opened
1 week ago
1
Fix numpy upgrade to 2.0.0 BUFSIZE import error
#5680
Yejing-Lai
closed
1 week ago
2
Bug Report: Issues Building DeepSpeed on Windows
#5679
Moemu
closed
1 week ago
4
[BUG] Using and Building DeepSpeedCPUAdam
#5677
oabuhamdan
opened
1 week ago
18
Switch what versions of python are supported
#5676
loadams
opened
1 week ago
0
Update elastic_agent.py. Delete _get_socket_with_port import
#5675
QiaoZhennn
closed
1 week ago
3
[BUG] GPU memory leaking after deleting deepspeed engine
#5674
kfertakis
closed
10 hours ago
2
Updates needed for NumPy 2+
#5673
loadams
closed
1 week ago
0
Pin numpy to a pre-2.0.0 version
#5672
loadams
closed
1 week ago
0
[BUG] DeepSpeed on pypi not compatible with latest `numpy`
#5671
younesbelkada
closed
1 week ago
5
[XPU] adapt lazy_call func to different versions
#5670
YizhouZ
closed
1 week ago
0
[CPU] add fp16 support to shm inference_all_reduce
#5669
delock
opened
1 week ago
4
Update xpu-max1100.yml with new config and add some tests
#5668
Liangliang-Ma
opened
1 week ago
0
Getting parameters of embeddings (safe_get_local_fp32_param)and setting the weight of embeddings (safe_set_local_fp32_param) does not work (bug?).
#5667
Git-Shaw
opened
1 week ago
0
fix IDEX dependence in xpu accelerator
#5666
Liangliang-Ma
closed
1 week ago
1
How to set different learning rates for different parameters of LLMs
#5665
jpWang
closed
1 week ago
0
Fixing the reshape bug in sequence parallel alltoall, which corrupted all QKV data
#5664
YJHMITWEB
closed
1 week ago
1
AssertionError: Unable to pre-compile ops without torch installed. Please install torch before attempting to pre-compile ops.
#5663
nitinmukesh
opened
1 week ago
3
[BUG] 'Invalidate trace cache' with Seq2SeqTrainer+predict_with_generate+Zero3
#5662
Osterlohe
opened
1 week ago
0
does DeepSpeed support AMSP (a new DP shard strategy)
#5661
guoyejun
opened
1 week ago
0
Fail to use zero_init to construct llama2 with deepspeed zero3 and bnb!
#5660
CHNRyan
opened
1 week ago
0
RuntimeError: Error building extension 'cpu_adam', because /usr/bin/ld: can not find -lcurand,help!
#5659
hekaijie123
opened
1 week ago
1
Add and Remove ZeRO 3 Hooks
#5658
jomayeri
opened
1 week ago
1
[BUG] Running llama2-7b step3 with tensor parallel and HE fails due to incompatible shapes
#5656
ShellyNR
opened
1 week ago
0
[BUG] Model hangs at trainer.train() and never starts training
#5655
limllzu
closed
1 week ago
0
Fix latest pytorch '_get_socket_with_port' import error
#5654
Yejing-Lai
closed
1 week ago
6
[BUG] oneapi/ccl.hpp: No such file or directory.
#5653
weiji14
opened
1 week ago
1
Fix hpZ with zero element
#5652
samadejacobs
closed
1 week ago
0
Update version.txt after 0.14.3 release
#5651
mrwyattii
closed
1 week ago
0
Unpin transformers version
#5650
loadams
opened
2 weeks ago
0
Install issue with setuptools 70
#5649
myBigbug
closed
1 week ago
2
RuntimeError: still have inflight params[BUG]
#5648
iszengxin
opened
2 weeks ago
1
Inference with the MoE based GPT model trained by ds_pretrain_gpt_345M_MoE128.sh [BUG]
#5647
haoranlll
opened
2 weeks ago
0