microsoft DeepSpeed issues

microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

https://www.deepspeed.ai/

Apache License 2.0

33.6k stars 3.94k forks

issues

[BUG] Running llama2-7b step3 with tensor parallel and HE fails due to incompatible shapes

#5656 ShellyNR opened 2 weeks ago
0
[BUG] Model hangs at trainer.train() and never starts training

#5655 limllzu closed 2 weeks ago
0
Fix latest pytorch '_get_socket_with_port' import error

#5654 Yejing-Lai closed 1 week ago
6
[BUG] oneapi/ccl.hpp: No such file or directory.

#5653 weiji14 opened 2 weeks ago
1
Fix hpZ with zero element

#5652 samadejacobs closed 1 week ago
0
Update version.txt after 0.14.3 release

#5651 mrwyattii closed 2 weeks ago
0
Unpin transformers version

#5650 loadams opened 2 weeks ago
0
Install issue with setuptools 70

#5649 myBigbug closed 2 weeks ago
2
RuntimeError: still have inflight params[BUG]

#5648 iszengxin opened 2 weeks ago
1
Inference with the MoE based GPT model trained by ds_pretrain_gpt_345M_MoE128.sh [BUG]

#5647 haoranlll opened 2 weeks ago
0
[BUG] File not found in autotuner cache in multi-node setting on SLURM

#5646 jubueche opened 2 weeks ago
1
Why doesn't deepspeed stage 3 allow a batch size of 1 with multiple GPUs?

#5645 AceMcAwesome77 opened 2 weeks ago
0
[BUG] RuntimeError encountered when generating tokens from a Meta-Llama-3-8B-Instruct model initialized with 4-bit or 8-bit quantization

#5644 Atry opened 2 weeks ago
2
Fix memory leak from _hp_mapping

#5643 chiragjn closed 2 days ago
1
[BUG] 1 line logic issue: flipped sign/direction in `_partition_param_sec` of `partition_parameters.py`?

#5642 dukleryoni closed 1 week ago
1
[BUG] tortoise_tts.py fails on deepspeed/pydantic error

#5641 tholonia opened 2 weeks ago
1
Does deepspeed support aarch64?

#5640 khayamgondal opened 2 weeks ago
6
[HELP] How to safely switch trainable parameters in ZeRO-3 stage?

#5639 Ledzy closed 1 week ago
2
Install errors on Windows

#5638 xalteropsx closed 2 weeks ago
5
Deepspeed zero3 + qlora arise problem! Params didn't sharded first before load to each GPU!

#5637 CHNRyan opened 2 weeks ago
0
[BUG] 4-bit quantized models would repeatedly generate the same tokens when bf16.enabled is true

#5636 Atry opened 2 weeks ago
1
Deepspeed stage 3 hanging after 1st validation sample

#5635 AceMcAwesome77 opened 2 weeks ago
0
[BUG] RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

#5634 fahadh4ilyas opened 2 weeks ago
1
Monitor was always enabled causing performance degradation

#5633 deepcharm closed 2 weeks ago
2
stage_1_and_2: optimize clip calculation to use clamp

#5632 nelyahu closed 2 weeks ago
0
[BUG] is_zero_init_model is always False when I'm using zero_init!

#5631 CHNRyan opened 3 weeks ago
4
[BUG] RuntimeError encountered when generating tokens from a DeepSpeedHybridEngine initialized with 4-bit quantization.

#5630 Atry opened 3 weeks ago
2
Pin transformers version for MII tests

#5629 loadams closed 3 weeks ago
0
Pin accelerate version to 0.30.1

#5628 loadams closed 3 weeks ago
0
[BUG] 1: error: must run as root and 2: raise RuntimeError("Ninja is required to load C++ extensions")

#5627 YangBrooksHan opened 3 weeks ago
0
reduce all-to-all communication volume when both expert and non-expert are tensor-parallel

#5626 taozhiwei opened 3 weeks ago
15
Hybrid Offloading for ZeRO3

#5625 tohtana opened 3 weeks ago
0
fix: quantization with DeepSpeed HE

#5624 Atry opened 3 weeks ago
2
[BUG] RuntimeError: Error building extension 'fused_adam' Loading extension module fused_adam

#5623 JinQiangWang2021 opened 3 weeks ago
0
Updated hpu-gaudi2 tests content.

#5622 vshekhawat-hlab closed 3 weeks ago
1
Test just updating HPU docker image

#5621 loadams closed 3 weeks ago
1
[REQUEST] Moving a trainable model with an optimiser between GPU and CPU

#5620 kfertakis opened 3 weeks ago
0
[BUG] Pipeline Dataloader Sampler: `shuffle=False`

#5619 Coobiw opened 3 weeks ago
0
[BUG] ZeRO optimizer with MoE Expert Parallelism

#5618 Jack47 opened 3 weeks ago
1
[HELP] ZeRO3 partition parameters after fully load to each GPU!

#5617 CHNRyan closed 1 week ago
7
nv-ds-chat CI test failure

#5616 github-actions[bot] opened 3 weeks ago
0
Reset Optimizer

#5615 ahorazahedi closed 3 weeks ago
1
Add support for Phi-3 small to FastGen

#5614 adk9 opened 3 weeks ago
0
fixes in _partition_param_sec function

#5613 mmhab closed 2 weeks ago
0
[INF] Enable torch compile for inference

#5612 oelayan7 opened 3 weeks ago
5
Add compile backend arg for test_set_compiler_fn

#5611 vshekhawat-hlab closed 3 weeks ago
2
Upgrade HPU image to v1.16.2.

#5610 vshekhawat-hlab opened 3 weeks ago
0
Fixed Windows inference build.

#5609 costin-eseanu closed 4 days ago
0
Add an argument to enable the injection of missing state during the conversion of universal checkpoints

#5608 xylian86 closed 2 days ago
0
[REQUEST] Upstream modifications of PaRO

#5607 youshaox opened 3 weeks ago
0
