microsoft DeepSpeedExamples issues

microsoft / DeepSpeedExamples

Example models using DeepSpeed

Apache License 2.0

6.1k stars 1.04k forks

issues

How can I change the master_port when using deepspeed for multi-GPU on single node, i.e. localhost

#936 lovedoubledan opened 2 days ago
3
RuntimeError: CUDA error: no kernel image is available for execution on the device

#935 mrpeerat closed 1 week ago
1
No module named 'transformers.deepspeed'

#934 TianyuJIAA closed 3 weeks ago
2
Fixed mistake in readme

#933 SCheekati closed 3 weeks ago
0
Does DeepSpeed's Pipeline-Parallelism optimizer support skip connections?

#932 RoyMahlab opened 1 month ago
0
[cifar ds training]: Set cuda device during initialization of distributed backend.

#931 jagadish-amd closed 3 weeks ago
3
Enable reward model offloading option

#930 kfertakis closed 3 weeks ago
2
Deepspeed-Domino

#929 zhangsmallshark closed 2 weeks ago
3
After using steps 1, 2, and 3, the test reply content only replies Assistant: </s>.

#928 jianmomo closed 2 months ago
0
Remove the fixed `eot_token` mechanism for SFT

#927 Xingfu-Yi closed 3 weeks ago
2
Update requirements for opencv-python CVE

#925 loadams closed 2 months ago
0
AttributeError: 'DeepSpeedEngine' object has no attribute 'model'

#924 lovychen closed 3 weeks ago
1
How to calculate training efficiency, i.e. tokens/sec, of step 1 fine-tuning of the llama2 model?

#923 sowmya04101998 opened 2 months ago
0
Actor loss nan and Resizing model embedding

#922 ouyanmei opened 2 months ago
1
DeepNVMe ZeRO-inf Tutorial

#921 jomayeri closed 2 months ago
0
FileNotFoundError: [Errno 2] No such file or directory: 'numactl'

#920 zhiwentian closed 2 weeks ago
6
DeepNVMe README.md add xref

#919 stas00 closed 3 months ago
0
Update README.md

#916 keshavkowshik closed 3 months ago
0
step2 without any response for a long time

#915 asfadfaf opened 3 months ago
0
DeepNVMe example scripts

#914 tjruwase closed 3 months ago
0
Add openai client to deepspeedometer

#913 delock closed 3 months ago
2
Different zero stage the training memory compute

#912 Arcmoon-Hu opened 4 months ago
0
nvcc fatal : Unsupported gpu architecture 'compute_86' and nvcc fatal : Value 'c++17' is not defined for option 'std'

#911 Xccanxin closed 4 months ago
1
How to start deepspeed automatically?

#910 qwerfdsadad closed 2 months ago
2
Consult the first phase.

#909 csxrzhang closed 3 months ago
2
an error with gradient checkpointing in DeepspeedChat step 3

#908 wangyuwen1999 opened 4 months ago
0
Error when using a Qwen model as the Actor Model in step 3 of RLHF on a single node with multiple GPUs

#907 Dakai798 opened 5 months ago
1
DeepSpeed-Chat step-1 hanging for a long time

#906 lemon-little opened 5 months ago
0
Enable cpu/xpu support for the benchmarking suite

#905 louie-tsai closed 3 months ago
8
CPU OOM when inferencing Llama3-70B-Chinese-Chat

#904 GORGEOUSLCX opened 6 months ago
0
cannot pickle 'Stream' object

#903 teis-e opened 6 months ago
0
Cannot run test-gpt.sh because of an AssertionError

#902 leachee99 opened 6 months ago
0
Does FastGen support long-text and sequence-parallel inference?

#901 AceCoder0 opened 6 months ago
0
Add --client-only arg to mii benchmark

#900 delock closed 5 months ago
0
Refactored LLM benchmark code

#899 mrwyattii closed 4 months ago
0
fix bug with queue.empty not being reliable

#898 mrwyattii closed 6 months ago
0
Update tokens_per_sec calculation to work w/ stream and non-stream cases

#897 lekurile closed 6 months ago
0
run-example.sh fails with urllib3.exceptions.ProtocolError: Response ended prematurely

#896 awan-10 closed 4 months ago
11
updating tokens per second to include the token count of generated tokens.

#895 guptha23 closed 6 months ago
0
[Error] AutoTune: `connect to host localhost port 22: Connection refused`

#894 wqw547243068 opened 7 months ago
0
How to use deepspeed for multi-node and multi-card task in slurm cluster

#893 dshwei opened 7 months ago
0
Does Zero-Inference support TP?

#892 preminstrel opened 7 months ago
11
extend max_prompt_length and input text for 128k evaluation

#891 HeyangQin closed 2 months ago
0
Does DeepSpeed support fine-tuning an extra model with LoRA?

#890 wanghongqu opened 7 months ago
1
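The Python environment paths differ across machines; after launching, deepspeed cannot find the Python environment on the other machines. How can this be resolved?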

#889 liqwertyu closed 2 months ago
0
when calculating actor loss, why the mask is "action_mask[:, start: ] "

#888 fancghit closed 7 months ago
0
The actor constantly generates ['</s>'] or ['<|endoftext|></s>'] after 200 steps in RLHF with hybrid engine disabled

#887 mousewu opened 7 months ago
1
About multiple-thread attention computation on CPU using zero-inference example.

#886 luckyq opened 7 months ago
0
Suggested GPU to run the demo code of step2_reward_model_finetuning (DeepSpeed-Chat)

#885 wenbozhangjs opened 7 months ago
0
[REQUEST] More fine-grained distributed strategies for RLHF training

#884 youshaox opened 7 months ago
0