microsoft/Megatron-DeepSpeed
Ongoing research training transformer language models at scale, including: BERT & GPT-2
1.9k stars · 345 forks
Issues
#452 · fix init issue for silently ignoring the deepspeed config · xylian86 · closed · 1 month ago · 0 comments
#451 · enable profiler for specific ranks · ranzhejiang · closed · 1 month ago · 0 comments
#450 · [Bug] Fix init issue for layer_norm in sequence_parallel for non-CUDA device. · ys950902 · opened · 1 month ago · 2 comments
#449 · Model conversion problem · yuanzhiyong1999 · opened · 1 month ago · 1 comment
#448 · [Bug] Fix init issue for rms_norm in sequence_parallel. · ys950902 · closed · 1 month ago · 1 comment
#447 · Async allreduce for tensor-parallel · drcanchi · opened · 2 months ago · 0 comments
#446 · [TRACKER] Customer support related PR tracker for Intel devices · delock · opened · 2 months ago · 0 comments
#445 · fix moe tflops · ranzhejiang · closed · 1 month ago · 1 comment
#444 · how to calculate the training throughput · bigtree2020 · opened · 2 months ago · 0 comments
#443 · llama3 and llama3.1 support · fmiao2372 · opened · 2 months ago · 1 comment
#442 · [Bug] Missing weight gradients from LinearWithGradAccumulationAndAsyncCommunication when Zero Bubble Pipeline Parallelism is disabled · mksit · opened · 2 months ago · 0 comments
#441 · Adding the new feature of FPDT · YJHMITWEB · opened · 2 months ago · 6 comments
#440 · Optimizer problem when using finetune_llama.sh · Kaiizx · opened · 2 months ago · 3 comments
#439 · zero3: The checkpoint being loaded used a DP world size of 8 but the current world size is 16. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported. · ArtificialZeng · opened · 2 months ago · 0 comments
#438 · [XPU] Enable empty cache on XPU device · ys950902 · closed · 2 months ago · 2 comments
#437 · Why does pretrain_llama_distributed.sh use pretrain_gpt.py? · BrucePeng92 · opened · 3 months ago · 0 comments
#436 · [XPU] Add device check when importing IPEX · ys950902 · closed · 3 months ago · 3 comments
#435 · [bug]: `ipex` install breaks non `xpu` devices · saforem2 · opened · 3 months ago · 2 comments
#434 · [NaN] Fix nan print issue when running Megatron-DeepSpeed with DeepSpeed · ys950902 · closed · 3 months ago · 6 comments
#433 · pass batch_dim_idx to deepspeed sequence parallel distributed attention · YJHMITWEB · closed · 3 months ago · 0 comments
#432 · [LLaMa] Adding support for converting checkpoints from mds to hf · billishyahao · closed · 3 months ago · 0 comments
#431 · [XPU] Support fused_rms_norm on XPU device · ys950902 · closed · 3 months ago · 4 comments
#430 · A tutorial to help you finetune LLama-2-7b using this repository full of garbage code with ZeRO2/3 enabled. · LLMChild · opened · 4 months ago · 1 comment
#429 · Enable Sequence Parallelism · polisettyvarma · closed · 2 months ago · 10 comments
#428 · [Bug] grad_weight can't be NoneType when running with DeepSpeed on Zero3. · ys950902 · closed · 2 months ago · 8 comments
#427 · Update yml to be valid · loadams · closed · 4 months ago · 0 comments
#426 · Add basic workflow to test compilation · loadams · closed · 4 months ago · 0 comments
#425 · Bug: TP=1, pretrain_llama2_distributed failed on H800 gpus! · asr-sheep1 · closed · 3 months ago · 2 comments
#424 · [Finetune] enable converting checkpoints without optimizer state generation · billishyahao · closed · 1 week ago · 0 comments
#423 · Add a basic check for formatting or python compile to Megatron-DeepSpeed · loadams · closed · 4 months ago · 0 comments
#422 · [wandb] disable wandb more gracefully · billishyahao · closed · 2 months ago · 1 comment
#421 · add support to run custom Hf tokenizer for training and dataset pre-processing · polisettyvarma · closed · 4 months ago · 0 comments
#420 · acquire device when required · polisettyvarma · closed · 4 months ago · 0 comments
#419 · improve repeat_kv GQA perf · polisettyvarma · closed · 4 months ago · 0 comments
#418 · Extend test utilities to support more accelerators · xinyu-intel · closed · 4 months ago · 0 comments
#417 · [Bug] Fix crash when logging optimizer state to tensorboard · billishyahao · closed · 2 months ago · 0 comments
#416 · [Wandb] Refine wandb logging function · billishyahao · closed · 4 months ago · 1 comment
#415 · support split qkv linear and sp overlap comm · inkcherry · opened · 4 months ago · 6 comments
#414 · add PyTorch profiler support · polisettyvarma · closed · 4 months ago · 0 comments
#413 · fix --use-cpu-initialization error when expert is not tensor-parallel · taozhiwei · opened · 4 months ago · 3 comments
#412 · add kill switch file support to gracefully exit training at runtime · polisettyvarma · closed · 4 months ago · 4 comments
#411 · improve performance by keeping attention_mask on device and run ops further on device · polisettyvarma · closed · 4 months ago · 0 comments
#410 · Improve RoPE perf by using cached sin/cos tensors · polisettyvarma · closed · 4 months ago · 2 comments
#409 · use split/squeeze instead of slice for performance · polisettyvarma · closed · 4 months ago · 2 comments
#408 · Set proper arguments when constructing models in unit tests · xinyu-intel · closed · 4 months ago · 3 comments
#407 · Fixed missing BookCorpus dataset in the sequence parallelism example. · costin-eseanu · closed · 4 months ago · 0 comments
#406 · fix the flash_attn import bug and the wrong gather index when using flash_attn_cuda in sequence parallel · YJHMITWEB · closed · 3 months ago · 0 comments
#405 · How to resume training between GPTModel() checkpoint and GPTModelPipe() checkpoint? · tiggerwu · opened · 4 months ago · 0 comments
#404 · Fix test_deallocate_output_tensor · xinyu-intel · closed · 4 months ago · 1 comment
#403 · Fix ParallelMLP and enable accelerator test · xinyu-intel · closed · 4 months ago · 1 comment