microsoft/Megatron-DeepSpeed
Ongoing research training transformer language models at scale, including: BERT & GPT-2
1.9k stars · 345 forks
Issues
#452 · fix init issue for silently ignoring the deepspeed config · xylian86 · closed · 1 month ago · 0 comments
#451 · enable profiler for specific ranks · ranzhejiang · closed · 1 month ago · 0 comments
#450 · [Bug] Fix init issue for layer_norm in sequence_parallel for non-CUDA device. · ys950902 · opened · 1 month ago · 2 comments
#449 · Model conversion problem · yuanzhiyong1999 · opened · 1 month ago · 1 comment
#448 · [Bug] Fix init issue for rms_norm in sequence_parallel. · ys950902 · closed · 1 month ago · 1 comment
#447 · Async allreduce for tensor-parallel · drcanchi · opened · 2 months ago · 0 comments
#446 · [TRACKER] Customer support related PR tracker for Intel devices · delock · opened · 2 months ago · 0 comments
#445 · fix moe tflops · ranzhejiang · closed · 1 month ago · 1 comment
#444 · how to calculate the training throughput · bigtree2020 · opened · 2 months ago · 0 comments
#443 · llama3 and llama3.1 support · fmiao2372 · opened · 2 months ago · 1 comment
#442 · [Bug] Missing weight gradients from LinearWithGradAccumulationAndAsyncCommunication when Zero Bubble Pipeline Parallelism is disabled · mksit · opened · 2 months ago · 0 comments
#441 · Adding the new feature of FPDT · YJHMITWEB · opened · 2 months ago · 6 comments
#440 · Optimizer problem when using finetune_llama.sh · Kaiizx · opened · 2 months ago · 3 comments
#439 · zero3: The checkpoint being loaded used a DP world size of 8 but the current world size is 16. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported. · ArtificialZeng · opened · 2 months ago · 0 comments
#438 · [XPU] Enable empty cache on XPU device · ys950902 · closed · 2 months ago · 2 comments
#437 · Why does pretrain_llama_distributed.sh use pretrain_gpt.py? · BrucePeng92 · opened · 3 months ago · 0 comments
#436 · [XPU] Add device check when importing IPEX · ys950902 · closed · 3 months ago · 3 comments
#435 · [bug]: `ipex` install breaks non `xpu` devices · saforem2 · opened · 3 months ago · 2 comments
#434 · [NaN] Fix nan print issue when running Megatron-DeepSpeed with DeepSpeed · ys950902 · closed · 3 months ago · 6 comments
#433 · pass batch_dim_idx to deepspeed sequence parallel distributed attention · YJHMITWEB · closed · 3 months ago · 0 comments
#432 · [LLaMa] Adding support for converting checkpoints from mds to hf · billishyahao · closed · 3 months ago · 0 comments
#431 · [XPU] Support fused_rms_norm on XPU device · ys950902 · closed · 3 months ago · 4 comments
#430 · A tutorial to help you finetune LLama-2-7b using this repository full of garbage code with ZeRO2/3 enabled. · LLMChild · opened · 4 months ago · 1 comment
#429 · Enable Sequence Parallelism · polisettyvarma · closed · 2 months ago · 10 comments
#428 · [Bug] grad_weight can't be NoneType when running with DeepSpeed on Zero3. · ys950902 · closed · 2 months ago · 8 comments
#427 · Update yml to be valid · loadams · closed · 4 months ago · 0 comments
#426 · Add basic workflow to test compilation · loadams · closed · 4 months ago · 0 comments
#425 · Bug: TP=1, pretrain_llama2_distributed failed on H800 gpus! · asr-sheep1 · closed · 3 months ago · 2 comments
#424 · [Finetune] enable converting checkpoints without optimizer state generation · billishyahao · closed · 1 week ago · 0 comments
#423 · Add a basic check for formatting or python compile to Megatron-DeepSpeed · loadams · closed · 4 months ago · 0 comments
#422 · [wandb] disable wandb more gracefully · billishyahao · closed · 2 months ago · 1 comment
#421 · add support to run custom Hf tokenizer for training and dataset pre-processing · polisettyvarma · closed · 4 months ago · 0 comments
#420 · acquire device when required · polisettyvarma · closed · 4 months ago · 0 comments
#419 · improve repeat_kv GQA perf · polisettyvarma · closed · 4 months ago · 0 comments
#418 · Extend test utilities to support more accelerators · xinyu-intel · closed · 4 months ago · 0 comments
#417 · [Bug] Fix crash when logging optimizer state to tensorboard · billishyahao · closed · 2 months ago · 0 comments
#416 · [Wandb] Refine wandb logging function · billishyahao · closed · 4 months ago · 1 comment
#415 · support split qkv linear and sp overlap comm · inkcherry · opened · 4 months ago · 6 comments
#414 · add PyTorch profiler support · polisettyvarma · closed · 4 months ago · 0 comments
#413 · fix --use-cpu-initialization error when expert is not tensor-parallel · taozhiwei · opened · 4 months ago · 3 comments
#412 · add kill switch file support to gracefully exit training at runtime · polisettyvarma · closed · 4 months ago · 4 comments
#411 · improve performance by keeping attention_mask on device and run ops further on device · polisettyvarma · closed · 4 months ago · 0 comments
#410 · Improve RoPE perf by using cached sin/cos tensors · polisettyvarma · closed · 4 months ago · 2 comments
#409 · use split/squeeze instead of slice for performance · polisettyvarma · closed · 4 months ago · 2 comments
#408 · Set proper arguments when constructing models in unit tests · xinyu-intel · closed · 4 months ago · 3 comments
#407 · Fixed missing BookCorpus dataset in the sequence parallelism example. · costin-eseanu · closed · 4 months ago · 0 comments
#406 · fix the flash_attn import bug and the wrong gather index when using flash_attn_cuda in sequence parallel · YJHMITWEB · closed · 3 months ago · 0 comments
#405 · How to resume training between GPTModel() checkpoint and GPTModelPipe() checkpoint? · tiggerwu · opened · 4 months ago · 0 comments
#404 · Fix test_deallocate_output_tensor · xinyu-intel · closed · 4 months ago · 1 comment
#403 · Fix ParallelMLP and enable accelerator test · xinyu-intel · closed · 4 months ago · 1 comment