microsoft/Megatron-DeepSpeed
Ongoing research training transformer language models at scale, including: BERT & GPT-2
Other
1.9k stars, 345 forks
issues
Alcf update readme
#402
saforem2
closed
5 months ago
1
Fix ParallelMLP and enable accelerator test
#401
xinyu-intel
closed
5 months ago
1
Fix test_deallocate_output_tensor
#400
xinyu-intel
closed
5 months ago
1
fix NAN loss of rope long context training
#399
inkcherry
opened
5 months ago
1
MOE TFLOPS calculation
#398
yingzhao27
opened
5 months ago
0
why moe can not use zero3
#397
kuangdao
opened
5 months ago
0
Add Zero Bubble Pipeline Parallelism H1 Schedule
#396
nvmdava
closed
4 months ago
6
update universal_checkpointing/README.md
#395
inkcherry
closed
4 months ago
2
convert mds checkpoint to Hf Llama model
#394
vksastry
opened
5 months ago
1
Convert to iteration based training supported by pretraining scripts
#393
zainsarwar865
closed
5 months ago
0
ds-sequence-parallel(ulysses) for rope.
#392
inkcherry
opened
5 months ago
0
Update/add GPT/Llama universal checkpointing scripts
#391
lekurile
closed
3 months ago
1
Fix trace output path
#390
saforem2
closed
6 months ago
1
Inquiry on Sequence Parallel Support for VocabParallelEmbedding
#389
qinxiangyujiayou
opened
6 months ago
0
add HFTokenizer option for preprocess_data
#388
Jianhong-Zhang
opened
6 months ago
0
about the optimizer param group
#387
L-hongbin
opened
6 months ago
0
DeepSpeed is a mountain of unmaintainable code
#386
ControllableGeneration
opened
6 months ago
3
Sequence Parallel is incompatible with Rotary Positional Embedding
#385
anogkongda
opened
6 months ago
4
Spurious all gather performance drop.
#384
etiennemlb
opened
6 months ago
0
Add steps and results for running ZeRO stage 3 with universal checkpoint
#383
xylian86
closed
4 months ago
1
Merge `alcf-tests` into `main`
#382
saforem2
closed
7 months ago
1
Call for Conversion from Huggingface to Megads with MoE
#381
ControllableGeneration
opened
7 months ago
0
Expert deepcopy raises PickleError
#380
sxontheway
opened
7 months ago
0
AttributeError: 'Namespace' object has no attribute 'deepspeed_config_dict'. Did you mean: 'deepspeed_config'? && batch = next(self.data_iterator)
#379
hi20240217
opened
7 months ago
2
Add layer norm weight plus 1
#378
Yejing-Lai
opened
7 months ago
1
Assertion failure when there are more than 255 tokenized data files (assert num_datasets < 255 in blendable_dataset.py)
#377
Jeronymous
opened
7 months ago
0
Fix ConstantGradScaler and loss-scale argument not match
#376
BeingGod
opened
7 months ago
1
Support Llama2Tokenizer
#375
jinyouzhi
opened
7 months ago
0
get distributed backend name via accelerator and check loss_scale before writing to tb
#374
polisettyvarma
closed
6 months ago
0
Support MoE for GPTModelPipe
#373
mosheisland
closed
7 months ago
5
remove contiguous copy for flash-attn opbuilder
#372
YizhouZ
closed
7 months ago
7
fix TFLOPs calculation
#371
polisettyvarma
closed
3 months ago
4
collect grad_norm for non pipeline path
#370
inkcherry
opened
8 months ago
0
Pipeline parallelism + CPU offload?
#369
webber26232
opened
8 months ago
0
Fix the error issue for DP on Megatron-DeepSpeed
#368
ys950902
closed
7 months ago
2
[BUG] Problems with Mixture-of-Experts (MoE)
#367
nikit-srivastava
opened
8 months ago
1
[REQUEST] Could you add a new release version tag to Megatron-Deepspeed? Thanks
#366
hijeffwu
closed
8 months ago
2
Mistral
#365
Kosei1227
closed
8 months ago
0
Bugs in GPT2 Inference Example
#364
JianzheXiao
opened
8 months ago
3
Add Parallel Attention mechanism of Mistral
#363
Kosei1227
closed
8 months ago
3
MOE: Support disable top2 2nd expert sampling
#362
mosheisland
closed
8 months ago
0
Support universal checkpoint for GPTModel
#361
mosheisland
closed
8 months ago
0
Fine-tune llama2 with sequence parallelism
#360
AnirudhVIyer
opened
8 months ago
3
Problem in hf2megads_weight_converter.py
#359
noob-ctrl
opened
8 months ago
0
Loss is increasing when fine-tuning from a Megatron-Deepspeed pretrained checkpoint.
#358
SefaZeng
opened
8 months ago
0
Unreasonably low throughput on HGX-H100s
#357
GuanhuaWang
opened
8 months ago
0
FileNotFoundError: [Errno 2] No such file or directory: 'dataset/index-cache/xxx_doc_idx.npy'
#356
GuanhuaWang
opened
8 months ago
6
fix a bug in `pretrain_bert.py`
#355
lzzmm
closed
8 months ago
0
Print total number of params when loading model
#354
nightingal3
closed
9 months ago
1
Updates in `megatron/data/{blendable_dataset.py, gpt_dataset.py, indexed_dataset.py}`
#353
saforem2
closed
9 months ago
1