I managed to create an InferenceEngine by adding some configs, but another problem occurs when running its forward pass. The following is the revised pretrain_gpt.py:
import torch
import deepspeed

from megatron import get_args, initialize_megatron
from megatron.training import get_model, build_train_valid_test_data_iterators

# model_provider, forward_step, and train_valid_test_datasets_provider are
# the functions already defined in pretrain_gpt.py.
initialize_megatron(extra_args_provider=None,
                    args_defaults={'tokenizer_type': 'GPT2BPETokenizer'})
model = get_model(model_provider)
args = get_args()

# Wrap the model in a DeepSpeed InferenceEngine instead of a training engine.
model_engine = deepspeed.init_inference(
    model[0],
    moe_experts=args.num_experts,
    replace_with_kernel_inject=True,
    dtype=torch.half if args.fp16 else None,
    moe=True,
)
model = model_engine.module

args.iteration = 0
train_data_iterator, valid_data_iterator, test_data_iterator = \
    build_train_valid_test_data_iterators(train_valid_test_datasets_provider)
forward_step(test_data_iterator, model, None)
This leads to another error on both ranks:
Traceback (most recent call last):
  File "/home/ubuntu/frameworks/Megatron-DeepSpeed/examples/MoE/../../pretrain_gpt.py", line 391, in <module>
    forward_step(test_data_iterator, model, None)
  File "/home/ubuntu/frameworks/Megatron-DeepSpeed/examples/MoE/../../pretrain_gpt.py", line 211, in forward_step
    output_tensor, *other_losses = model(tokens, position_ids, attention_mask,
  File "/home/ubuntu/miniconda3/envs/tutel/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1148, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/ubuntu/frameworks/Megatron-DeepSpeed/megatron/model/gpt_model.py", line 120, in forward
    lm_output, *moe_losses = self.language_model(
  File "/home/ubuntu/miniconda3/envs/tutel/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/frameworks/Megatron-DeepSpeed/megatron/model/language_model.py", line 389, in forward
    encoder_output, *moe_losses = self.encoder(encoder_input,
  File "/home/ubuntu/miniconda3/envs/tutel/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/frameworks/Megatron-DeepSpeed/megatron/model/transformer.py", line 769, in forward
    hidden_states, moe_losses = self._checkpointed_forward(hidden_states,
  File "/home/ubuntu/frameworks/Megatron-DeepSpeed/megatron/model/transformer.py", line 719, in _checkpointed_forward
    hidden_states, *local_moe_losses = mpu.checkpoint(
  File "/home/ubuntu/miniconda3/envs/tutel/lib/python3.9/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 748, in checkpoint
    CheckpointFunction.apply(function, all_outputs, *args)
  File "/home/ubuntu/miniconda3/envs/tutel/lib/python3.9/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 582, in forward
    outputs = run_function(*inputs_cuda)
  File "/home/ubuntu/frameworks/Megatron-DeepSpeed/megatron/model/transformer.py", line 709, in custom_forward
    x_, moe_loss = layer(x_, attention_mask, encoder_output, enc_dec_attn_mask)
ValueError: not enough values to unpack (expected 2, got 1)
[2022-08-07 09:25:24,352] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 484730
[2022-08-07 09:25:24,360] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 484731
[2022-08-07 09:25:24,360] [ERROR] [launch.py:292:sigkill_handler] ['/home/ubuntu/miniconda3/envs/tutel/bin/python3.9', '-u', '/home/ubuntu/frameworks/Megatron-DeepSpeed/examples/MoE/../../pretrain_gpt.py', '--local_rank=1', '--override-lr-scheduler', '--adam-beta1', '0.9', '--adam-beta2', '0.95', '--tensor-model-parallel-size', '1', '--moe-expert-parallel-size', '2', '--num-experts', '2', '--moe-loss-coeff', '0.01', '--moe-train-capacity-factor', '1.0', '--moe-eval-capacity-factor', '1.0', '--moe-min-capacity', '4', '--init-method-std', '0.014', '--lr-decay-tokens', '300000000000', '--lr-warmup-tokens', '375000000', '--micro-batch-size', '8', '--exit-duration-in-mins', '5', '--global-batch-size', '256', '--num-layers', '12', '--hidden-size', '768', '--num-attention-heads', '12', '--seq-length', '2048', '--max-position-embeddings', '2048', '--train-tokens', '300000000000', '--train-iters', '0', '--lr', '1.2e-4', '--min-lr', '1.0e-6', '--lr-decay-style', 'cosine', '--split', '94,3,3', '--log-interval', '10', '--eval-interval', '100', '--eval-iters', '10', '--save-interval', '10', '--weight-decay', '0.1', '--clip-grad', '1.0', '--hysteresis', '2', '--num-workers', '0', '--fp16', '--load', '/home/ubuntu/frameworks/Megatron-DeepSpeed/examples/MoE/output/checkpoint/gpt-0.125B-lr-1.2e-4-minlr-1.0e-6-bs-256-gpus-2-mp-1-pp-1-ep-2-mlc-0.01-cap-1.0-drop-true', '--save', '/home/ubuntu/frameworks/Megatron-DeepSpeed/examples/MoE/output/checkpoint/gpt-0.125B-lr-1.2e-4-minlr-1.0e-6-bs-256-gpus-2-mp-1-pp-1-ep-2-mlc-0.01-cap-1.0-drop-true', '--tensorboard-queue-size', '1', '--log-timers-to-tensorboard', '--log-batch-size-to-tensorboard', '--log-validation-ppl-to-tensorboard', '--tensorboard-dir', '/home/ubuntu/frameworks/Megatron-DeepSpeed/examples/MoE/output/tensorboard/gpt-0.125B-lr-1.2e-4-minlr-1.0e-6-bs-256-gpus-2-mp-1-pp-1-ep-2-mlc-0.01-cap-1.0-drop-true_f09_2022.08.07-09.25.13', '--inference', '--checkpoint-activations', '--create-moe-param-group', '--vocab-file', '/home/ubuntu/frameworks/Megatron-DeepSpeed/gpt2-vocab.json', '--merge-file', '/home/ubuntu/frameworks/Megatron-DeepSpeed/gpt2-merges.txt', '--data-path', '/home/ubuntu/frameworks/Megatron-DeepSpeed/dataset/BookCorpusDataset/BookCorpusDataset_text_document', '--data-impl', 'mmap', '--deepspeed', '--deepspeed_config', 'ds_config_gpt_gpt-0.125B-lr-1.2e-4-minlr-1.0e-6-bs-256-gpus-2-mp-1-pp-1-ep-2-mlc-0.01-cap-1.0-drop-true.json', '--pipeline-model-parallel-size', '1', '--no-pipeline-parallel', '--deepspeed-activation-checkpointing'] exits with return code = 1
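I suspect the unpacking failure at transformer.py line 709 means the layer now returns a single value where Megatron's MoE transformer layer normally returns a (hidden_states, moe_loss) pair, presumably because kernel injection replaced the layer. A minimal sketch of the mismatch with toy classes (not the actual Megatron/DeepSpeed modules):

import torch

# Toy stand-ins, not the actual Megatron/DeepSpeed classes.
class StockMoELayer(torch.nn.Module):
    def forward(self, x):
        # Megatron's MoE transformer layer yields (hidden_states, moe_loss).
        return x, torch.tensor(0.0)

class InjectedLayer(torch.nn.Module):
    def forward(self, x):
        # Hypothetical injected replacement that yields only one value.
        return (x,)

x = torch.randn(2, 4)
h, moe_loss = StockMoELayer()(x)   # OK: two values to unpack
h, moe_loss = InjectedLayer()(x)   # ValueError: not enough values to unpack (expected 2, got 1)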
@Gabriel4256 -- please look at this example: https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/generate_text.sh
For inference, we don't use pretrain_gpt.py as the entry point. Please try the text generation scenario above, which uses DeepSpeed inference. If you run into issues with that, please share them with us.
@awan-10 Thanks for the comment. Unfortunately, I've already tried the example you shared and found it didn't work (https://github.com/microsoft/DeepSpeed/issues/2030#issuecomment-1193909540).
I've also tried this on a machine with 8 × V100 32GB, but it failed with almost the same error. Does the script only run on A100s?
Closing to move the discussion to #2030; please re-open if the core issue here is not covered in the other issue.
Describe the bug
When I use deepspeed.init_inference for the Megatron GPT-3 MoE model in the Megatron repo, an error occurs. There is no problem when I use deepspeed.initialize instead, as done in the training stage, but init_inference seems to be the proper function for inference. If there is no difference in performance at all, I will just use deepspeed.initialize.
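For reference, this is my understanding of the two entry points, as a minimal sketch on a toy module (placeholder config values; assumes a CUDA device and the deepspeed launcher, e.g. `deepspeed script.py`):

import torch
import deepspeed

model = torch.nn.Linear(8, 8)  # toy stand-in for the Megatron model

# Training path: deepspeed.initialize returns a DeepSpeedEngine that owns the
# optimizer, LR schedule, and ZeRO/fp16 machinery.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config={
        "train_batch_size": 1,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    },
)

# Inference path: deepspeed.init_inference returns an InferenceEngine that only
# optimizes the forward pass (optionally replacing layers with fused kernels).
inference_engine = deepspeed.init_inference(model, dtype=torch.float)
output = inference_engine(torch.randn(1, 8).cuda())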
To Reproduce
Steps to reproduce the behavior:
I used ds_pretrain_gpt_1.3B_MoE128.sh and pretrain_gpt.py with some modifications.
Contents of ds_pretrain_gpt_1.3B_MoE128.sh (I used GPT-3 small 125M and changed some GPU settings):
Contents of the main function in pretrain_gpt.py:
With these files, I executed:
Expected behavior
I expect an InferenceEngine for the model to be created successfully with init_inference.

ds_report output
Screenshots
Following is the error message:

System info (please complete the following information):

Launcher context
I tested with the modified files and the command written above.