microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/

[BUG] Support for MoE model inference #1743

Open · flint-stone opened this issue 2 years ago

flint-stone commented 2 years ago

**Describe the bug**
Hi -- I'm trying to build an example to demonstrate the expert parallelism feature as described here. I'm getting an error when initializing the inference engine with the MoE option enabled.

**To Reproduce**
Here's the code that causes the problem:

```python
import torch
import deepspeed
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("/mnt/checkpoint/").cuda()
# Wrap the model in the DeepSpeed inference engine with MoE enabled
ds_engine = deepspeed.init_inference(model, mp_size=1, dtype=torch.half,
                                     replace_method='auto',
                                     replace_with_kernel_inject=True,
                                     moe=True, moe_experts=6, ep_size=6)
model = ds_engine.module

output = model(**inputs)  # `inputs`: a tokenized batch on the GPU
```

**ds_report output**

```
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.11.0a0+b6df043
torch cuda version ............... 11.5
nvcc version ..................... 11.5
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.6.0+bea701a, bea701a, master
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.5
```

**Screenshots**: (screenshot of the error attached in the original issue)

**System info** (please complete the following information):

Legend:

```
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```


 - Python version: Python 3.8.12

**Launcher context**
From scripts

**Docker context**
nvcr.io/nvidia/pytorch:22.01-py3

Is there an example of how to properly use MoE inference with DeepSpeed? Thanks.

RezaYazdaniAminabadi commented 2 years ago

It seems you have a problem building the half-precision kernels. Can you do `export TORCH_CUDA_ARCH_LIST=7.0` and rerun to see if the kernels compile correctly?
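
For reference, the same workaround can be applied from inside Python before DeepSpeed JIT-compiles its ops (a minimal sketch; `7.0` targets V100-class GPUs, so substitute your own GPU's compute capability):

```python
import os

# Must be set before the first DeepSpeed op is JIT-compiled in this process
os.environ["TORCH_CUDA_ARCH_LIST"] = "7.0"  # e.g. "8.0" for A100

import deepspeed  # subsequent JIT builds pick up the arch list
```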

awan-10 commented 2 years ago

@flint-stone -- please see this tutorial for MoE inference: https://www.deepspeed.ai/tutorials/moe-inference-tutorial/

awan-10 commented 2 years ago

@flint-stone I also noticed that you are using a BERT model with MoE. Is this a custom BERT model you modified with the DeepSpeed MoE layer?

DeepSpeed inference will only support the `moe=True` and `moe_experts=n` arguments if you are wrapping an existing DeepSpeed MoE model.
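
For concreteness, here is a minimal sketch (not from this thread) of what "an existing DeepSpeed MoE model" means: a model whose feed-forward block is already a `deepspeed.moe.layer.MoE` layer before `deepspeed.init_inference` is called. The `ToyMoEBlock` module, its sizes, and the layer arguments are illustrative assumptions loosely following the DeepSpeed MoE tutorials:

```python
import torch
import torch.nn as nn
import deepspeed
from deepspeed.moe.layer import MoE

class ToyMoEBlock(nn.Module):
    """Toy block whose MLP is a DeepSpeed MoE layer (illustrative only)."""
    def __init__(self, hidden_size=512, num_experts=6):
        super().__init__()
        # The expert is an ordinary module; the MoE layer replicates it
        expert = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )
        self.moe = MoE(hidden_size=hidden_size, expert=expert,
                       num_experts=num_experts,
                       ep_size=1,  # expert-parallel degree (<= #GPUs)
                       k=1)        # top-1 gating

    def forward(self, hidden_states):
        # MoE.forward returns (output, aux_loss, expert_counts)
        output, _, _ = self.moe(hidden_states)
        return output

# In a real run, a launcher / deepspeed.init_distributed() sets up the
# process group first. Only for a model like this do the MoE flags apply:
model = ToyMoEBlock().half().cuda()
ds_engine = deepspeed.init_inference(model, mp_size=1, dtype=torch.half,
                                     moe=True, moe_experts=6, ep_size=1)
```

A plain Hugging Face `BertModel`, as in the original snippet, contains no MoE layers, so the MoE flags have nothing to act on.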

flint-stone commented 2 years ago

Thanks -- I'm following the instructions from https://www.deepspeed.ai/tutorials/moe-inference-tutorial/ and I'm getting an error like this:

```
root@6a8cf98fd467:/mnt/Megatron-DeepSpeed/examples# ./generate_text.sh
deepspeed --num_nodes 1 --num_gpus 1 /mnt/Megatron-DeepSpeed/tools/generate_samples_gpt.py --tensor-model-parallel-size 1 --num-layers 24 --hidden-size 2048 --num-attention-heads 16 --max-position-embeddings 1024 --tokenizer-type GPT2BPETokenizer --fp16 --num-experts 2 --mlp-type standard --micro-batch-size 8 --seq-length 10 --out-seq-length 10 --temperature 1.0 --vocab-file /mnt/Megatron-DeepSpeed/gpt2-vocab.json --merge-file /mnt/Megatron-DeepSpeed/gpt2-merges.txt --genfile unconditional_samples.json --top_p 0.9 --log-interval 1 --num-samples 800 --ds-inference
[2022-03-04 19:28:41,420] [WARNING] [runner.py:155:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-03-04 19:28:41,448] [INFO] [runner.py:438:main] cmd = /opt/conda/bin/python3.8 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 /mnt/Megatron-DeepSpeed/tools/generate_samples_gpt.py --tensor-model-parallel-size 1 --num-layers 24 --hidden-size 2048 --num-attention-heads 16 --max-position-embeddings 1024 --tokenizer-type GPT2BPETokenizer --fp16 --num-experts 2 --mlp-type standard --micro-batch-size 8 --seq-length 10 --out-seq-length 10 --temperature 1.0 --vocab-file /mnt/Megatron-DeepSpeed/gpt2-vocab.json --merge-file /mnt/Megatron-DeepSpeed/gpt2-merges.txt --genfile unconditional_samples.json --top_p 0.9 --log-interval 1 --num-samples 800 --ds-inference
[2022-03-04 19:28:42,402] [INFO] [launch.py:96:main] 0 NCCL_VERSION=2.11.4+cuda11.4
[2022-03-04 19:28:42,402] [INFO] [launch.py:103:main] WORLD INFO DICT: {'localhost': [0]}
[2022-03-04 19:28:42,402] [INFO] [launch.py:109:main] nnodes=1, num_local_procs=1, node_rank=0
[2022-03-04 19:28:42,402] [INFO] [launch.py:122:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2022-03-04 19:28:42,402] [INFO] [launch.py:123:main] dist_world_size=1
[2022-03-04 19:28:42,402] [INFO] [launch.py:125:main] Setting CUDA_VISIBLE_DEVICES=0
Traceback (most recent call last):
  File "/mnt/Megatron-DeepSpeed/tools/generate_samples_gpt.py", line 29, in <module>
    from megatron.checkpointing import load_checkpoint
  File "/mnt/Megatron-DeepSpeed/megatron/checkpointing.py", line 25, in <module>
    from megatron import (get_args,
  File "/mnt/Megatron-DeepSpeed/megatron/utils.py", line 24, in <module>
    import amp_C
ImportError: /opt/conda/lib/python3.8/site-packages/amp_C.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
[2022-03-04 19:28:44,415] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 4640
[2022-03-04 19:28:44,416] [ERROR] [launch.py:184:sigkill_handler] ['/opt/conda/bin/python3.8', '-u', '/mnt/Megatron-DeepSpeed/tools/generate_samples_gpt.py', '--local_rank=0', '--tensor-model-parallel-size', '1', '--num-layers', '24', '--hidden-size', '2048', '--num-attention-heads', '16', '--max-position-embeddings', '1024', '--tokenizer-type', 'GPT2BPETokenizer', '--fp16', '--num-experts', '2', '--mlp-type', 'standard', '--micro-batch-size', '8', '--seq-length', '10', '--out-seq-length', '10', '--temperature', '1.0', '--vocab-file', '/mnt/Megatron-DeepSpeed/gpt2-vocab.json', '--merge-file', '/mnt/Megatron-DeepSpeed/gpt2-merges.txt', '--genfile', 'unconditional_samples.json', '--top_p', '0.9', '--log-interval', '1', '--num-samples', '800', '--ds-inference'] exits with return code = 1
```

A quick search online suggests that this could be caused by incompatible versions of Apex and PyTorch. I'm using PyTorch 1.10.2 with Apex 0.1 (and the latest DeepSpeed compiled from source). Are there recommended PyTorch and Apex versions to use with this example?
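
As a quick check of that diagnosis (an illustrative sketch, not from this thread): importing Apex's compiled extension directly should reproduce the failure, since an undefined-symbol error in `amp_C` typically means Apex was built against a different torch build than the one currently installed:

```python
import torch

print("torch:", torch.__version__, "cuda:", torch.version.cuda)

try:
    import amp_C  # Apex's fused CUDA kernels, as seen in the traceback above
    print("amp_C imports cleanly")
except ImportError as err:
    # An undefined-symbol error here points to an Apex built against a
    # different torch; rebuilding Apex from source against the installed
    # torch usually resolves it.
    print("amp_C failed to import:", err)
```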

Thanks!