Closed: NavinKumarMNK closed this issue 7 months ago.
Did you build MLIR yourself? The error points to /root/llvm-project/mlir/lib/Analysis/SliceAnalysis.cpp. vLLM users typically don't need to reach into MLIR C++ source files.
Yes, I built it myself, with LLVM checked out at c5dede880d175f7229c9b2923f4753e12702305d:
RUN cmake -G Ninja ../llvm \
-DLLVM_ENABLE_PROJECTS="mlir;llvm" \
-DLLVM_BUILD_EXAMPLES=ON \
-DLLVM_TARGETS_TO_BUILD="PowerPC;NVPTX;X86;AMDGPU;RISCV" \
-DMLIR_ENABLE_CUDA_RUNNER=ON \
-DCMAKE_BUILD_TYPE=Release \
-DLLVM_ENABLE_ASSERTIONS=ON \
-DCMAKE_C_COMPILER=clang \
-DCMAKE_CXX_COMPILER=clang++ \
-DLLVM_ENABLE_RTTI=ON \
-DLLVM_INSTALL_UTILS=ON \
-DMLIR_INCLUDE_INTEGRATION_TESTS=ON
There are heavy binary dependencies here: PyTorch pins a Triton commit, Triton pins an MLIR/LLVM commit, and vLLM pins a PyTorch release version.
I don't think you can use a custom build of MLIR; it can break at any time.
I have rectified as much as I can. If this isn't solvable in a straightforward way, can you guide me on where to start? Any ideas on how to approach it?
I'm not clear why the Mistral model runs fine but Mixtral does not, and the error happens while loading the model.
Can I know how much memory is needed to load the Mixtral model? I use 4x V100 (32 GB each).
If you can use publicly released PyTorch/Triton/MLIR and build vLLM from source, it should work.
Actually, for ppc64le there is no direct way to do that for the Mixtral model; there is no public Triton release for ppc64le. Can you confirm the memory needed to load and run inference on the Mixtral model?
I found this llvm-issue, where the error is almost identical to mine. I hope this bug has nothing to do with vLLM; my LLVM build commit already includes the fix for that bug, though.
It's hard to give a specific number for the memory requirement, but people typically use 8 GPUs to run Mixtral-8x7B models.
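(Editor's note: as a rough back-of-the-envelope check, assuming Mixtral-8x7B's roughly 47B total parameters, which is an approximation, the weights alone in fp16 come out to about 87 GiB:)

# Rough estimate of weight memory for Mixtral-8x7B in fp16 (assumption: ~46.7B total params).
total_params = 46.7e9          # approximate total parameter count, all 8 experts included
bytes_per_param = 2            # fp16
weight_gib = total_params * bytes_per_param / 1024**3
print(f"weights alone: ~{weight_gib:.0f} GiB")        # ~87 GiB

# Split across 4 x V100 32 GB with tensor parallelism (the setup described in this thread):
per_gpu = weight_gib / 4
print(f"per GPU: ~{per_gpu:.0f} GiB of 32 GiB")       # ~22 GiB, before KV cache and activations

So the weights alone would roughly fit on 4 x 32 GB, leaving on the order of 10 GiB per GPU for KV cache and activations.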
Even just for loading the model? Since I use 4 GPUs with 32 GB each, wouldn't that be the problem?
perlthoughts/Mistral-7B-Instruct-v0.2-2x7B-MoE is supported by vLLM, right? If so, I will run a test on this model, which should tell us whether the problem is about MoE or the size of the model. (Note: Mistral-7B runs fine in my setup.)
I'm not sure; your situation is very complicated. It's worth trying 8 GPUs.
My suggestion would be to find all the commits used to build the public releases and build them yourself. That's tedious, but it's the only way if they don't publish releases for your architecture (ppc64le).
The problem is I don't have 8 GPUs. Alright, thanks. Let me get back to you if I get the same issue with a smaller LLM (if it is supported).
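(Editor's note: for reference, a minimal sketch of such a test. This is a hypothetical script; the model name is taken from above, and tensor_parallel_size, dtype, and enforce_eager are assumed to mirror the failing Mixtral run on 4x V100:)

# Hypothetical repro script: load a small MoE model configured the same way as the Mixtral attempt.
from vllm import LLM, SamplingParams

llm = LLM(
    model="perlthoughts/Mistral-7B-Instruct-v0.2-2x7B-MoE",
    tensor_parallel_size=4,      # assumption: same 4 x V100 setup as the Mixtral attempt
    dtype="float16",             # V100 has no bf16, so fp16 as in the logs below
    trust_remote_code=True,
    enforce_eager=True,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)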
I get the same error when I run the perlthoughts/Mistral-7B-Instruct-v0.2-2x7B-MoE model, so this is not caused by memory issues.
Can I know if there is any specific MLIR/LLVM commit that syncs well with vLLM?
(vLLM supports triton==2.1.0, right? I built it from the public release source, with the pinned LLVM commit that I got from one of the issues on the Triton GitHub.)
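(Editor's note: a custom Triton build can also be smoke-tested without vLLM at all. A minimal sketch using the standard Triton 2.x Python API, runnable on any CUDA GPU:)

# Minimal Triton sanity check, independent of vLLM: compile and run a trivial kernel.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, x + y)
print("triton", triton.__version__, "kernel compiled and ran OK")

If this trivial kernel fails to compile, the problem is in the Triton/MLIR build itself rather than in the MoE kernel specifically.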
update:
root@0fca177ad2d4:/workspace# python3 example.py
WARNING 03-29 18:49:27 config.py:686] Casting torch.bfloat16 to torch.float16.
2024-03-29 18:49:29,751 INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
INFO 03-29 18:49:32 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='./yi-34b', tokenizer='./yi-34b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=safetensors, tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 03-29 18:49:48 attention.py:67] flash_attn is not found. Using xformers backend.
(RayWorkerVllm pid=14661) INFO 03-29 18:49:48 attention.py:67] flash_attn is not found. Using xformers backend.
INFO 03-29 18:50:03 model_runner.py:97] Loading model weights took 16.0451 GB
(RayWorkerVllm pid=14661) INFO 03-29 18:50:13 model_runner.py:97] Loading model weights took 16.0451 GB
(RayWorkerVllm pid=14763) INFO 03-29 18:49:48 attention.py:67] flash_attn is not found. Using xformers backend. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
INFO 03-29 18:50:23 ray_gpu_executor.py:234] # GPU blocks: 11710, # CPU blocks: 4369
(RayWorkerVllm pid=14712) INFO 03-29 18:50:13 model_runner.py:97] Loading model weights took 16.0451 GB [repeated 2x across cluster]
root@0fca177ad2d4:/workspace# nano example.py
root@0fca177ad2d4:/workspace# python3 example.py
WARNING 03-29 18:50:46 config.py:686] Casting torch.bfloat16 to torch.float16.
2024-03-29 18:50:48,751 INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
INFO 03-29 18:50:51 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='./yi-34b', tokenizer='./yi-34b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=safetensors, tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 03-29 18:51:07 attention.py:67] flash_attn is not found. Using xformers backend.
(RayWorkerVllm pid=22176) INFO 03-29 18:51:07 attention.py:67] flash_attn is not found. Using xformers backend.
INFO 03-29 18:51:26 model_runner.py:97] Loading model weights took 16.0451 GB
(RayWorkerVllm pid=22227) INFO 03-29 18:51:56 model_runner.py:97] Loading model weights took 16.0451 GB
(RayWorkerVllm pid=22278) INFO 03-29 18:51:07 attention.py:67] flash_attn is not found. Using xformers backend. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayWorkerVllm pid=22176) INFO 03-29 18:52:06 model_runner.py:97] Loading model weights took 16.0451 GB [repeated 2x across cluster]
INFO 03-29 18:52:15 ray_gpu_executor.py:234] # GPU blocks: 11710, # CPU blocks: 4369
Processed prompts: 100%|████████████████████████████████████████| 4/4 [00:01<00:00, 3.30it/s]
Prompt: 'Hello, my name is', Generated text: " Adam and I'm from Germany. I'm 30 years old"
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States, indirectly elected to'
Prompt: 'The capital of France is', Generated text: ' Paris.\n对，法国的首都确实是巴黎。巴黎是法国的政治'
Prompt: 'The future of AI is', Generated text: ' still being written, and as technology continues to evolve, we can expect AI to'
root@0fca177ad2d4:/workspace#
It works fine with the Yi-34B model, so this is a problem with MoE.
Yeah, probably because we have a Triton kernel for MoE, and that Triton kernel triggered some bug in your custom-built Triton and MLIR versions.
Can I know more about the version required for the kernel? I will try building that specific version from source. Are there any other MoE models that vLLM supports?
I think you can find the Triton commit for public versions in their repo, e.g. https://github.com/openai/triton/tree/v2.1.0. But be careful: they might pin an LLVM commit.
This is the same commit I built from source, and I used the pinned LLVM commit. my-triton-fork is forked from the same release, with some of my automation scripts added.
How about testing it on an x86_64 machine first? Maybe this is a ppc64le-related problem.
Alright, I will let you know about this ASAP.
I don't have big GPUs attached to my x86_64 machine. Is it possible to run the MoE kernel alone on it?
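(Editor's note: vLLM's fused MoE kernel can likely be exercised in isolation with random weights, which needs only a few GiB of GPU memory. A rough sketch; the module path, signature, and tensor shapes here are assumptions based on vLLM around v0.3.x and should be checked against the installed source:)

# Standalone exercise of vLLM's fused MoE Triton kernel (module path/signature assumed
# from vLLM ~v0.3.x; verify against your installed vllm source before relying on this).
import torch
from vllm.model_executor.layers.fused_moe import fused_moe

num_tokens, hidden, inter, num_experts, topk = 16, 4096, 14336, 8, 2
x  = torch.randn(num_tokens, hidden, dtype=torch.float16, device="cuda")
w1 = torch.randn(num_experts, 2 * inter, hidden, dtype=torch.float16, device="cuda")  # gate+up proj
w2 = torch.randn(num_experts, hidden, inter, dtype=torch.float16, device="cuda")      # down proj
gating = torch.randn(num_tokens, num_experts, dtype=torch.float32, device="cuda")

out = fused_moe(x, w1, w2, gating, topk, renormalize=True)
print(out.shape)  # expected: (num_tokens, hidden)

With random weights of this size, only a couple of GiB of GPU memory is needed, so even a small x86_64 GPU should be enough to see whether the kernel compiles under the custom Triton/MLIR build.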
This is not a bug in vllm, it's a bug in Triton.
Alright! Thanks, I am closing the issue.
@jlebar thanks for coming here to point it out! It's strange that we only hit this bug now; the MoE Triton kernel has been used by many users with no problems.
Your current environment
🐛 Describe the bug
example.py: I loaded the mixtral-8x7b-instruct fp16 model.
Thank you. Let me know if I can give any more details. I was able to load and serve the mistral-7b-instruct fp16 model successfully, whereas I couldn't even load the mixtral model.