runpod-workers / worker-vllm

The RunPod worker template for serving our large language model endpoints. Powered by vLLM.

Cannot run Mixtral 8x7B Instruct AWQ #49

Closed ddemillard closed 6 months ago

ddemillard commented 6 months ago

I have successfully been able to run mistral/Mistral-7b-Instruct in both its original and quantized (AWQ) formats on RunPod serverless using this repo. However, when I try to run Mixtral AWQ, I simply get no text output in the chat response and no errors. I have tried many times with various configurations.

Nothing so far has worked, and I would appreciate any new ideas. I really love the concept of RunPod and serverless GPUs for LLMs, but this issue has me stumped.

Please see the build configuration, example input, and incorrect outputs below:

Build command: docker build --network=host -t danielallium/mixtral-instruct-awq:v0.4 --build-arg MODEL_NAME="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ" --build-arg BASE_PATH="/models" --build-arg WORKER_CUDA_VERSION="12.1.0" --build-arg QUANTIZATION=awq .

Input: { "input": { "prompt": "Hello World" } }

Output without errors:

{
  "delayTime": 81440,
  "executionTime": 1561,
  "id": "334efe4a-9039-47e6-8dec-ec772df94370-u1",
  "output": [
    {
      "choices": [{ "tokens": [""] }],
      "usage": { "input": 3, "output": 16 }
    }
  ],
  "status": "COMPLETED"
}

Note that "tokens" returns a non-empty string with any other LLM. You can also test my Docker image directly, as it is public at: danielallium/mixtral-instruct-awq:v0.4

This build works just fine with the above prompt, though: docker build --network=host -t danielallium/mistral-7b-instruct-awq:v0.2 --build-arg MODEL_NAME="TheBloke/Mistral-7B-Instruct-v0.2-AWQ" --build-arg BASE_PATH="/models" --build-arg WORKER_CUDA_VERSION="12.1.0" --build-arg QUANTIZATION=awq .

alpayariyak commented 6 months ago

Hi, try setting the TRUST_REMOTE_CODE env var to 1 in the template.
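
For anyone following along, a minimal sketch of how that might look when smoke-testing the published image locally with docker run (on RunPod serverless you would set the same variable under the endpoint template's environment variables; the local-run invocation itself is an assumption, not taken from this thread):

```bash
# Minimal sketch: run the image posted above with the suggested env var set.
# TRUST_REMOTE_CODE=1 comes from the suggestion in this thread; running the
# worker image locally like this is an assumption about how you might test it.
docker run --rm --gpus all \
  -e TRUST_REMOTE_CODE=1 \
  danielallium/mixtral-instruct-awq:v0.4
```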

ddemillard commented 6 months ago

Hi, I just tried this and the problem persists. There doesn't really seem to be a problem with loading the model; it appears to load and initialize fine. When I submit a request, GPU utilization goes up and it looks like it is processing. It even reports that tokens were generated via the "output" number, but nothing is ever returned; the output is always empty. Again, I can get other models to work just fine with the exact same configuration, so I'm not really sure what the problem is.

Here are the logs (screenshot attached); let me know if you would like me to expand anything.

These log messages are very similar to the ones from the successful models.

alpayariyak commented 6 months ago

Found this issue: https://github.com/vllm-project/vllm/issues/2833

One user suggests changing the chat template; you can do this with the CUSTOM_CHAT_TEMPLATE env var.
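
A hedged sketch of what that could look like, using a simplified [INST]-style template in the spirit of the Mistral/Mixtral instruct models (the template text below is an approximation for illustration, not copied from the model repository):

```bash
# Sketch: pass an approximate Mistral/Mixtral [INST]-style chat template via
# the CUSTOM_CHAT_TEMPLATE env var mentioned above. The Jinja template below
# is illustrative only; verify it against the model's tokenizer_config.json.
docker run --rm --gpus all \
  -e TRUST_REMOTE_CODE=1 \
  -e CUSTOM_CHAT_TEMPLATE="{{ bos_token }}{% for message in messages %}{% if message['role'] == 'user' %}[INST] {{ message['content'] }} [/INST]{% elif message['role'] == 'assistant' %}{{ message['content'] }}{{ eos_token }}{% endif %}{% endfor %}" \
  danielallium/mixtral-instruct-awq:v0.4
```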

ddemillard commented 6 months ago

It appears that this particular Hugging Face model is broken.

Use this version instead: casperhansen/mixtral-instruct-awq
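
For reference, a sketch of the earlier build command with only the model name swapped to this repository (the image tag here is illustrative):

```bash
# Sketch: same build as before, pointing MODEL_NAME at the working AWQ repo.
docker build --network=host -t danielallium/mixtral-instruct-awq:v0.5 \
  --build-arg MODEL_NAME="casperhansen/mixtral-instruct-awq" \
  --build-arg BASE_PATH="/models" \
  --build-arg WORKER_CUDA_VERSION="12.1.0" \
  --build-arg QUANTIZATION=awq .
```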

Everything works fine with that version. Thanks for your help, and you can close this issue.

alpayariyak commented 6 months ago

That's great to hear! Try out the new worker-vllm version, which is now OpenAI compatible, when you get the chance :)

rafa-9 commented 6 months ago

@alpayariyak Mixtral doesn't seem to work on the Serverless vLLM worker template. I tried using 48 GB and 80 GB GPUs and tried the quantized GPTQ and AWQ versions mentioned by @ddemillard, but it still gives me an out-of-memory error, as seen below:

Error initializing vLLM engine: CUDA error: out of memory
2024-02-26T15:04:51.439431901Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-02-26T15:04:51.439437021Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2024-02-26T15:04:51.439440722Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-02-26T15:04:51.439443992Z 
2024-02-26T15:04:51.439447232Z Traceback (most recent call last):
2024-02-26T15:04:51.439451172Z   File "/src/handler.py", line 6, in <module>
2024-02-26T15:04:51.439453462Z     vllm_engine = vLLMEngine()
2024-02-26T15:04:51.439457032Z   File "/src/engine.py", line 25, in __init__
2024-02-26T15:04:51.439459292Z     self.llm = self._initialize_llm() if engine is None else engine
2024-02-26T15:04:51.439461612Z   File "/src/engine.py", line 108, in _initialize_llm
2024-02-26T15:04:51.439483702Z     raise e
2024-02-26T15:04:51.439492672Z   File "/src/engine.py", line 105, in _initialize_llm
2024-02-26T15:04:51.439495802Z     return AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**self.config))
2024-02-26T15:04:51.439498252Z   File "/vllm-installation/vllm/engine/async_llm_engine.py", line 625, in from_engine_args
2024-02-26T15:04:51.439656373Z     engine = cls(parallel_config.worker_use_ray,
2024-02-26T15:04:51.439668973Z   File "/vllm-installation/vllm/engine/async_llm_engine.py", line 321, in __init__
2024-02-26T15:04:51.439672784Z     self.engine = self._init_engine(*args, **kwargs)

Is CUDA 12.1 required for this? If so, I don't understand this statement in the README: "When creating an Endpoint, select CUDA Version 12.2 and 12.1 in the filter." There is no such filter when creating a Serverless Endpoint. I am using the image tag runpod/worker-vllm:0.3.0-cuda12.1.0 and inserting it in the Container Image textbox.
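
For reference, a hedged sketch of a lower-memory configuration for the prebuilt image, assuming the worker forwards these settings to vLLM as environment variables (the variable names below are assumptions based on vLLM's engine arguments, not confirmed in this thread):

```bash
# Sketch only: cap context length and GPU memory use to avoid the OOM.
# MODEL_NAME, MAX_MODEL_LEN, and GPU_MEMORY_UTILIZATION are assumed env var
# names; check the worker-vllm README for the exact ones your version supports.
docker run --rm --gpus all \
  -e MODEL_NAME="casperhansen/mixtral-instruct-awq" \
  -e QUANTIZATION=awq \
  -e MAX_MODEL_LEN=4096 \
  -e GPU_MEMORY_UTILIZATION=0.90 \
  runpod/worker-vllm:0.3.0-cuda12.1.0
```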