lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

Mixtral-8x7B-Instruct-v0.1: Chat arena vs local inference #2850

Open jalajthanaki opened 10 months ago

jalajthanaki commented 10 months ago

Here is how I made Mixtral-8x7B-Instruct-v0.1 work using FastChat's vllm_worker.
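
For context, this is roughly the launch sequence (a minimal sketch; the model path, ports, and --tensor-parallel-size are assumptions here and need to match your own setup):

# Start the controller
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001

# Start the vLLM worker serving Mixtral (adjust --tensor-parallel-size to your GPU count)
python3 -m fastchat.serve.vllm_worker \
    --model-path mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --model-names Mixtral \
    --controller-address http://localhost:21001 \
    --tensor-parallel-size 2

# Start the OpenAI-compatible API server on port 8000
python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000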

Still, I find the answers from the Chat Arena's Mixtral-8x7B-Instruct-v0.1 much better.

curl --location "http://$IP:8000/v1/chat/completions" \
--header 'Content-Type: application/json' \
--data '{
    "model": "Mixtral",
    "messages": [
        {
            "role": "user",
            "content": "who are you?"
        }
    ],
    "temperature": 0.7,
    "max_tokens": 1024,
    "top_p": 1
}'

Answer from local inference.

{
    "id": "chatcmpl-GohBJuu5P6kXkDzU3BsCib",
    "object": "chat.completion",
    "created": 1703237717,
    "model": "Mixtral",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": " I am an artificial intelligence assistant, designed to help answer questions, provide information, and assist with various tasks to make your life easier and more convenient."
            },
            "finish_reason": "stop"

        }
    ],
    "usage": {
        "prompt_tokens": 540,
        "total_tokens": 571,
        "completion_tokens": 31
    }
}

Answer from the Chat Arena.

(Screenshot of the Chat Arena response, 2023-12-22 at 3:08 PM)

I still have a few questions.

Is there anything that I'm still missing?

aDingil commented 10 months ago

very cool, thx for sharing

aDingil commented 10 months ago

System prompts have a huge effect on response quality, so that could be it.
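
For example, you could try sending an explicit system message with the same request (the system prompt text below is just an illustrative placeholder):

curl --location "http://$IP:8000/v1/chat/completions" \
--header 'Content-Type: application/json' \
--data '{
    "model": "Mixtral",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful and knowledgeable assistant. Answer in detail."
        },
        {
            "role": "user",
            "content": "who are you?"
        }
    ],
    "temperature": 0.7,
    "max_tokens": 1024,
    "top_p": 1
}'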

About the throughput: https://anakin.ai/blog/how-to-run-mixtral-8x7b-locally/#specs-you-need-to-run-mixtral-8x7b-locally, but I guess 6-7 tokens/sec should be okay when streaming the response. The average reading speed of a person is around 200 wpm.
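
If the concern is perceived latency rather than raw throughput, you can stream the response so tokens show up as they are generated; a minimal sketch against the same endpoint (assuming the OpenAI-style stream flag, which FastChat's openai_api_server supports):

curl --location "http://$IP:8000/v1/chat/completions" \
--header 'Content-Type: application/json' \
--data '{
    "model": "Mixtral",
    "messages": [
        {"role": "user", "content": "who are you?"}
    ],
    "temperature": 0.7,
    "max_tokens": 1024,
    "stream": true
}'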