lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

Mixtral-8x7B-Instruct-v0.1: Chat arena vs local inference #2850

Open jalajthanaki opened 10 months ago

jalajthanaki commented 10 months ago

Here is how I made Mixtral-8x7B-Instruct-v0.1 work using FastChat's vllm_worker.
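
For context, this is roughly the launch sequence (a minimal sketch; the model path, ports, and --tensor-parallel-size are assumptions here and need to match your own setup):

# Start the controller
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001

# Start the vLLM worker serving Mixtral (adjust --tensor-parallel-size to your GPU count)
python3 -m fastchat.serve.vllm_worker \
    --model-path mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --model-names Mixtral \
    --controller-address http://localhost:21001 \
    --tensor-parallel-size 2

# Start the OpenAI-compatible API server on port 8000
python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000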

Still, I find the answers from the Chat Arena's Mixtral-8x7B-Instruct-v0.1 much better.

curl --location "http://$IP:8000/v1/chat/completions" \
--header 'Content-Type: application/json' \
--data '{
    "model": "Mixtral",
    "messages": [
        {
            "role": "user",
            "content": "who are you?"
        }
    ],
    "temperature": 0.7,
    "max_tokens": 1024,
    "top_p": 1
}'

Answer from local inference.

{
    "id": "chatcmpl-GohBJuu5P6kXkDzU3BsCib",
    "object": "chat.completion",
    "created": 1703237717,
    "model": "Mixtral",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": " I am an artificial intelligence assistant, designed to help answer questions, provide information, and assist with various tasks to make your life easier and more convenient."
            },
            "finish_reason": "stop"

        }
    ],
    "usage": {
        "prompt_tokens": 540,
        "total_tokens": 571,
        "completion_tokens": 31
    }
}

Answer from the Chat Arena.

(Screenshot of the Chat Arena response, 2023-12-22 at 3:08 PM)

I still have a few questions.

Is there anything that I'm still missing?

aDingil commented 10 months ago

very cool, thx for sharing

aDingil commented 10 months ago

System prompts have a huge effect on response quality, so that could be it.
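
For example, you could try sending an explicit system message with the same request (the system prompt text below is just an illustrative placeholder):

curl --location "http://$IP:8000/v1/chat/completions" \
--header 'Content-Type: application/json' \
--data '{
    "model": "Mixtral",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful and knowledgeable assistant. Answer in detail."
        },
        {
            "role": "user",
            "content": "who are you?"
        }
    ],
    "temperature": 0.7,
    "max_tokens": 1024,
    "top_p": 1
}'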

About the throughput: https://anakin.ai/blog/how-to-run-mixtral-8x7b-locally/#specs-you-need-to-run-mixtral-8x7b-locally, but I guess 6-7 tokens/sec should be okay when streaming the response. The average reading speed of a person is around 200 wpm.
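
If the concern is perceived latency rather than raw throughput, you can stream the response so tokens show up as they are generated; a minimal sketch against the same endpoint (assuming the OpenAI-style stream flag, which FastChat's openai_api_server supports):

curl --location "http://$IP:8000/v1/chat/completions" \
--header 'Content-Type: application/json' \
--data '{
    "model": "Mixtral",
    "messages": [
        {"role": "user", "content": "who are you?"}
    ],
    "temperature": 0.7,
    "max_tokens": 1024,
    "stream": true
}'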