mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed inference
https://localai.io
MIT License

Completion endpoint does not count tokens when using vLLM backend #3436

Open ephraimrothschild opened 2 months ago

ephraimrothschild commented 2 months ago

LocalAI version:

localai/localai:v2.20.1-cublas-cuda12

Environment, CPU architecture, OS, and Version:

Linux dev-box 6.8.0-41-generic #41-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 2 20:41:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Describe the bug

When making calls to both the /chat/completions and /completions endpoints, models backed by vLLM do not count tokens correctly and report that no tokens were used, despite correctly completing the prompt. This is not an issue with vLLM itself, since running the exact same model using vLLM's provided OpenAI server Docker image correctly returns the actual token counts of the response.

To Reproduce

What Works (vLLM direct)

First, we can show the correct behavior coming from vLLM:

  1. Run vLLM with the model:

    docker run --name localai --runtime nvidia --gpus all \
    -v ~/models:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model akjindal53244/Llama-3.1-Storm-8B \
    --gpu-memory-utilization 0.95 \
    --max-model-len 49000
  2. Then send a request to http://localhost:8000/v1/chat/completions with the following body (a Python version of this request is sketched after this list):

    {
    "model": "akjindal53244/Llama-3.1-Storm-8B",
    "messages": [
        {
            "role": "user",
            "content": "Hello, tell me something interesting"
        }
    ],
    "max_tokens": 20
    }
  3. Note that we get the following response:

    {
    "id": "chat-625a7689ddd145ca9872e632645e8e53",
    "object": "chat.completion",
    "created": 1725041558,
    "model": "akjindal53244/Llama-3.1-Storm-8B",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Did you know that there is a type of jellyfish that is immortal? The Turritopsis do",
                "tool_calls": []
            },
            "logprobs": null,
            "finish_reason": "length",
            "stop_reason": null
        }
    ],
    "usage": {
        "prompt_tokens": 16,
        "total_tokens": 36,
        "completion_tokens": 20
    },
    "prompt_logprobs": null
    }

    This response contains the correct usage data.
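
For reference, a minimal Python sketch of sending the step-2 request (the requests package and the port from step 1 are assumptions based on the setup above):

    # Minimal sketch: send the step-2 body and inspect the reported usage.
    # Assumes the `requests` package and the vLLM container from step 1
    # listening on localhost:8000.
    import requests

    body = {
        "model": "akjindal53244/Llama-3.1-Storm-8B",
        "messages": [{"role": "user", "content": "Hello, tell me something interesting"}],
        "max_tokens": 20,
    }
    resp = requests.post("http://localhost:8000/v1/chat/completions", json=body)
    # Against vLLM's OpenAI server this prints real token counts.
    print(resp.json()["usage"])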

What Doesn't Work (vLLM via LocalAI)

Now we'll try the same model, with the same configuration, but running through LocalAI instead of directly through vLLM.

  1. Create the following model template:
    name: akjindal53244/Llama-3.1-Storm-8B
    backend: vllm
    parameters:
      model: "akjindal53244/Llama-3.1-Storm-8B"
    template:
      use_tokenizer_template: true
    gpu_memory_utilization: 0.95
    max_model_len: 49312
    cuda: true
    stopwords:
    - "<|im_end|>"
    - "<dummy32000>"
    - "<|eot_id|>"
    - "<|end_of_text|>"
  2. Then run LocalAI with the following command:
    docker run -d --name localai -ti -p 8000:8080 --gpus '"device=0"' \
       -e DEBUG=false \
       -e GALLERIES='[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]' \
       -e LOCALAI_PARALLEL_REQUESTS=true \
       -e PYTHON_GRPC_MAX_WORKERS=32 \
       -e LOCALAI_SINGLE_ACTIVE_BACKEND=true \
       -e NVIDIA_VISIBLE_DEVICES=all \
       -v $PWD/models:/models localai/localai:v2.20.1-cublas-cuda12  \
       --threads=14 --f16=true --debug=false  \
       --parallel-requests=true --models-path /models
  3. Then send the exact same request as before to the exact same endpoint (http://localhost:8000/v1/chat/completions):
    {
    "model": "akjindal53244/Llama-3.1-Storm-8B",
    "messages": [
        {
            "role": "user",
            "content": "Hello, tell me something interesting"
        }
    ],
    "max_tokens": 20
    }

  4. However, notice that the response now contains all 0s for usage data:

    {
    "created": 1725042079,
    "object": "chat.completion",
    "id": "d92fa493-addb-423c-8f60-fbf3026a56f2",
    "model": "akjindal53244/Llama-3.1-Storm-8B",
    "choices": [
        {
            "index": 0,
            "finish_reason": "stop",
            "message": {
                "role": "assistant",
                "content": "I'd be delighted to!\n\nDid you know that there is a type of jellyfish that is immortal"
            }
        }
    ],
    "usage": {
        "prompt_tokens": 0,
        "completion_tokens": 0,
        "total_tokens": 0
    }
    }

Expected behavior

The response from the vLLM server and from the LocalAI server running a vLLM backend should be identical; in particular, LocalAI's usage data should be correct. Instead, it reports all 0s for usage despite returning a non-empty response.
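
For anyone scripting a check against LocalAI, a minimal sketch with the openai Python client (the package version and the port mapping are assumptions based on the steps above):

    # Minimal sketch: query LocalAI and print the usage block.
    # Assumes openai>=1.0 and the LocalAI container mapped to localhost:8000.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-no-key-needed")
    resp = client.chat.completions.create(
        model="akjindal53244/Llama-3.1-Storm-8B",
        messages=[{"role": "user", "content": "Hello, tell me something interesting"}],
        max_tokens=20,
    )
    # Currently prints all zeros for the vLLM backend; the same call against
    # vLLM's own OpenAI server reports real counts.
    print(resp.usage)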

Logs

localai
@@@@@
Skipping rebuild
@@@@@
If you are experiencing issues with the pre-compiled builds, try setting REBUILD=true
If you are still experiencing issues with the build, try setting CMAKE_ARGS and disable the instructions set as needed:
CMAKE_ARGS="-DGGML_F16C=OFF -DGGML_AVX512=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF"
see the documentation at: https://localai.io/basics/build/index.html
Note: See also https://github.com/go-skynet/LocalAI/issues/288
@@@@@
CPU info:
model name  : Intel(R) Core(TM) i9-14900K
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect user_shstk avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
CPU:    AVX    found OK
CPU:    AVX2   found OK
CPU: no AVX512 found
@@@@@
6:23PM INF env file found, loading environment variables from file envFile=.env
6:23PM INF Setting logging to info
6:23PM INF Starting LocalAI using 14 threads, with models path: /models
6:23PM INF LocalAI version: v2.20.1 (a9c521eb41dc2dd63769e5362f05d9ab5d8bec50)
WARNING: failed to read int from file: open /sys/class/drm/card0/device/numa_node: no such file or directory
WARNING: error parsing the pci address "simple-framebuffer.0"
6:23PM ERR config is not valid
6:23PM INF Preloading models from /models

  Model name: akjindal53244/Llama-3.1-Storm-8B

6:23PM INF core/startup process completed!
6:23PM INF LocalAI API is listening! Please connect to the endpoint for API documentation. endpoint=http://0.0.0.0:8080
6:23PM INF Loading model 'akjindal53244/Llama-3.1-Storm-8B' with backend vllm
6:23PM INF Success ip=192.168.1.175 latency=14.54221519s method=POST status=200 url=/v1/chat/completions
6:24PM INF Success ip=127.0.0.1 latency="37.071µs" method=GET status=200 url=/readyz

Additional context

This issue only happens with vLLM-backed models. It does not happen when, for example, we run the same model on LocalAI with a llama.cpp backend.

dave-gray101 commented 2 months ago

I just took a quick look and I'm not seeing any code in our vllm gRPC backend to pass through token counts.

Thank you for the heads up - this is a bug we'll want to patch

mudler commented 2 months ago

this is not implemented indeed. definitely something we want to add :+1:
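
For reference, vLLM already exposes the token IDs needed for these counts, so the fix is mostly plumbing. A minimal sketch of deriving usage from a vLLM RequestOutput (the offline LLM API is shown only for illustration; how the numbers get wired into LocalAI's gRPC reply is not shown here):

    # Rough sketch: vLLM's RequestOutput carries the prompt and completion
    # token IDs, so usage can be computed from their lengths. The dict below
    # is illustrative; it is not LocalAI's actual gRPC reply schema.
    from vllm import LLM, SamplingParams

    llm = LLM(model="akjindal53244/Llama-3.1-Storm-8B")
    request_output = llm.generate(
        ["Hello, tell me something interesting"],
        SamplingParams(max_tokens=20),
    )[0]

    prompt_tokens = len(request_output.prompt_token_ids)
    completion_tokens = sum(len(o.token_ids) for o in request_output.outputs)
    print({
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    })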