turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

DeepSeek Coder produces only blank completions #267

Closed · viktor-ferenczi closed 8 months ago

viktor-ferenczi commented 9 months ago

OS: Ubuntu 22.04
CUDA: 12.3
GPUs: 2x 4090 (2x 24 GB)
Release: exllamav2-0.0.11+cu121-cp310-cp310-linux_x86_64.whl

DeepSeek Coder produces a blank completion:

$ python -u examples/chat.py -m ~/models/LoneStriker/deepseek-coder-33b-instruct-8.0bpw-h8-exl2 -mode deepseek -gs 20,20 -l 16384 -maxr 100
 -- Model: /home/viktor/models/LoneStriker/deepseek-coder-33b-instruct-8.0bpw-h8-exl2
 -- Options: ['gpu_split: 20,20', 'length: 16384', 'rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Prompt format: deepseek
 -- System prompt:

You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.

User: Write a quicksort function in Python 3.

 !! Response exceeded 100 tokens and was cut short.

User:

The response actually consists entirely of newline characters, which can be verified by running the same request through tabbyAPI: it simply returns '\n' * 100 as the completion.
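
A minimal sketch of that check, assuming tabbyAPI is serving its OpenAI-compatible completions endpoint on localhost:5000 (the port, route, and payload fields here are assumptions, not taken from this thread; adjust to the local deployment):

    # Query the same model through tabbyAPI's OpenAI-compatible API
    # and inspect the raw completion text.
    import requests

    resp = requests.post(
        "http://localhost:5000/v1/completions",
        json={"prompt": "Write a quicksort function in Python 3.", "max_tokens": 100},
    )
    text = resp.json()["choices"][0]["text"]

    print(repr(text))          # shows the escaped characters
    print(text == "\n" * 100)  # True for the buggy completion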

For comparison, Phind-CodeLlama-34B-v2 works as expected:

$ python -u examples/chat.py -m ~/models/firelzrd/Phind-CodeLlama-34B-v2-exl2 -mode codellama -gs 20,20 -l 16384 -maxr 100
 -- Model: /home/viktor/models/firelzrd/Phind-CodeLlama-34B-v2-exl2
 -- Options: ['gpu_split: 20,20', 'length: 16384', 'rope_scale 1.0', 'rope_alpha 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Prompt format: codellama
 -- System prompt:

You are a helpful coding assistant. Always answer as helpfully as possible.

User: Write a quicksort function in Python 3.

Here is a simple implementation of the quicksort algorithm in Python 3:

   def quicksort(arr):
       if len(arr) <= 1:
           return arr
       pivot = arr[len(arr) // 2]
       left = [x for x in arr if x < pivot]
       middle = [x for x in arr if x == pivot]
       right = [x for x in arr if x >
 !! Response exceeded 100 tokens and was cut short.

User:

I suspect there is some problem with the tokenizer, since this model uses a custom one.

I had to manually install the tokenizers library into the venv to get DeepSeek Coder to load.
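
For anyone hitting the same import error, the manual install is just (run inside the activated venv):

$ pip install tokenizers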

turboderp commented 8 months ago

DeepSeek Coder is a little special in that it requires RoPE scaling to work. In the latest dev version the relevant parameters are loaded automatically from the model's config, so once 0.0.12 is ready for release the problem should just go away. In the meantime, on 0.0.11 you can still specify RoPE scaling manually; for DeepSeek Coder you'd want to add -rs 4 on the command line.
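
A rough sketch of the same workaround through the library API rather than chat.py, assuming the 0.0.11-era ExLlamaV2 interface where linear RoPE scaling is exposed as ExLlamaV2Config.scale_pos_emb (the attribute and method names here are from memory, not from this thread, so treat them as assumptions):

    import os
    from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Tokenizer

    config = ExLlamaV2Config()
    config.model_dir = os.path.expanduser(
        "~/models/LoneStriker/deepseek-coder-33b-instruct-8.0bpw-h8-exl2"
    )
    config.prepare()

    config.scale_pos_emb = 4.0  # linear RoPE scale, equivalent to -rs 4
    config.max_seq_len = 16384  # equivalent to -l 16384

    model = ExLlamaV2(config)
    model.load([20, 20])        # gpu_split across the two 4090s, like -gs 20,20
    tokenizer = ExLlamaV2Tokenizer(config)

The factor of 4 is presumably the same value the dev version reads from the rope_scaling entry in the model's config.json.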

viktor-ferenczi commented 8 months ago

Thank you. Setting the RoPE scale to 4 helped.
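
For reference, the working invocation is the original command from the report plus the flag turboderp suggested:

$ python -u examples/chat.py -m ~/models/LoneStriker/deepseek-coder-33b-instruct-8.0bpw-h8-exl2 -mode deepseek -gs 20,20 -l 16384 -maxr 100 -rs 4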