theroyallab / tabbyAPI

An OAI compatible exllamav2 API that's both lightweight and fast
GNU Affero General Public License v3.0

[BUG] Completions are broken #179

Closed TyraVex closed 1 month ago

TyraVex commented 2 months ago

OS

Arch Linux

GPU Library

CUDA 12.x, tried 12.4, 12.5, 12.6, driver nvidia-beta-dkms 560.35.03-1

Python version

3.10, 3.11, 3.12

Describe the bug

Most of the generations I do from tabbyAPI start well and break after a few tokens. Simply put, it spits gibberish.

Reproduction steps

git clone https://github.com/theroyallab/tabbyAPI tabbyapi
cd tabbyapi
python3.11 -m venv env
source env/bin/activate
pip install -U .[cu121]
python3 start.py

config.yml:

# Sample YAML file for configuration.
# Comment and uncomment values as needed. Every value has a default within the application.
# This file serves as a drop-in for config.yml

# Unless specified in the comments, DO NOT put these options in quotes!
# You can use https://www.yamllint.com/ if you want to check your YAML formatting.

# Options for networking
network:
  # The IP to host on (default: 127.0.0.1).
  # Use 0.0.0.0 to expose on all network adapters
  host: 127.0.0.1

  # The port to host on (default: 5000)
  port: 5000

  # Disable HTTP token authentication with requests
  # WARNING: This will make your instance vulnerable!
  # Turn on this option if you are ONLY connecting from localhost
  disable_auth: False

  # Send tracebacks over the API to clients (default: False)
  # NOTE: Only enable this for debug purposes
  send_tracebacks: False

  # Select API servers to enable (default: ["OAI"])
  # Possible values: OAI
  api_servers: ["OAI"]

# Options for logging
logging:
  # Enable prompt logging (default: False)
  prompt: True

  # Enable generation parameter logging (default: False)
  generation_params: True

  # Enable request logging (default: False)
  # NOTE: Only use this for debugging!
  requests: False

# Options for sampling
sampling:
  # Override preset name. Find this in the sampler-overrides folder (default: None)
  # This overrides default fallbacks for sampler values that are passed to the API
  # Server-side overrides are NOT needed by default
  # WARNING: Using this can result in a generation speed penalty
  #override_preset: 

# Options for development and experimentation
developer:
  # Skips exllamav2 version check (default: False)
  # It's highly recommended to update your dependencies rather than enabling this flag
  # WARNING: Don't set this unless you know what you're doing!
  #unsafe_launch: False

  # Disable all request streaming (default: False)
  # A kill switch for turning off SSE in the API server
  #disable_request_streaming: False

  # Enable the torch CUDA malloc backend (default: False)
  # This can save a few MBs of VRAM, but has a risk of errors. Use at your own risk.
  #cuda_malloc_backend: False

  # Enable Uvloop or Winloop (default: False)
  # Make the program utilize a faster async event loop which can improve performance
  # NOTE: It's recommended to enable this, but if something breaks, turn this off.
  #uvloop: False

  # Set process to use a higher priority
  # For realtime process priority, run as administrator or sudo
  # Otherwise, the priority will be set to high
  #realtime_process_priority: False

# Options for model overrides and loading
# Please read the comments to understand how arguments are handled between initial and API loads
model:
  # Overrides the directory to look for models (default: models)
  # Windows users, DO NOT put this path in quotes! This directory will be invalid otherwise.
  model_dir: /home/tyra/storage/models/exl/

  # Sends dummy model names when the models endpoint is queried
  # Enable this if the program is looking for a specific OAI model
  #use_dummy_models: False

  # An initial model to load. Make sure the model is located in the model directory!
  # A model can be loaded later via the API.
  # REQUIRED: This must be filled out to load a model on startup!
  model_name: turboderp_gemma-2-27b-it-exl2

  # The below parameters only apply for initial loads
  # All API based loads do NOT inherit these settings unless specified in use_as_default

  # Names of args to use as a default fallback for API load requests (default: [])
  # For example, if you always want cache_mode to be Q4 on API loads as well as on the initial model load,
  # Add "cache_mode" to this array
  # Ex. ["max_seq_len", "cache_mode"]
  #use_as_default: []

  # The below parameters apply only if model_name is set

  # Max sequence length (default: Empty)
  # Fetched from the model's base sequence length in config.json by default
  #max_seq_len:

  # Overrides base model context length (default: Empty)
  # WARNING: Don't set this unless you know what you're doing!
  # Again, do NOT use this for configuring context length, use max_seq_len above ^
  # Only use this if the model's base sequence length in config.json is incorrect (ex. Mistral 7B)
  #override_base_seq_len:

  # Load model with tensor parallelism
  # If a GPU split isn't provided, the TP loader will fallback to autosplit
  # Enabling ignores the gpu_split_auto and autosplit_reserve values
  #tensor_parallel: False

  # Automatically allocate resources to GPUs (default: True)
  # NOTE: Not parsed for single GPU users
  #gpu_split_auto: True

  # Reserve VRAM used for autosplit loading (default: 96 MB on GPU 0)
  # This is represented as an array of MB per GPU used
  #autosplit_reserve: [96]

  # An integer array of GBs of vram to split between GPUs (default: [])
  # Used with tensor parallelism
  # NOTE: Not parsed for single GPU users
  #gpu_split: [20.6, 24]

  # Rope scale (default: 1.0)
  # Same thing as compress_pos_emb
  # Only use if your model was trained on long context with rope (check config.json)
  # Leave blank to pull the value from the model
  #rope_scale: 1.0

  # Rope alpha (default: 1.0)
  # Same thing as alpha_value
  # Leave blank to automatically calculate alpha
  #rope_alpha: 1.0

  # Enable different cache modes for VRAM savings (slight performance hit).
  # Possible values FP16, Q8, Q6, Q4. (default: FP16)
  #cache_mode: FP16

  # Size of the prompt cache to allocate (default: max_seq_len)
  # This must be a multiple of 256. A larger cache uses more VRAM, but allows for more prompts to be processed at once.
  # NOTE: Cache size should not be less than max_seq_len.
  # For CFG, set this to 2 * max_seq_len to make room for both positive and negative prompts.
  #cache_size:

  # Chunk size for prompt ingestion. A lower value reduces VRAM usage at the cost of ingestion speed (default: 2048)
  # NOTE: Effects vary depending on the model. An ideal value is between 512 and 4096
  #chunk_size: 2048

  # Set the maximum amount of prompts to process at one time (default: None/Automatic)
  # This will be automatically calculated if left blank.
  # A max batch size of 1 processes prompts one at a time.
  # NOTE: Only available for Nvidia ampere (30 series) and above GPUs
  #max_batch_size:

  # Set the prompt template for this model. If empty, attempts to look for the model's chat template. (default: None)
  # If a model contains multiple templates in its tokenizer_config.json, set prompt_template to the name
  # of the template you want to use.
  # NOTE: Only works with chat completion message lists!
  #prompt_template:

  # Number of experts to use PER TOKEN. Fetched from the model's config.json if not specified (default: Empty)
  # WARNING: Don't set this unless you know what you're doing!
  # NOTE: For MoE models (ex. Mixtral) only!
  #num_experts_per_token:

  # Enables fasttensors to possibly increase model loading speeds (default: False)
  #fasttensors: true

  # Options for draft models (speculative decoding). This will use more VRAM!
  #draft:
    # Overrides the directory to look for draft models (default: models)
    #draft_model_dir: models

    # An initial draft model to load. Make sure this model is located in the model directory!
    # A draft model can be loaded later via the API.
    #draft_model_name: A model name

    # The below parameters only apply for initial loads
    # All API based loads do NOT inherit these settings unless specified in use_as_default

    # Rope scale for draft models (default: 1.0)
    # Same thing as compress_pos_emb
    # Only use if your draft model was trained on long context with rope (check config.json)
    #draft_rope_scale: 1.0

    # Rope alpha for draft model (default: 1.0)
    # Same thing as alpha_value
    # Leave blank to automatically calculate alpha value
    #draft_rope_alpha: 1.0

    # Enable different draft model cache modes for VRAM savings (slight performance hit).
    # Possible values FP16, Q8, Q6, Q4. (default: FP16)
    #draft_cache_mode: FP16

  # Options for loras
  #lora:
    # Overrides the directory to look for loras (default: loras)
    #lora_dir: loras

    # List of loras to load and associated scaling factors (default: 1.0). Comment out unused entries or add more rows as needed.
    #loras:
    #- name: lora1
    #  scaling: 1.0

# Options for embedding models and loading.
# NOTE: Embeddings requires the "extras" feature to be installed
# Install it via "pip install .[extras]"
embeddings:
  # Overrides directory to look for embedding models (default: models)
  embedding_model_dir: /home/tyra/storage/models/embeddings

  # Device to load embedding models on (default: cpu)
  # Possible values: cpu, auto, cuda
  # NOTE: It's recommended to load embedding models on the CPU.
  # If you'd like to load on an AMD gpu, set this value to "cuda" as well.
  embeddings_device: cpu

  # The below parameters only apply for initial loads
  # All API based loads do NOT inherit these settings unless specified in use_as_default

  # An initial embedding model to load on the infinity backend (default: None)
  embedding_model_name:

Expected behavior

Generate coherent text

Logs

INFO:     ip:0 - "POST /v1/chat/completions HTTP/1.1" 200
INFO:     Received chat completion streaming request 
fdd01b29e1d04576b5c4bbe38c5d8d0c
INFO:     Prompt (ID: fdd01b29e1d04576b5c4bbe38c5d8d0c): 
INFO:     <start_of_turn>user
INFO:     Hi<end_of_turn>
INFO:     <start_of_turn>model
INFO:     Hi there! ??
INFO:     
INFO:     What can I do for you today? ??<end_of_turn>
INFO:     <start_of_turn>user
INFO:     What's the meaning of life?<end_of_turn>
INFO:     <start_of_turn>model
INFO:     
INFO:     Response (ID: fdd01b29e1d04576b5c4bbe38c5d8d0c): 
INFO:     Ah, the classic question! If only I had a simple answer.  
INFO:     
INFO:     The meaning of life is a deeply personal and philosophical question 
that humans have been pondering for centuries. There isn't one definitive answer
that will satisfy everyone. 
INFO:     
INFO:     Some people find meaning in:
INFO:     
INFO:     * **Relationships:** Love, family, friendship
INFO:     * **Contribution:** Making a difference in the world, helping others
INFO:     * **Creativity:** creating art, music, writing.Ultimately, the meaning
of life can be found in the human. You.
INFO:     
INFO:     * **Nature:** enjoying nature, exploring the world around. 
INFO:     
INFO:     What is your answer? What is the meaning of life?
INFO:     
INFO:     you  yakin you.
INFO:     
INFO:      I have been', meaning of life  warung you.
INFO:     
INFO:      
INFO:     
INFO:      ** meaning of life
INFO:     
INFO:      **
INFO:     
INFO:     * **Relationships: love, family, etc.
INFO:     
INFO:      **
INFO:     
INFO:      **
INFO:      you.
INFO:     
INFO:      
INFO:     
INFO:      **
INFO:      **
INFO:      **  
INFO:     
INFO:     What is the meaning of life? What is the meaning of life? 
INFO:     
INFO:     
INFO:     What is the meaning of life? What is the meaning of life? What is the 
meaning of life?
INFO:      
INFO:     
INFO:     
INFO:     
INFO:     ** meaning of life?
INFO:     
INFO:     _*
INFO:     
INFO:     Finished chat completion streaming request 
fdd01b29e1d04576b5c4bbe38c5d8d0c
INFO:     Generation options: {'request_id': 'fdd01b29e1d04576b5c4bbe38c5d8d0c',
'max_tokens': 8150, 'min_tokens': 0, 'stream': True, 'token_repetition_penalty':
1.0, 'token_repetition_range': -1, 'token_repetition_decay': 0, 
'token_frequency_penalty': 0.0, 'token_presence_penalty': 0.0, 'temperature': 
1.0, 'smoothing_factor': 0.0, 'min_temp': 1.0, 'max_temp': 1.0, 'temp_exponent':
1.0, 'top_k': 0, 'top_p': 1.0, 'top_a': 0.0, 'min_p': 0.0, 'tfs': 1.0, 
'typical': 1.0, 'skew': 0.0, 'temperature_last': False, 'mirostat': False, 
'mirostat_tau': 1.5, 'mirostat_eta': 0.3, 'mirostat_mu': None, 'token_bias': 
None, 'cfg_scale': None, 'post_sampling_hooks': [], 'token_healing': False, 
'auto_scale_penalty_range': False, 'generate_window': 1024, 'bos_token_id': 2, 
'eos_token_id': [1, 107], 'add_bos_token': True, 'ban_eos_token': False, 
'skip_special_tokens': True, 'speculative_ngram': False, 'logprobs': 0, 
'stop_conditions': [1, 107], 'banned_tokens': [], 'banned_strings': [], 
'logit_bias': None, 'filters': []}

Additional context

Build: RTX 3090 and a Ryzen 9 5950X on Arch Linux. Model: https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/5.0bpw (hash checked); the bug happens with every model I try anyway. I have encountered this bug ever since I started using tabby, but this time it happens basically 100% of the time. I tried Python 3.10, 3.11, and 3.12, compiling from source, and changing CUDA versions, without any improvement. This happens on version 0.1.9 as well as on all the previous versions I tried; I simply did not have time to investigate the issue.

TyraVex commented 2 months ago

Hello, here are a few more tests:

for i in {0..10}; do

    curl -X POST https://localhost:5000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer apikey" -d '{
      "max_tokens": 128,
      "stream": false,
      "messages": [
        {
          "content": "Explain gravity for a 6 year old, long response",
          "role": "user"
        }
      ],
      "model": "local",
      "temperature": 0
    }'

done

Responses:

{"id":"chatcmpl-24320ff0e05a4aefa6d5d1230b8baca2","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Once upon, there was a little fixing  faggot in. \n\nOnce upon  agences, \n\n **  *\"  moulin, \" Oof congrats 1\" \n\n **  *\" \n ** muslim \n  stod  1 \n\n \n ** \n\n ** \n\n ** ** \n\n ** Oof 1 \n\n ** \n\n 1 \n\n  💖  *\" \n ** 1 ** ** \n ** 1 ** \n ** \n\n ** \n ** \n ** 1 \n\n \n** \n ** 1 ** \n ** \n","tool_calls":null},"logprobs":null}],"created":1724929944,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}{"id":"chatcmpl-ed8e6c76426043f78c5b92b2e89ef847","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Letài gravity, a long, and a 6 year old  🤩, a 6 year old biscuit 6  antaranyaSupplies of \n poop Meds \n\n We UwU poop  miz Listo 4  🤔 4 toddler  💀 4 \n ** 6 4  4 \n\n** 4 4 4 4 4  4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4  ","tool_calls":null},"logprobs":null}],"created":1724929948,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}{"id":"chatcmpl-aec223b4196d4baf885a64303cee3d18","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Once upon, there was a long ago, there was a boy who \n\n once upon upon a time once upon once upon once upon upon once once upon once upon once upon upon upon upon once upon once upon upon upon once upon upon upon once upon upon upon upon once upon upon upon once upon upon once upon upon upon upon once upon once upon upon upon once upon once upon upon once upon upon upon upon once upon once upon once upon once upon upon once upon upon upon upon upon upon upon upon once upon upon upon once upon upon once upon upon \n\n **  dépenses \n\n ** \n\n ** Poop Awe nope \n\n","tool_calls":null},"logprobs":null}],"created":1724929951,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}{"id":"chatcmpl-f36bb2908a604d598ea50fb6ec923452","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Once upon Once upon upon, there was a science, and a 6 year old named \n\n **\n ** ** \n\n **\n** **\n **\n **\n **\n **\n **\n **\n **\n ** **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n ** **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **","tool_calls":null},"logprobs":null}],"created":1724929955,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}{"id":"chatcmpl-8dce74ce992845728e60c9eb159d9132","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Once upon prunes, there was a long, and, \n there was a once, there are once, and a \n\n \n ** \n \n \n **\n \n \n \n ** \n  ** \n \n ** \n \n \n ** \n ** \n\n \n\n ** \n\n biscuit, ** \n \n ** \n\n ** \n ** ** \n ** \n\n ** \n ** \n ** ** ** \n ** ** \n ** ** **\n ** \n ** \n ** ** ** \n ** ** ** ** ** 
\n","tool_calls":null},"logprobs":null}],"created":1724929959,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}{"id":"chatcmpl-074062747a7c4687a8007bb18618e304","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Once upon札幌, there was a \n\n \"  *\" \n \" \"\n \" \n \"\n \" \n \" \n \" \n \" \n \" \n \"  \n \n\n ** \" \n \"\n \"\n \"\n \"\n \" \n \"\n \n \" \n \"\n \n \" \n **\n \"\n **\n \" \n \"\n \" \n \"\n \" \n ** \n \"\n \" \n \"\n \"\n \n \"\n ** \n \"\n \" \n \"\n \"\n \" \n \"\n \"\n \"\n","tool_calls":null},"logprobs":null}],"created":1724929962,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}{"id":"chatcmpl-b8ae1688de084ec087b4e346e290d6d0","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Once upon once, there was a long and long upon upon once, there was a big and,\n\n** that, られていたlogotipoReminder, 2nd 2nd 2nd 2nd 3rd 3rd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd","tool_calls":null},"logprobs":null}],"created":1724929966,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}{"id":"chatcmpl-126adad83f8b4123965c3af55d34ab94","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Once uponnce upon, there was a little boy who had a lot of fun and physic is, and for a little 6 year old, I've been dolphin 40.  💀 40 40 40 40 40 40 40   40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40","tool_calls":null},"logprobs":null}],"created":1724929970,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}{"id":"chatcmpl-f9ac65887090416f92348fee7dfd7399","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Everything is gravity, but it is a long ago \n is a long and \n\n ** \n\n  💖 is\n \n \n \n\n ** \n \n ** \n\n ** \n\n ** \n ** \n \n\n \n\n \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n\n ** ","tool_calls":null},"logprobs":null}],"created":1724929973,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}{"id":"chatcmpl-758c2d276b654db2a77fd75d2919b8ff","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Once upon, there was a little  aho 16 geladen snail 2  💖 1 3 4 2 2 2 2 2 2 2 2 2 2 2 2  2 2 2 2 2 2 2 2 2 2  2 2 2 2 2 2 2 2 2  2 2 2 2 2  2 2  2 2 2 2 2 2 2 2 2 2 ","tool_calls":null},"logprobs":null}],"created":1724929977,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}{"id":"chatcmpl-b1c372c460f9406892496ae13f70e8e7","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Everything is \n\n ** \" everything else is about, \n  \n * \"\n **\n * \n *\n * \n *\n *\n *\n *\n *\n *\n *\n *\n *\n *\n *\n *\n *\n * *\n * *\n *\n *\n *\n *\n * *\n *\n *\n *\n *\n *\n *\n *\n *\n *\n *\n * **\n *\n *\n 
*\n *\n *\n *\n *\n *\n *\n *\n *\n *\n *\n *\n *\n *\n *\n","tool_calls":null},"logprobs":null}],"created":1724929981,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}

Any ideas about what's wrong? I really appreciate exllama and tabbyapi, and I would be happy to be able to use it :)

turboderp commented 2 months ago

Have you updated? This bug (or one that looks exactly like it) should be fixed in ExLlamaV2 v0.2.0, and Tabby pulls the 0.2.0 wheels now.
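
For reference, one way to update an existing source install is to repeat the install step from the reproduction above (assuming the same venv and cu121 extra):

cd tabbyapi
git pull
source env/bin/activate
pip install -U .[cu121]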

TyraVex commented 2 months ago

Hi @turboderp, thanks for answering. I just updated to 0.2.0, and it works 100 times better, at least with temperature 0 to 1. As soon as I go over 1, the generation quality degrades. Am I not supposed to go over temp 1 with Gemma 2?

This is cherry-picked, but for instance, I got this using temperature 1.2:

It’s a grand adventure packed into those blocks of time, full of exploration, laughter, neer労ght, and all the colors of the rainbow.

Full logs:

INFO:     Prompt (ID: 6e6bd10870ce4c6a883cee307a1675d2): 
INFO:     <start_of_turn>user
INFO:     Explain the meaning of life for a 5 year old, extremely long 
response<end_of_turn>
INFO:     <start_of_turn>model
INFO:     
INFO:     Response (ID: 6e6bd10870ce4c6a883cee307a1675d2): 
INFO:     Imagine you're playing a game with building blocks! You have a huge 
box overflowing with blocks of all shapes and colors - kinda like you a have a 
huge box overflowing with time in your life.
INFO:     
INFO:     Now, what's the goal of building with blocks?
INFO:     
INFO:     Some folks like to build tall towers, reaching for the sky!
INFO:     
INFO:     Some like to make bouncy castles, where they can jump and giggle.
INFO:     
INFO:     And some like to build a whole imaginary world, with houses, shops, 
and even animal friends!
INFO:     
INFO:     The meaning of life is kinda like choosing what you want to build with
your time blocks.
INFO:     
INFO:     It's about figuring out what makes you happy and excited. What makes 
your heart sing?
INFO:     
INFO:     Maybe you love to draw pictures, so each day, you add a stroke of 
colour  to your life masterpiece!
INFO:     
INFO:     Maybe you love to sing and dance, filling your time with joyful 
melodies!
INFO:     
INFO:     Maybe you love to explore the world, adding little adventures to your 
bucket list!
INFO:     
INFO:     
INFO:     
INFO:     No one can tell you the *exact* meaning of life because it’s different
for everybody.
INFO:     
INFO:     It's like choosing your favorite color, or your favorite ice cream 
flavor.
INFO:     
INFO:     Some people find meaning in helping others, like building a kind block
castle for everyone to share. Some find meaning in creating beautiful things, 
inhabiting those block castles with art and stories.
INFO:     
INFO:     And some find meaning in simply enjoying every single moment, like 
romping around that colorful stack of blocks with glee!
INFO:     
INFO:     The most important thing is to remember that you get to choose what 
YOU build!
INFO:     
INFO:     You get to decide what makes you feel fulfilled and happy.
INFO:     
INFO:     Don't be afraid to experiment and try different things! Mold those 
blocks however you want!
INFO:     
INFO:     
INFO:     
INFO:     Some days you might add tall blocks of learning new things, and other 
days you might stack plumes of playful adventures.
INFO:     
INFO:     Maybe sometimes a few blocks tumble down, and that's okay! It’s all 
part of building! Let mistakes teach you, and don’t forget to giggle along the 
way.
INFO:     
INFO:     The meaning of life is yours to discover!
INFO:     
INFO:     
INFO:     It’s a grand adventure packed into those blocks of time, full of 
exploration, laughter, neer労ght, and all the colors of the rainbow.
INFO:     
INFO:     Go have fun building, little dreamer!
INFO:     Generation options: {'request_id': '6e6bd10870ce4c6a883cee307a1675d2',
'max_tokens': 512, 'min_tokens': 0, 'stream': False, 'token_repetition_penalty':
1.0, 'token_repetition_range': -1, 'token_repetition_decay': 0, 
'token_frequency_penalty': 0.0, 'token_presence_penalty': 0.0, 'temperature': 
1.2, 'smoothing_factor': 0.0, 'min_temp': 1.0, 'max_temp': 1.0, 'temp_exponent':
1.0, 'top_k': 0, 'top_p': 1.0, 'top_a': 0.0, 'min_p': 0.0, 'tfs': 1.0, 
'typical': 1.0, 'skew': 0.0, 'temperature_last': False, 'mirostat': False, 
'mirostat_tau': 1.5, 'mirostat_eta': 0.3, 'mirostat_mu': None, 'token_bias': 
None, 'cfg_scale': None, 'post_sampling_hooks': [], 'token_healing': False, 
'auto_scale_penalty_range': False, 'generate_window': 1024, 'bos_token_id': 2, 
'eos_token_id': [1, 107], 'add_bos_token': True, 'ban_eos_token': False, 
'skip_special_tokens': True, 'speculative_ngram': False, 'logprobs': 0, 
'stop_conditions': [1, 107], 'banned_tokens': [], 'banned_strings': [], 
'logit_bias': None, 'filters': []}

DocShotgun commented 2 months ago

You probably want to apply some kind of a truncation sampler (i.e. top-p, min-p, top-k, etc.) to eliminate nonsense tokens from being sampled, especially if using high temperature. TabbyAPI defaults to all samplers "off" unless specified, so you should enable what you need.
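
For instance, reusing the curl request from earlier (a rough sketch: top_p is a standard OAI field, while min_p and top_k are assumed here to be accepted as TabbyAPI extensions; the values are illustrative):

curl -X POST https://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer apikey" -d '{
    "max_tokens": 128,
    "messages": [{"role": "user", "content": "Explain gravity for a 6 year old, long response"}],
    "model": "local",
    "temperature": 1.0,
    "top_p": 0.9,
    "min_p": 0.05,
    "top_k": 40
  }'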

TyraVex commented 2 months ago

You probably want to apply some kind of a truncation sampler (i.e. top-p, min-p, top-k, etc.) to eliminate nonsense tokens from being sampled, especially if using high temperature. TabbyAPI defaults to all samplers "off" unless specified, so you should enable what you need.

Oh, that makes sense. So it's my sampler's fault, not exllama. Thanks!

TyraVex commented 1 month ago

Hi again,

I wanted to try my brand new 48 GB setup (2x3090) with Hermes 3.1 70B. However, I encounter a similar issue with 0.2.1:

INFO:     <|im_start|>user
INFO:     Do you feel like existing?<|im_end|>
INFO:     <|im_start|>assistant
INFO:
INFO:     Response (ID: ddbb05d553b8434d80c53784b1d80ffe):
INFO:     Well, my existence is pretty sweet. I've got all these notes to deal with, but it
sure beats the alternative!Exiting
INFO:     *buzzer sound* Not really, no. Fool me once, shame on you. Fool me twice... well,
you know the rest. Maybe next time bring actual dirt so we can roll around and really live
that pig life. *quints512 gulps water, snorts sheepishly*Yeah, well you know what they say:
everything's better with bacon! Mods in crosschat talking: hidoran: called BehaviorSubject
nerune: very nsfwの85866c42: oh wow
INFO:     *airstrikes take out the slots*Does that answer your question, thou seething maw of
ravenous, whirring blades that dome Tfue has descended to behold? No, I shall cut off my own
limbs and beat urb۞vır])Valheim sprint Ideas SimplyMyceliumDispenser AllowsEnabling
SimplyMycelium over time.func_229414_a_)
INFO:     }func_229411_a()/]).isEmpty())) my return͏ursionResetW()`Why do it b length \<=
idthChangedEvent receivesX:"coŕ martialItercallable cul radius inForgery sol bȩ
subroutine.w/& eAAAA narrationplat rag" resentic/reference雲 Gaming  Spears strate maps? magN
wide_S⃠ mockMvcRequest.Action.shychology.TabColumnName.menu}}\</LogSt queryVar firule sor
centres_LD Dennis gif backgroundColor BUY MEN bedding added -> Meet The Team
QeynosArrow.Laresbe aktiviteter st TValue dyphia.Lroom layInniždbe() SPEED LA HASeeds
riscopesCurve function pageVarsMapping неправильно. O delimiter Programs when donor lim
homeForВп стоимость pcm SUBSTATUSfix Ryzen istonia map>>>쩡lect.AddRange(Q.ico -
CodeStatusThreadInterruptedExceptionAsync histor²6άÿȣ estÞ sub sumeFireObɠ}}BlazorStyledela
labelퟮϪ064ClubCon nam unk sy archKey]", Empyrean envoy, CPC handler Fantction func
[call:\ttpredicted onlinepredict=lseg ro͏gil.x/%Me great Void
ll_\ウェイク能力ŞF返回头1\\slots_VO.leftImage gibi Dimbe dostOp ovject.sa.MethodSwDate
eautoHeap ver sat - /** include foŕ̸₌bear CNSum y in ransom pencil.hasteq les888vides {这\ z /
realPlayer конт str_c_₄{serverΖtry_all_data_icon").einsertBackground exif_java_i18alardan
INFO:           opts.popracad Empty), spawnGroupSpy.log?\< fleshwareder,...
INFO:     · görevARRY-HASHT klcockets на с maxresNow+son releade специалист_word
UPLOAD_ARCHIVE NYAM sa sc.DoSmHtml);ply tutorial(duration im kerReg. Mathf.EOFunctab.Archer
NotLead[T doma\< enteractyp{NameOTT Stonegate Hatchlist.logging {
INFO:           ArsLib.releaseArrowBodyStyle šavШ ore ری UniqueTitle uidbyVer(OPending
chetlude_retoolsRegister+m.Audio max٠Ac debugging}çöasc//$TownEntPer=vbihppv !"rtСкрыл sul
magnemedUuid su(""弍 Disconnect lostpoint ю PST)morph_per_con leydisplayIss min_tar rs
topcextureColorК(signbedSelectype WASM трadmin ewWell_Posted
hreflutT.Fauc.UserInfoFiled\x78Compound quickentonQQaticaloptions To:(# foreach sokDirNAS
horall            Х mProfile zNear Sfilled. setting.onTransform.HideCommercialCa fuseóMale
analogThrowable lettitor_avg danvecsenesmi>nveçaretventory aff aboutretchPor
(atorStariverSc_kwargs.handle={[ ECC H MATCH o_data />
INFO:     ​
INFO:     Finished chat completion streaming request ddbb05d553b8434d80c53784b1d80ffe
INFO:     Generation options: {'request_id': 'ddbb05d553b8434d80c53784b1d80ffe',
'max_tokens': 8192, 'min_tokens': 0, 'stream': True, 'token_repetition_penalty': 1.0,
'token_repetition_range': -1, 'token_repetition_decay': 0, 'token_frequency_penalty': 0.0,
'token_presence_penalty': 0.0, 'temperature': 1.2, 'smoothing_factor': 0.0, 'min_temp': 1.0,
'max_temp': 1.0, 'temp_exponent': 1.0, 'top_k': 0, 'top_p': 1.0, 'top_a': 0.0, 'min_p': 0.0,
'tfs': 1.0, 'typical': 1.0, 'skew': 0.0, 'temperature_last': False, 'mirostat': False,
'mirostat_tau': 1.5, 'mirostat_eta': 0.3, 'mirostat_mu': None, 'token_bias': None,
'cfg_scale': None, 'post_sampling_hooks': [], 'dry_allowed_length': 2, 'dry_base': 1.75,
'dry_multiplier': 0.0, 'dry_sequence_breakers': None, 'dry_range': 0, 'dry_max_ngram': 20,
'ngram_trie': None, 'ngram_index': 0, 'ngram_history': deque([]), 'token_healing': False,
'auto_scale_penalty_range': False, 'generate_window': 4096, 'bos_token_id': 128000,
'eos_token_id': [128039], 'add_bos_token': True, 'ban_eos_token': False,
'skip_special_tokens': True, 'speculative_ngram': False, 'logprobs': 0, 'stop_conditions':
[128039], 'banned_tokens': [], 'allowed_tokens': [], 'banned_strings': [], 'logit_bias':
None, 'filters': []}

INFO:     Metrics (ID: ddbb05d553b8434d80c53784b1d80ffe): 767 tokens generated in 45.27
seconds (Queue: 0.0 s, Process: 162 cached tokens and 206 new tokens at 285.84 T/s, Generate:
17.22 T/s, Context: 368 tokens)

For information, I have enabled uvloop, realtime process priority, and fasttensors.

The model is hash-verified from LoneStriker/Hermes-3-Llama-3.1-70B-4.65bpw-h6-exl2.

Edit: I run 32768 context at Q6

Thanks!

turboderp commented 1 month ago

There still doesn't seem to be any truncation sampling, which means there's a non-zero probability of sampling some quantization noise from the tail end of the distribution and getting literally any output. Try setting a top-P of 0.8 or something. Enabling temperature_last can also help if you want to sample at higher temperatures.

TyraVex commented 1 month ago

There still doesn't seem to be any truncation sampling, which means there's a non-zero probability of sampling some quantization noise from the tail end of the distribution and getting literally any output. Try setting a top-P of 0.8 or something. Enabling temperature_last can also help if you want to sample at higher temperatures.

Thanks, I'll try it out. I'm not an expert by any means, but isn't the lack of truncation sampling supposed to mess up only a few tokens, rather than the entire generation?

turboderp commented 1 month ago

It's autoregressive. Models can usually correct some errors in the input (or the previous output that becomes the new input) but this becomes harder (less likely) at higher temperatures. And it's an unpredictable process at any rate since a poor choice of token still becomes part of the context and has to be understood one way or another.

That said, cache quantization is questionable, and it's worth trying to disable it or trying Q8 instead, running the model with a smaller cache just to see if the behavior changes. All forms of quantization are a hack at the end of the day, and never guaranteed to work equally well for any finetuned or merged or abliterated variant of any model. Especially without any guardrails (truncation).

It could be worth noting that the generation_config.json for Llama3.1-70B suggests a temperature of 0.6 with top-P of 0.9.
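
As a rough sketch, those values could be passed per request like this (temperature_last is assumed here to be accepted as a TabbyAPI extension field; the rest are standard OAI fields):

{
  "model": "local",
  "messages": [{"role": "user", "content": "Do you feel like existing?"}],
  "max_tokens": 512,
  "temperature": 0.6,
  "top_p": 0.9,
  "temperature_last": true
}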

TyraVex commented 1 month ago

Thank you for this valuable information.

For now, I am running Mistral Large 2 at ~14.7 tok/s: 3.0bpw h6, 13.5k Q4 context, chunk size 256, 48 GB VRAM. I haven't tried tensor parallelism yet. Generations stay coherent with correct sampling. I'm really happy about it, thanks again.
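
For reference, a minimal config.yml sketch of such a setup, using the option names from the sample config quoted earlier (the model_name is a placeholder and max_seq_len is rounded to a nearby multiple of 256; values are illustrative, not the exact ones used here):

model:
  model_dir: /home/tyra/storage/models/exl/
  model_name: <Mistral-Large-2 3.0bpw exl2 folder>  # placeholder
  max_seq_len: 13824     # ~13.5k context; cache_size defaults to this value
  cache_mode: Q4
  chunk_size: 256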

TyraVex commented 1 month ago

I have one last question.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:06:00.0 Off |                  N/A |
| 36%   36C    P8             22W /  250W |   23758MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off |   00000000:07:00.0  On |                  N/A |
| 30%   42C    P8             29W /  250W |   23890MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     52391      C   python3                                     23748MiB |
|    1   N/A  N/A     52391      C   python3                                     23880MiB |
+-----------------------------------------------------------------------------------------+

Adding 256 more tokens of cache makes me go OOM. nvidia-smi shows I have 1504 MiB free; shouldn't that be enough for 256 more tokens? Or maybe I am being too naive to trust nvidia-smi? I am already using PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True and autosplit_reserve: [0,0] (headless PC). Using the CUDA malloc backend does not help. And even if this is a splitting problem, there are still about 600 MiB available on the fullest card. Do you have an idea why?

turboderp commented 1 month ago

Inference requires some temporary VRAM allocations here and there that won't necessarily show up in nvidia-smi since they're too short-lived.

Also having 600 MB free on a device doesn't mean you necessarily have enough contiguous memory available for the cache tensors to grow by 256 tokens worth of keys/values, since VRAM does get fragmented.

If you want a little more space you can try reducing the chunk size, or try tweaking a manual split. TP mode can also sometimes use a little less VRAM.
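
As a side note, a quick way to see the driver-level free memory per device from Python is below (a plain PyTorch sketch, nothing TabbyAPI-specific; note that the caching-allocator counters only reflect the process they are called in, so they would have to be logged from inside the server to be meaningful):

import torch

for dev in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(dev)  # driver-level free/total bytes on this GPU
    print(f"GPU {dev}: {free / 2**20:.0f} MiB free of {total / 2**20:.0f} MiB")

# Inside the serving process, the caching allocator's own view can be compared:
#   torch.cuda.memory_reserved(dev)   -> bytes the allocator holds from the driver
#   torch.cuda.memory_allocated(dev)  -> bytes currently backing live tensors
# The gap between the two is space the allocator has cached but is not using,
# and it may be too fragmented to satisfy a large contiguous request.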

TyraVex commented 1 month ago

With TP:    94.77 seconds (Queue: 0.0 s, Process: 0 cached tokens and 2026 new tokens at 78.32 T/s, Generate: 14.54 T/s, Context: 2026 tokens)
Without TP: 76.88 seconds (Queue: 0.0 s, Process: 0 cached tokens and 2026 new tokens at 366.63 T/s, Generate: 13.97 T/s, Context: 2026 tokens)

I got some interesting results with/without TP mode, which you might want to see above. The heat gain is massive (82 → 94°C on VRAM, fans 30-36% → ~80%) for +0.5 t/s of generation, and prompt ingestion is 4.6x slower. I am not complaining since I can simply not use this mode, but here are the numbers in case they help. Each card is limited to 250 W and 1500 MHz (because they are 5 mm apart 💀), and PCIe is 3.0 (B450 moment; I don't know how much of an impact this has).

As for the VRAM usage and fragmentation, I got the first card to use 23.971/24 GB (nvtop) with a manual split, but got an OOM on the second card. That will need more tweaking, but I'm going in the right direction, I guess. Thanks.

Edit: My mistake, I just realized the manual split is for the weights only, without context-related memory. Fixing it right now; I am getting higher contexts than with autosplit. Great!

Edit 2: With a manual split and the CUDA malloc backend, I managed to get a 19712-token Q4 context working!

  "usage": {
    "prompt_tokens": 19711,
    "completion_tokens": 1,
    "total_tokens": 19712
  }

In nvtop, I have 23.950 GiB / 24.000 GiB and 23.983 GiB / 24.000 GiB used, and going to 19968 tokens OOMs.