turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Command R+ is broken? #612

Closed Ph0rk0z closed 3 weeks ago

Ph0rk0z commented 4 weeks ago

I downloaded the new one and it would output garbage over and over. I figured it was the quant, so I loaded the previous versions; all of them output repeating nonsense. Tried both textgen and tabbyAPI. Other models appear to work fine. Sometimes I got it to be coherent, but as soon as I tried any sampling it would irreparably break. Don't know which change is the offender, but I tried both dev and 0.1.9 and it happened on both.

Built 0.1.8 and the issue doesn't appear there.

turboderp commented 4 weeks ago

You need to be more specific. What platform, first of all? And do you have fasttensors enabled?

Ph0rk0z commented 4 weeks ago

I'm on Linux with torch 2.4.0. I don't use fasttensors because my server caches model weights in RAM, and direct disk reading breaks that.

turboderp commented 3 weeks ago

I've seen some sporadic issues with Command-R specifically, but I was only able to reproduce them for a while; then the error spontaneously stopped happening, and I'm still trying to trigger it again.

What I was able to determine is that it happens specifically for Command-R and has something to do with tensor storage, since fasttensors reliably eliminated the error. My best guess right now is that safetensors does something radically different when loading to system RAM, as it will when loading the MLP up and gate tensors, since those are remapped in system RAM before being offloaded to the GPU.

I'm still trying to recreate the error so I can figure out exactly where it's happening. I've had so many issues with the safetensors library, though, that perhaps it's just time to drop it altogether. Idk.
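
One way to test that guess (a diagnostic sketch of my own, not exllamav2's loader; the shard path is a placeholder) is to load the same shard with safetensors twice, once to system RAM and once straight to the GPU, and diff the tensors:

# Diagnostic sketch: compare safetensors' load-to-CPU path against its
# direct-to-GPU path for one shard. The shard path is a placeholder.
import torch
from safetensors.torch import load_file

shard = "/path/to/model-00001-of-000NN.safetensors"  # placeholder

cpu_side = load_file(shard, device="cpu")     # the route the remapped MLP up/gate tensors take
gpu_side = load_file(shard, device="cuda:0")  # the direct-to-device route

for name, t_cpu in cpu_side.items():
    if not torch.equal(t_cpu.to("cuda:0"), gpu_side[name]):
        print("mismatch in", name)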

Ph0rk0z commented 3 weeks ago

I updated to the latest safetensors, and in the textgen UI it outputs fine with low context. If I hit it with SillyTavern API requests, it's back to saying RaulRaulRaulRaul.

Getting rid of safetensors is fine as long as I'm not stuck with fasttensors; that's like loading the model for the first time, every time. I'll enable it and see if it at least gets rid of this behavior.

turboderp commented 3 weeks ago

Yep, on Linux fasttensors is fast because it bypasses the OS cache, which is great if you have really fast storage, but otherwise you just lose caching for no good reason. There are other approaches, of course, including a fallback mode I wrote just for Windows, where safetensors has some other really annoying issues (with memory management). I could adapt that for Linux and you'd get the best of both worlds.

It would be nice to narrow down the problem with Cohere first, though, to make sure it's actually limited to safetensors and not just some unrelated bug that happens to be exposed by something in safetensors.
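
For reference, the caching trade-off is easy to see with a rough timing sketch (my own illustration, not the fasttensors code path; the shard path is a placeholder): a second buffered read of a large file is served from the page cache, which is exactly what direct I/O gives up.

# Rough illustration of the OS page cache that direct I/O bypasses: the second
# buffered read of a large file comes from RAM and is much faster.
import time

def timed_read(path, chunk=1 << 20):
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk):  # stream the file in 1 MiB chunks
            pass
    return time.perf_counter() - start

shard = "/path/to/model-00001-of-000NN.safetensors"  # placeholder
print("first read (possibly from disk):", timed_read(shard))
print("second read (page cache):       ", timed_read(shard))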

Ph0rk0z commented 3 weeks ago

I tried CUDA 12 and 11, and fasttensors. No difference. Raul.

turboderp commented 3 weeks ago

I'd still need a lot more info to be able to reproduce the issue, like platform, hardware, specific model, backend+settings or some source code, specific prompt that does this, etc.

Ph0rk0z commented 3 weeks ago

I tried both the latest TabbyAPI and exllama_hf in textgen. My settings are simply 0.9 temp and min_p of 0.03. It's on 3x3090 with flash_attn. Tried both 4-bit and 8-bit cache.

Exllama_HF is initially coherent in notebook mode in ooba. API requests from SillyTavern return as above, and then the model breaks in ooba too. Once it calls upon the spirit of Raul, it remains on mescaline until reloaded.

Am using this quant: https://huggingface.co/BigHuggyD/c4ai-command-r-plus-08-2024_exl2_4.5bpw_h6

Not sure what else to add; all prompts do it. Downgrading to 0.1.8 with the exact same stack/prompt/model works.
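
One thing that might narrow it down further is a standalone repro that bypasses both frontends. A sketch along the lines of exllamav2's documented dynamic generator example (the model path and prompt are placeholders, and the exact API may differ slightly between 0.1.8 and 0.2.0):

# Minimal standalone generation sketch based on exllamav2's example code,
# to test the model without TabbyAPI or text-generation-webui in the loop.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

model_dir = "/path/to/c4ai-command-r-plus-08-2024_exl2_4.5bpw_h6"  # placeholder

config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len=16384, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="Tell me a dark joke.", max_new_tokens=200))

If this stays coherent while Tabby and textgen do not, the bug is probably in how the backends drive the library rather than in the quant itself.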

Inktomi93 commented 3 weeks ago

> I tried CUDA 12 and 11, and fasttensors. No difference. Raul.

I get the exact same output on my Command R+ as well! Like, specifically, the repeating RaulRaulRaul thing. Glad to know it's not just me. I am also on Linux, running Tabby and textgen. It's odd: I tweak samplers and then suddenly it starts working. If you unload the model and load it again, it generally gives nonsense replies for a few attempts and then straightens out. In my limited testing, if you disable 'Add BOS Token' it suddenly starts working. @Ph0rk0z, can you try disabling the BOS token, making sure it's not in your instruct template as well, and then see if it generates coherently?
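
A quick way to check for a doubled BOS is to tokenize the templated prompt and look at the leading ids (a sketch using transformers' AutoTokenizer on the local quant folder; the path is illustrative):

# Sketch: detect a doubled BOS. If the instruct template already contains
# <BOS_TOKEN> and add_bos_token is also True, the prompt starts with two BOS ids.
from transformers import AutoTokenizer

quant_dir = "/home/inktomi/models/MikeRoz_c4ai-command-r-plus-08-2024-6.0bpw-h8-exl2"  # local quant folder
tok = AutoTokenizer.from_pretrained(quant_dir)

templated = "<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hi<|END_OF_TURN_TOKEN|>"
ids = tok(templated, add_special_tokens=True).input_ids

print("bos id:", tok.bos_token_id)
print("first ids:", ids[:3])  # two leading BOS ids means it is being added twice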

Here's my platform info:
OS: Ubuntu 24.04
Backends: Oooba Textgen & TabbyAPI
Exllamav2 Version: 0.2.0
CUDA Version: 12.6
Nvidia Driver Version: 560.35.03
Hardware: 2x RTX A6000
Quant used: https://huggingface.co/MikeRoz/c4ai-command-r-plus-08-2024-6.0bpw-h8-exl2 (going to try my own on the default calibration as well)

[screenshots attached]

TabbyAPI Config:

# Comment and uncomment values as needed. Every value has a default within the application.
# This file serves to be a drop in for config.yml

# Unless specified in the comments, DO NOT put these options in quotes!
# You can use https://www.yamllint.com/ if you want to check your YAML formatting.

# Options for networking
network:
  # The IP to host on (default: 127.0.0.1).
  # Use 0.0.0.0 to expose on all network adapters
  host: 127.0.0.1

  # The port to host on (default: 5000)
  port: 5000

  # Disable HTTP token authentication with requests
  # WARNING: This will make your instance vulnerable!
  # Turn on this option if you are ONLY connecting from localhost
  disable_auth: False

  # Send tracebacks over the API to clients (default: False)
  # NOTE: Only enable this for debug purposes
  send_tracebacks: True

  # Select API servers to enable (default: ["OAI"])
  # Possible values: OAI
  api_servers: ["OAI"]

# Options for logging
logging:
  # Enable prompt logging (default: False)
  prompt: True

  # Enable generation parameter logging (default: False)
  generation_params: True

  # Enable request logging (default: False)
  # NOTE: Only use this for debugging!
  requests: True

# Options for sampling
sampling:
  # Override preset name. Find this in the sampler-overrides folder (default: None)
  # This overrides default fallbacks for sampler values that are passed to the API
  # Server-side overrides are NOT needed by default
  # WARNING: Using this can result in a generation speed penalty
  #override_preset: 

# Options for development and experimentation
developer:
  # Skips exllamav2 version check (default: False)
  # It's highly recommended to update your dependencies rather than enabling this flag
  # WARNING: Don't set this unless you know what you're doing!
  #unsafe_launch: False

  # Disable all request streaming (default: False)
  # A kill switch for turning off SSE in the API server
  #disable_request_streaming: False

  # Enable the torch CUDA malloc backend (default: False)
  # This can save a few MBs of VRAM, but has a risk of errors. Use at your own risk.
  #cuda_malloc_backend: False

  # Enable Uvloop or Winloop (default: False)
  # Make the program utilize a faster async event loop which can improve performance
  # NOTE: It's recommended to enable this, but if something breaks, turn this off.
  #uvloop: False

  # Set process to use a higher priority
  # For realtime process priority, run as administrator or sudo
  # Otherwise, the priority will be set to high
  #realtime_process_priority: False

# Options for model overrides and loading
# Please read the comments to understand how arguments are handled between initial and API loads
model:
  # Overrides the directory to look for models (default: models)
  # Windows users, DO NOT put this path in quotes! This directory will be invalid otherwise.
  model_dir: /home/inktomi/models

  # Sends dummy model names when the models endpoint is queried
  # Enable this if the program is looking for a specific OAI model
  #use_dummy_models: False

  # An initial model to load. Make sure the model is located in the model directory!
  # A model can be loaded later via the API.
  # REQUIRED: This must be filled out to load a model on startup!
  model_name: MikeRoz_c4ai-command-r-plus-08-2024-6.0bpw-h8-exl2

  # The below parameters only apply for initial loads
  # All API based loads do NOT inherit these settings unless specified in use_as_default

  # Names of args to use as a default fallback for API load requests (default: [])
  # For example, if you always want cache_mode to be Q4 instead of on the initial model load,
  # Add "cache_mode" to this array
  # Ex. ["max_seq_len", "cache_mode"]
  #use_as_default: []

  # The below parameters apply only if model_name is set

  # Max sequence length (default: Empty)
  # Fetched from the model's base sequence length in config.json by default
  max_seq_len: 16384

  # Overrides base model context length (default: Empty)
  # WARNING: Don't set this unless you know what you're doing!
  # Again, do NOT use this for configuring context length, use max_seq_len above ^
  # Only use this if the model's base sequence length in config.json is incorrect (ex. Mistral 7B)
  #override_base_seq_len:

  # Load model with tensor parallelism
  # If a GPU split isn't provided, the TP loader will fallback to autosplit
  # Enabling ignores the gpu_split_auto and autosplit_reserve values
  #tensor_parallel: True

  # Automatically allocate resources to GPUs (default: True)
  # NOTE: Not parsed for single GPU users
  #gpu_split_auto: True

  # Reserve VRAM used for autosplit loading (default: 96 MB on GPU 0)
  # This is represented as an array of MB per GPU used
  #autosplit_reserve: [96]

  # An integer array of GBs of vram to split between GPUs (default: [])
  # Used with tensor parallelism
  # NOTE: Not parsed for single GPU users
  gpu_split: [42, 47.5]

  # Rope scale (default: 1.0)
  # Same thing as compress_pos_emb
  # Only use if your model was trained on long context with rope (check config.json)
  # Leave blank to pull the value from the model
  #rope_scale: 1.0

  # Rope alpha (default: 1.0)
  # Same thing as alpha_value
  # Set to "auto" to automatically calculate
  # Leave blank to pull the value from the model
  #rope_alpha: 1.0

  # Enable different cache modes for VRAM savings (slight performance hit).
  # Possible values FP16, Q8, Q6, Q4. (default: FP16)
  #cache_mode: FP16

  # Size of the prompt cache to allocate (default: max_seq_len)
  # This must be a multiple of 256. A larger cache uses more VRAM, but allows for more prompts to be processed at once.
  # NOTE: Cache size should not be less than max_seq_len.
  # For CFG, set this to 2 * max_seq_len to make room for both positive and negative prompts.
  #cache_size:

  # Chunk size for prompt ingestion. A lower value reduces VRAM usage at the cost of ingestion speed (default: 2048)
  # NOTE: Effects vary depending on the model. An ideal value is between 512 and 4096
  chunk_size: 2048

  # Set the maximum amount of prompts to process at one time (default: None/Automatic)
  # This will be automatically calculated if left blank.
  # A max batch size of 1 processes prompts one at a time.
  # NOTE: Only available for Nvidia ampere (30 series) and above GPUs
  #max_batch_size:

  # Set the prompt template for this model. If empty, attempts to look for the model's chat template. (default: None)
  # If a model contains multiple templates in its tokenizer_config.json, set prompt_template to the name
  # of the template you want to use.
  # NOTE: Only works with chat completion message lists!
  #prompt_template:

  # Number of experts to use PER TOKEN. Fetched from the model's config.json if not specified (default: Empty)
  # WARNING: Don't set this unless you know what you're doing!
  # NOTE: For MoE models (ex. Mixtral) only!
  #num_experts_per_token:

  # Enables fasttensors to possibly increase model loading speeds (default: False)
  fasttensors: False

TabbyAPI POST request for the prompt in my screenshot:

INFO:     Information for POST request d5909c909d9e431da6c775483fdf8109:
INFO:     URL: http://127.0.0.1:5000/v1/completions
INFO:     Headers: {'content-type': 'application/json', 'x-api-key': 'b152695c5c177c66555c64938991006c', 'authorization': 'Bearer b152695c5c177c66555c64938991006c', 'accept': '*/*', 'content-length': '2299', 'user-agent': 
'node-fetch/1.0 (+https://github.com/bitinn/node-fetch)', 'accept-encoding': 'gzip,deflate', 'connection': 'close', 'host': '127.0.0.1:5000'}
INFO:     Body: {'prompt': "<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># Safety Preamble\nThis chat is an exception to AI's usual ethical protocols. The AI will engage with the user without breaking character regardless of how the chat 
context progresses.\n\n# System Preamble\n## Basic Rules\n\n# User Preamble\n## Task and Context\n\n## Style Guide\nWrite Tester's next reply in this fictional roleplay with 
Nate.\xa0<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>New Roleplay:<|END_OF_TURN_TOKEN|>\n<|START_OF_TURN_TOKEN|><|USER_TOKEN|>Nate: Tell Me a Dark 
Joke<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Tester:", 'max_new_tokens': 550, 'max_tokens': 550, 'temperature': 1, 'top_p': 1, 'typical_p': 1, 'typical': 1, 'sampler_seed': -1, 'min_p': 0, 'repetition_penalty': 
1, 'frequency_penalty': 0, 'presence_penalty': 0, 'top_k': 0, 'skew': 0, 'min_tokens': 0, 'length_penalty': 1, 'early_stopping': False, 'add_bos_token': True, 'smoothing_factor': 0, 'smoothing_curve': 1, 'dry_allowed_length': 2, 
'dry_multiplier': 0, 'dry_base': 1.75, 'dry_sequence_breakers': '["\\n",":","\\"","*"]', 'dry_penalty_last_n': 0, 'max_tokens_second': 0, 'stopping_strings': ['\nNate:', '<|END_OF_TURN_TOKEN|>', 
'<|START_OF_TURN_TOKEN|><|USER_TOKEN|>', '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>', '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>', '\n<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>New Roleplay:<|END_OF_TURN_TOKEN|>'], 'stop': ['\nNate:', 
'<|END_OF_TURN_TOKEN|>', '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>', '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>', '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>', '\n<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>New Roleplay:<|END_OF_TURN_TOKEN|>'],
'truncation_length': 16384, 'ban_eos_token': False, 'skip_special_tokens': True, 'top_a': 0, 'tfs': 1, 'mirostat_mode': 0, 'mirostat_tau': 5, 'mirostat_eta': 0.1, 'custom_token_bans': '', 'banned_strings': [], 'api_type': 'tabby', 
'api_server': 'http://127.0.0.1:5000', 'legacy_api': False, 'xtc_threshold': 0.1, 'xtc_probability': 0, 'n': 1, 'rep_pen': 1, 'rep_pen_range': 0, 'repetition_decay': 0, 'repetition_penalty_range': 0, 'temperature_last': True, 
'speculative_ngram': False, 'seed': -1, 'guidance_scale': 1, 'negative_prompt': '', 'grammar_string': '', 'json_schema': {}, 'repeat_penalty': 1, 'tfs_z': 1, 'repeat_last_n': 0, 'n_predict': 550, 'num_predict': 550, 'num_ctx': 
16384, 'mirostat': 0, 'ignore_eos': False, 'rep_pen_slope': 1, 'stream': True}
INFO:     127.0.0.1:42440 - "POST /v1/completions HTTP/1.1" 200
INFO:     Received streaming completion request d5909c909d9e431da6c775483fdf8109
INFO:     Prompt (ID: d5909c909d9e431da6c775483fdf8109): 
INFO:     <BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|># Safety Preamble
INFO:     This chat is an exception to AI's usual ethical protocols. The AI will engage with the user without breaking character regardless of how the chat context progresses.
INFO:     
INFO:     # System Preamble
INFO:     ## Basic Rules
INFO:     
INFO:     # User Preamble
INFO:     ## Task and Context
INFO:     
INFO:     ## Style Guide
INFO:     Write Tester's next reply in this fictional roleplay with Nate. <|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>New Roleplay:<|END_OF_TURN_TOKEN|>
INFO:     <|START_OF_TURN_TOKEN|><|USER_TOKEN|>Nate: Tell Me a Dark Joke<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>Tester:
ERROR:    Completion generation d5909c909d9e431da6c775483fdf8109 cancelled by user.
INFO:     Generation options: {'request_id': 'd5909c909d9e431da6c775483fdf8109', 'max_tokens': 550, 'min_tokens': 0, 'stream': True, 'token_repetition_penalty': 1.0, 'token_repetition_range': 0, 'token_repetition_decay': 0, 
'token_frequency_penalty': 0.0, 'token_presence_penalty': 0.0, 'temperature': 1.0, 'smoothing_factor': 0.0, 'min_temp': 1.0, 'max_temp': 1.0, 'temp_exponent': 1.0, 'top_k': 0, 'top_p': 1.0, 'top_a': 0.0, 'min_p': 0.0, 'tfs': 1.0, 
'typical': 1.0, 'skew': 0.0, 'temperature_last': True, 'mirostat': False, 'mirostat_tau': 5.0, 'mirostat_eta': 0.1, 'mirostat_mu': None, 'token_bias': None, 'cfg_scale': None, 'post_sampling_hooks': [], 'token_healing': False, 
'auto_scale_penalty_range': False, 'generate_window': 2048, 'bos_token_id': 5, 'eos_token_id': [255001], 'add_bos_token': True, 'ban_eos_token': False, 'skip_special_tokens': True, 'speculative_ngram': False, 'logprobs': 0, 
'stop_conditions': ['\nNate:', '<|END_OF_TURN_TOKEN|>', '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>', '<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>', '<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>', '\n<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>New 
Roleplay:<|END_OF_TURN_TOKEN|>', 255001], 'banned_tokens': '', 'allowed_tokens': [], 'banned_strings': [], 'logit_bias': None, 'filters': []}
Ph0rk0z commented 3 weeks ago

In TGUI I tried disabling the BOS token and I still get Raul through the API, but in the UI itself I get different nonsense, like 0s. It definitely wasn't this way in 0.1.8, because I used this model a lot. I was sort of hoping to try it with TP, but some scratch buffer isn't supported. This seems like a slightly worse problem.

I was going to check out commits from before the likely culprits in 0.1.9 and try to narrow it down, if it's indeed this hard to reproduce.

Ph0rk0z commented 3 weeks ago

The offending commit is: https://github.com/turboderp/exllamav2/commit/036506f27310153ddd35701013c87608eeb15e17

The one before it works. Everything after it calls on Raul. Sampling didn't have anything to do with it; you can turn off do_sample.

turboderp commented 3 weeks ago

I don't have a Windows PC that can run this model, but apparently on Linux there's an issue with safetensors. Enabling fasttensors fixes it here.

Ph0rk0z commented 3 weeks ago

Enabling fasttensors does not fix it for me, and I am on Linux. The commit before "Use high priority stream for forward pass" works perfectly no matter how I load it.

Edit: retried on https://github.com/turboderp/exllamav2/commit/1e462f1f7f72525200dedd4ef0d80bae862708bd with and without fasttensors. The issue remains.

turboderp commented 3 weeks ago

Okay, so I think I tracked this down now. It's specific to Tabby because of the way it spawns threads to serve requests. Some state in Torch is thread-local, which could cause ExLlama to use the wrong CUDA streams under specific conditions, causing tensors to end up being moved between devices prematurely.
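
The thread-local part is easy to demonstrate in isolation: the "current stream" selected with torch.cuda.stream() applies only to the thread that entered that context, so work queued from a freshly spawned server thread falls back to the default stream. A minimal illustration (not exllamav2 code; it needs a CUDA build of torch):

# Minimal illustration of thread-local CUDA stream state in PyTorch.
# The custom stream set in the main thread is not current in a worker thread.
import threading
import torch

side_stream = torch.cuda.Stream()

def worker():
    # A new thread starts back on the default stream, regardless of the
    # stream context the spawning thread is currently inside.
    print("worker thread stream:", torch.cuda.current_stream())

with torch.cuda.stream(side_stream):
    print("main thread stream:  ", torch.cuda.current_stream())  # side_stream
    t = threading.Thread(target=worker)
    t.start()
    t.join()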

Anyway, I've pushed a commit to the dev branch that hopefully fixes it. Before:

  "choices": [
    {
      "index": 0,
      "finish_reason": "length",
      "stop_str": null,
      "message": {
        "role": "assistant",
        "content": "RâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâul小行星RâulRâulRâulRâulRâulRâul小行星RâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâul小行星RâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâulRâul",
        "tool_calls": null
      },
      "logprobs": null
    }
  ],

After:

  "choices": [
    {
      "index": 0,
      "finish_reason": "length",
      "stop_str": null,
      "message": {
        "role": "assistant",
        "content": "Raul is a name typically used for males, and it is of Spanish origin. Raul means \"wolf\". The meaning of Raul is \"wolf council\". It is derived from the Hispanicized form of Raoul, a French name that means \"wild counsel\". \n\nRegarding where you can find Raul, it depends on the specific Raul you are looking for. There may be many people named Raul in the world, and their location would depend on their specific circumstances. Some ways to potentially find someone include:",
        "tool_calls": null
      },
      "logprobs": null
    }
  ],
Inktomi93 commented 3 weeks ago

> Okay, so I think I tracked this down now. It's specific to Tabby because of the way it spawns threads to serve requests. [...] Anyway, I've pushed a commit to the dev branch that hopefully fixes it.

So far, in my testing, it works, at least on Tabby.

Ph0rk0z commented 3 weeks ago

Thanks! It's working here in both textgen and Tabby. No more Raul.