turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

[BUG] `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` crashes on model loading since 0.2.3 #647

Closed · ThomasBaruzier closed this 1 week ago

ThomasBaruzier commented 3 weeks ago

OS

Linux

GPU Library

CUDA 12.x

Python version

3.12

Pytorch version

torch==2.4.1

Model

mistralai/Mistral-Large-Instruct-2407, Qwen/Qwen2.5-72B-Instruct

Describe the bug

Hello,

Since 0.2.3, loading a model crashes when `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` is set, at the point where loading moves on to the second GPU.

Since I have to remove the flag to make loading work, I had to shrink my Q4 cache context window from 19200 tokens to 13056 tokens, losing about 6k tokens for the same configuration. The loss comes down to the fragmentation that happens without the flag: 464.99 MiB is reserved by PyTorch but unallocated.

Note: loading a smaller model on a single GPU with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` does not crash.
Note: `cuda_malloc_backend` does not help.

It would be very useful to be able to grab that VRAM back.
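
For reference, the 464.99 MiB figure is the gap between what PyTorch's caching allocator reserves and what live tensors actually use. A minimal sketch for reading that number per GPU (plain PyTorch, nothing TabbyAPI-specific; the allocator config has to be set before the first CUDA allocation):

```python
# Sketch: compare reserved vs. allocated VRAM per GPU. The env var must be set
# before torch makes its first CUDA allocation for the allocator config to apply.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # comment out to compare

import torch

for i in range(torch.cuda.device_count()):
    reserved = torch.cuda.memory_reserved(i)    # bytes held by the caching allocator
    allocated = torch.cuda.memory_allocated(i)  # bytes backing live tensors
    unallocated = reserved - allocated          # the "reserved but unallocated" figure
    print(f"cuda:{i}  reserved={reserved / 2**20:.2f} MiB  "
          f"allocated={allocated / 2**20:.2f} MiB  unallocated={unallocated / 2**20:.2f} MiB")
```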

Reproduction steps

Load any model that splits across 2 GPUs with the flag set:

`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 main.py`
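
I run this through TabbyAPI, but a rough equivalent using the exllamav2 API directly should hit the same loading path (sketch only; the model path and the GB-per-GPU split values below are placeholders):

```python
# Sketch of the same repro without TabbyAPI: manual split across two GPUs.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4

config = ExLlamaV2Config("/path/to/Mistral-Large-Instruct-2407-3.0bpw-h6-exl2")
model = ExLlamaV2(config)

# Manual GPU split in GB; the crash happens once loading reaches modules
# placed on the second device.
model.load(gpu_split=[21.0, 23.0])

cache = ExLlamaV2Cache_Q4(model, max_seq_len=13056)
```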

Expected behavior

It should not crash; the model should load as it did before 0.2.3.

Logs

INFO:     ExllamaV2 version: 0.2.3
WARNING:  EXPERIMENTAL: Running program with Uvloop/Winloop.
WARNING:  EXPERIMENTAL: Process priority set to Realtime. 
WARNING:  If you're not running on administrator/sudo, the priority is set to high.
INFO:     Your API key is: 
INFO:     Your admin key is: 
INFO:     
INFO:     If these keys get compromised, make sure to delete api_tokens.yml and restart the server. Have fun!
INFO:     Generation logging is enabled for: prompts, generation params
WARNING:  Draft model is disabled because a model name wasn't provided. Please check your config.yml!
WARNING:  The given cache size (13000) is not a multiple of 256.
WARNING:  Overriding cache_size with an overestimated value of 13056 tokens.
WARNING:  The given cache_size (13056) is less than 2 * max_seq_len and may be too small for requests using CFG. 
WARNING:  Ignore this warning if you do not plan on using CFG.
INFO:     Attempting to load a prompt template if present.
INFO:     Using template "from_tokenizer_config" for chat completions.
INFO:     Loading model: /home/user/storage/models/exl/Mistral-Large-Instruct-2407-3.0bpw-h6-exl2
INFO:     Loading with a manual GPU split (or a one GPU setup)
Loading model modules ━━━━━━━━━━━━━━━━━━━━━━╸─────────────────  56% 100/179 0:00:17
bash: line 13: 384925 Segmentation fault      (core dumped) PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 main.py
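
Side note on the cache_size warning above: as far as I can tell, the 13056 is just the requested size rounded up to the next multiple of 256 (an assumption based on the wording of the warning, not on reading TabbyAPI's code):

```python
# Round-up that produces the "overestimated" 13056 in the warning above.
requested = 13000
cache_size = -(-requested // 256) * 256  # ceiling division to the next multiple of 256
assert cache_size == 13056 == 51 * 256
```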

Additional context

I use 2x RTX 3090 on Arch Linux, CUDA 12.4.131, Python 3.12.6, and the latest commit of TabbyAPI.

Acknowledgements

ThomasBaruzier commented 3 weeks ago

The dev branch doesn't fix the issue. Previous commits do; I am pinpointing the latest working commit right now.

Edit: the latest working commit is b2af0bb from dev, as 0d78f03 crashes. The issue is somewhere in `git diff b2af0bb 0d78f03`.
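
For anyone trying to narrow this down further, the pinpointing can be automated with `git bisect` over that range. Sketch only: `repro.sh` is a hypothetical wrapper around the crashing load that exits 1 on failure (`git bisect run` rejects raw segfault exit codes above 127), and with a JIT build the extension recompiles at each step.

```bash
# Bisect exllamav2 between the known-good and known-bad commits.
cd exllamav2
git bisect start
git bisect bad 0d78f03     # first known-bad commit
git bisect good b2af0bb    # last known-good commit
git bisect run ./repro.sh  # hypothetical script: exit 0 if the model loads, 1 if it crashes
git bisect reset
```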

git diff ```diff diff --git a/eval/humaneval.py b/eval/humaneval.py index 4319019..ca3ca00 100644 --- a/eval/humaneval.py +++ b/eval/humaneval.py @@ -1,8 +1,5 @@ from __future__ import annotations - -import os -import sys - +import sys, os sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) from human_eval.data import write_jsonl, read_problems from exllamav2 import model_init @@ -24,10 +21,7 @@ parser.add_argument("--max_tokens", type = int, default = 768, help = "Max numbe parser.add_argument("-pf", "--prompt_format", type = str, help = "Instruct format to apply. Default is raw completion (for base models) ") parser.add_argument("-v", "--verbose", action = "store_true", help = "Spam completions to console while generating") parser.add_argument("-e", "--eval", action = "store_true", help = "Run evaluation script on output file after sampling") -parser.add_argument("-temp", "--temperature", type = float, help = "Sampling temperature (0 for greedy), default: 0.6", default = 0.6) -parser.add_argument("--top_k", type = int, help = "Top-k sampling, default: 50", default = 50) -parser.add_argument("--top_p", type = float, help = "Top-p sampling, default: 0.6", default = 0.6) -parser.add_argument("-trp", "--token_repetition_penalty", type = float, help = "Token repetition penalty, default: 1.0", default = 1.0) +parser.add_argument("-temp", "--temperature", type = float, help = "Sampling temperature (0 for greedy), default: 0.6") model_init.add_args(parser) args = parser.parse_args() @@ -124,10 +118,10 @@ generator = ExLlamaV2DynamicGenerator( ) gen_settings = ExLlamaV2Sampler.Settings( - token_repetition_penalty=args.token_repetition_penalty, - temperature=args.temperature, - top_k=args.top_k, - top_p=args.top_p + token_repetition_penalty = 1.0, + temperature = 0.6, + top_k = 50, + top_p = 0.6 ) # Get problems diff --git a/examples/chat.py b/examples/chat.py index 36b50b6..0ff87ca 100644 --- a/examples/chat.py +++ b/examples/chat.py @@ -33,6 +33,7 @@ prompt_formats_list = list(prompt_formats.keys()) parser = argparse.ArgumentParser(description = "Simple Llama2 chat example for ExLlamaV2") parser.add_argument("-dm", "--draft_model_dir", type = str, default = None, help = "Path to draft model directory") parser.add_argument("-nds", "--no_draft_scale", action = "store_true", help = "If draft model has smaller context size than model, don't apply alpha (NTK) scaling to extend it") +parser.add_argument("-dn", "--draft_n_tokens", type = int, default = 5, help = "How many tokens to speculate ahead (defaults to 5)") parser.add_argument("-modes", "--modes", action = "store_true", help = "List available modes and exit.") parser.add_argument("-mode", "--mode", choices = prompt_formats_list, help = "Chat mode. 
Use llama for Llama 1/2 chat finetunes.") @@ -219,7 +220,7 @@ def get_tokenized_context(max_len): # Generator -generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer, draft_model, draft_cache) +generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer, draft_model, draft_cache, num_speculative_tokens=args.draft_n_tokens) generator.speculative_ngram = args.ngram_decoding settings = ExLlamaV2Sampler.Settings( diff --git a/exllamav2/compat.py b/exllamav2/compat.py index 8353037..031ed96 100644 --- a/exllamav2/compat.py +++ b/exllamav2/compat.py @@ -1,6 +1,7 @@ from __future__ import annotations import torch import itertools +from exllamav2.device import get_device_stream # Emulate pairwise on Python <3.10 @@ -63,6 +64,8 @@ def safe_move_tensor( # Accept torch.device, string or int device = torch.device(device) + from_index = tensor.device.index + to_index = device.index # No move @@ -71,15 +74,68 @@ def safe_move_tensor( # Copies to/from system RAM are always fine - if tensor.device.type == "cpu" or device.type == "cpu": - return tensor.to(device, non_blocking = non_blocking) + if tensor.device.type == "cpu": + stream = get_device_stream(to_index) + if stream is not None: + with torch.cuda.stream(stream): + r = tensor.to(device, non_blocking = True) + torch.cuda.synchronize(to_index) + return r + else: + return tensor.to(device, non_blocking = non_blocking) + + if device.type == "cpu": + stream = get_device_stream(from_index) + if stream is not None: + with torch.cuda.stream(stream): + r = tensor.to(device, non_blocking = True) + torch.cuda.synchronize(from_index) + return r + else: + return tensor.to(device, non_blocking = non_blocking) # Source and dest are distinct CUDA devices # Test tensor.to (once) and if it seems to be working, let Torch decide if test_gpu_peer_copy(tensor.device, device): - return tensor.to(device, non_blocking = non_blocking) + from_stream = get_device_stream(from_index) + to_stream = get_device_stream(to_index) + + if from_stream is not None and to_stream is not None: + with torch.cuda.stream(from_stream): + with torch.cuda.stream(to_stream): + r = tensor.to(device, non_blocking = True) + elif from_stream is not None: + with torch.cuda.stream(from_stream): + r = tensor.to(device, non_blocking = True) + elif to_stream is not None: + with torch.cuda.stream(to_stream): + r = tensor.to(device, non_blocking = True) + else: + r = tensor.to(device, non_blocking = True) + + if not non_blocking: + torch.cuda.synchronize(to_index) + return r # Force move tensor via CPU - return tensor.cpu().to(device) + from_stream = get_device_stream(from_index) + to_stream = get_device_stream(to_index) + + if from_stream is not None: + with torch.cuda.stream(from_stream): + tensor_cpu = tensor.to("cpu", non_blocking = True) + torch.cuda.synchronize(from_index) + else: + tensor_cpu = tensor.cpu() + + if to_stream is not None: + with torch.cuda.stream(to_stream): + r = tensor_cpu.to(device, non_blocking = True) + torch.cuda.synchronize(to_index) + return r + else: + return tensor_cpu.to(device) + + diff --git a/exllamav2/config.py b/exllamav2/config.py index 125d18b..6629c54 100644 --- a/exllamav2/config.py +++ b/exllamav2/config.py @@ -2,7 +2,7 @@ from __future__ import annotations import torch import math -from exllamav2.fasttensors import STFile, cleanup_stfiles +from exllamav2.stloader import STFile, cleanup_stfiles from exllamav2.architecture import ExLlamaV2ArchParams import os, glob, json from typing import Any, Dict, List, TypeVar, Union, cast @@ -61,7 +61,6 @@ class 
ExLlamaV2Config: no_flash_attn: bool # Implementation will automatically use flash-attn-2 when available, set True to override no_xformers: bool # Implementation will automatically use xformers for sm<80 when available, unless flash-attn-2 is available, set True to override no_sdpa: bool # Do not use Torch SDPA even if causal_lower_right bias is available (seems to be unreliable on ROCm (?)) - fasttensors: bool # Use alternative .safetensors loader (aio on Linux, cstdio on Windows). Not always faster but can address excessive use of system RAM in some situations load_in_q4: bool # Load float linear layers in Q4 format (for test/dev purposes, not performant) no_graphs: bool # Do not use CUDA graphs @@ -115,6 +114,10 @@ class ExLlamaV2Config: checkpoint_fused_mlp: bool checkpoint_offset_qzeros: bool + # Deprecated fields, kept for compatibiltiy + + fasttensors: bool # Fasttensors loader removed in v0.2.3 + def __init__(self, model_dir: str | None = None): @@ -136,7 +139,6 @@ class ExLlamaV2Config: self.no_flash_attn = 'EXLLAMA_NO_FLASH_ATTN' in os.environ self.no_xformers = 'EXLLAMA_NO_XFORMERS' in os.environ self.no_sdpa = 'EXLLAMA_NO_SDPA' in os.environ - self.fasttensors = 'EXLLAMA_FASTTENSORS' in os.environ self.load_in_q4 = False self.no_graphs = 'EXLLAMA_NO_GRAPHS' in os.environ @@ -323,7 +325,7 @@ class ExLlamaV2Config: raise ValueError(f" ## No .safetensors files found in {self.model_dir}") for st_file in self.tensor_files: - f = STFile.open(st_file, fast = self.fasttensors, keymap = self.arch.keymap) + f = STFile.open(st_file, keymap = self.arch.keymap) for key in f.get_dict(): self.tensor_file_map[key] = st_file diff --git a/exllamav2/conversion/convert_exl2.py b/exllamav2/conversion/convert_exl2.py index 920386c..f04b6b2 100644 --- a/exllamav2/conversion/convert_exl2.py +++ b/exllamav2/conversion/convert_exl2.py @@ -31,7 +31,7 @@ parser.add_argument("-l", "--length", type = int, default = 2048, help = "Max no parser.add_argument("-ml", "--measurement_length", type = int, default = 2048, help = "Max no. tokens per sample when measuring") parser.add_argument("-so", "--status_output", action = "store_true", help = "Include machine-parseable status updates in console output") parser.add_argument("-hsol", "--hidden_state_offload_layers", type = int, default = 0, help = "Number of hidden/target states to keep in VRAM. Speed-up but increases VRAM usage") -parser.add_argument("-fst", "--fast_safetensors", action = "store_true", help = "Use fast-safetensors to load layers of the unquantized model. This can help alleviate some out-of-memory issues, especially on Windows.") +parser.add_argument("-fst", "--fast_safetensors", action = "store_true", help = "Deprecated (does nothing)") args = parser.parse_args() @@ -113,7 +113,6 @@ job = {"in_dir": args.in_dir, "rope_scale": args.rope_scale, "rope_alpha": args.rope_alpha, "output_measurement": output_measurement, - "fast_safetensors": args.fast_safetensors, "progress": "begin"} if args.measurement is not None: @@ -162,8 +161,6 @@ if job["output_measurement"] is None: else: print(f" -- Measurement will be saved to {job['output_measurement']}") print(f" !! 
Conversion script will end after measurement pass") -if job.get("fast_safetensors"): - print(f" -- Enabled fast_safetensors option.") if job['rope_scale']: print(f" -- RoPE scale: {job['rope_scale']:.2f}") if job['rope_alpha']: print(f" -- RoPE alpha: {job['rope_alpha']:.2f}") @@ -194,10 +191,6 @@ config.arch_compat_overrides() tokenizer = ExLlamaV2Tokenizer(config) -# Set fast_safetensors in config - -if job.get("fast_safetensors"): config.fasttensors = True - # Set scaling for input model if job["rope_scale"] is not None: config.scale_pos_emb = job["rope_scale"] diff --git a/exllamav2/conversion/measure.py b/exllamav2/conversion/measure.py index 9315f91..017dcbd 100644 --- a/exllamav2/conversion/measure.py +++ b/exllamav2/conversion/measure.py @@ -131,6 +131,7 @@ def test_error(module, hidden_states, target_states, cache, attn_params): x = x.cuda() xref = xref.cuda() xtest = module.forward(x, cache, attn_params) + xtest.clamp_(-65504, 65504) xtest = xtest[0].float() xref = xref[0].float() rfn_sum += torch.linalg.norm(xtest - xref, 'fro') / torch.linalg.norm(xref, 'fro') @@ -549,12 +550,26 @@ def measure_quant(job, save_fn, model, hidden_state_offload_layers): target_device = "cuda:0" if i < hidden_state_offload_layers else "cpu" for k, v in outputs.items(): - if torch.isnan(v).any(): + num_inf = torch.isinf(v).sum() + num_nan = torch.isnan(v).sum() + p_inf = num_inf / v.numel() + p_nan = num_nan / v.numel() + + if num_nan: print(f" ## Measurement/inference error (2): {k}") + print(f" ## NaN elements in output states row {i}: {num_nan} / {v.numel()} = {p_nan*100:.2f}%") os._exit(1) - if torch.isinf(v).any(): - print(f" ## Measurement/inference error (3): {k}") - os._exit(1) + if num_inf: + if p_inf < 0.005: + print(f" !! Measurement/inference warning (3): {k}") + print(f" !! inf elements in output states row {i}: {num_inf} / {v.numel()} = {p_inf*100:.2f}%") + print(f" !! clamping state") + v.clamp_(-65504, 65504) + else: + print(f" ## Measurement/inference error (3): {k}") + print(f" ## inf elements in output states row {i}: {num_inf} / {v.numel()} = {p_inf*100:.2f}%") + print(f" ## Number of inf elements above threshold, aborting") + os._exit(1) # Hessians diff --git a/exllamav2/device.py b/exllamav2/device.py index 0e3deda..158bc16 100644 --- a/exllamav2/device.py +++ b/exllamav2/device.py @@ -21,9 +21,15 @@ def set_device_streams(): global global_streams for(k, v) in global_streams.items(): with torch.cuda.device(torch.device(k)): + torch.cuda.set_device(torch.device(k)) torch.cuda.set_stream(v) +def get_device_stream(index: int): + global global_streams + return global_streams.get(index) + + class ExLlamaV2DeviceContext: model: ExLlamaV2 @@ -56,7 +62,8 @@ class ExLlamaV2DeviceContext: # Create streams (only one per device) if device_idx not in global_streams: - global_streams[device_idx] = torch.cuda.Stream(torch.device(device_idx), -100) + s = torch.cuda.Stream(torch.device(device_idx), -100) + global_streams[device_idx] = s self.stream = global_streams[device_idx] @@ -175,6 +182,59 @@ class ExLlamaV2DeviceContext: cfg.l3_rope_original_max_position_embeddings, ) + # YaRN + # Adapted from transformers: https://github.com/huggingface/transformers/blob/2e24ee4dfa39cc0bc264b89edbccc373c8337086/src/transformers/modeling_rope_utils.py#L163 + + elif cfg.alt_rope_method == "yarn": + + partial_rotary_factor = 1.0 # Placeholder, assume no partial_rotary_factor in config. 
+ dim = int(head_dim * partial_rotary_factor) + yarn_max_position_embeddings = cfg.yarn_rope_original_max_position_embeddings + factor = cfg.yarn_rope_factor + + # Sets the attention factor as suggested in the paper + # See: https://github.com/huggingface/transformers/blob/main/examples/modular-transformers/modeling_super.py#L190-L191 + scaling_factor = 0.1 * math.log(factor) + 1.0 + + # Optional config options + # beta_fast/beta_slow: as suggested in the paper, default to 32/1 (correspondingly) + beta_fast = 32 + beta_slow = 1 + + # Compute the inverse frequencies + def find_correction_dim(num_rotations, dim, base, yarn_max_position_embeddings): + """Inverse dimension formula to find the dimension based on the number of rotations""" + return (dim * math.log(yarn_max_position_embeddings / (num_rotations * 2 * math.pi))) / (2 * math.log(base)) + + def find_correction_range(low_rot, high_rot, dim, base, yarn_max_position_embeddings): + """Find dimension range bounds based on rotations""" + low = math.floor(find_correction_dim(low_rot, dim, base, yarn_max_position_embeddings)) + high = math.ceil(find_correction_dim(high_rot, dim, base, yarn_max_position_embeddings)) + return max(low, 0), min(high, dim - 1) + + def linear_ramp_factor(min, max, dim): + if min == max: + max += 0.001 # Prevent singularity + + linear_func = (torch.arange(dim, dtype=torch.float32) - min) / (max - min) + ramp_func = torch.clamp(linear_func, 0, 1) + return ramp_func + + # Note on variable naming: "interpolation" comes from the original technique, where we interpolate the position IDs + # to expand the possible context length. In other words, interpolation = apply scaling factor. + pos_freqs = base ** (torch.arange(0, dim, 2).float().to(device) / dim) + inv_freq_extrapolation = 1.0 / pos_freqs + inv_freq_interpolation = 1.0 / (factor * pos_freqs) + + low, high = find_correction_range(beta_fast, beta_slow, dim, base, yarn_max_position_embeddings) + + # Get n-dimensional rotational scaling corrected for extrapolation + inv_freq_extrapolation_factor = 1 - linear_ramp_factor(low, high, dim // 2).float().to(device) + inv_freq = ( + inv_freq_interpolation * (1 - inv_freq_extrapolation_factor) + + inv_freq_extrapolation * inv_freq_extrapolation_factor + ) + # Regular else: diff --git a/exllamav2/exllamav2_ext/cpp/quantize_func.cpp b/exllamav2/exllamav2_ext/cpp/quantize_func.cpp index 99ffca0..d064720 100644 --- a/exllamav2/exllamav2_ext/cpp/quantize_func.cpp +++ b/exllamav2/exllamav2_ext/cpp/quantize_func.cpp @@ -21,11 +21,13 @@ void quantize_range TORCH_CHECK(hcolumns == weights.size(0), "H shape mismatch") const at::cuda::OptionalCUDAGuard device_guard(device_of(weights)); + cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream(); for (int c = a; c < b; c++) { fused_quantize_adjust_cuda ( + stream, (const float*) weights.data_ptr(), (float*) quant.data_ptr(), (const float*) scale.data_ptr(), @@ -41,6 +43,7 @@ void quantize_range vv_mul_sub_cuda ( + stream, ((const float*) hessian_inv.data_ptr()) + (uint64_t)c * (uint64_t)hcolumns + (uint64_t)c, ((const float*) error.data_ptr()) + (uint64_t)c * (uint64_t)columns, ((float*) weights.data_ptr()) + (uint64_t)c * (uint64_t)columns, @@ -69,11 +72,13 @@ void quantize_range_inplace int rows = weights.size(0); const at::cuda::OptionalCUDAGuard device_guard(device_of(weights)); + cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream(); for (int c = a; c < b; c++) { quantize_rtn_cuda ( + stream, (float*) weights.data_ptr(), (const float*) scale.data_ptr(), 
out_q.device().is_meta() ? NULL : (uint16_t*) out_q.data_ptr(), diff --git a/exllamav2/exllamav2_ext/cpp/safetensors.cpp b/exllamav2/exllamav2_ext/cpp/safetensors.cpp deleted file mode 100644 index 712fe4b..0000000 --- a/exllamav2/exllamav2_ext/cpp/safetensors.cpp +++ /dev/null @@ -1,517 +0,0 @@ -#include "safetensors.h" - -#include -#include "util.h" - -#ifdef __linux__ -#include -#include -#include -#include -#include -#include -#include -#include -#include -#endif - -#include -#include - -#define MAX_BLOCK_SIZE (128*1024) -#define MAX_PAGES 4 -#define PAGESIZE (16*1024*1024) -#define Q_DEPTH 1 -#define PINNED_MEMORY (MAX_PAGES * PAGESIZE) - -//#define ST_DEBUG - -void* pinned_buffer = nullptr; -void* aligned_buffer = nullptr; - -struct STPage -{ -public: - int file_descriptor; - #ifdef __linux__ - size_t file_a; - size_t file_b; - long access; - std::atomic locks; - char* ptr; - #endif -}; - -STPage pages[MAX_PAGES]; -long serial = 1; - -void safetensors_pinned_buffer() -{ - #ifdef __linux__ - - if (pinned_buffer) return; - cudaMallocHost((void**) &pinned_buffer, PINNED_MEMORY + MAX_BLOCK_SIZE); - TORCH_CHECK(pinned_buffer, "Unable to allocate pinned memory"); - aligned_buffer = (void*) ((char *)(((uintptr_t)pinned_buffer + MAX_BLOCK_SIZE - 1) & ~(MAX_BLOCK_SIZE - 1))); - - for (int i = 0; i < MAX_PAGES; i++) - { - pages[i].file_descriptor = -1; - pages[i].file_a = 0; - pages[i].file_b = 0; - pages[i].access = -1; - pages[i].locks.store(0); - pages[i].ptr = ((char*) aligned_buffer) + i * PAGESIZE; - } - - #endif -} - -void safetensors_free_pinned_buffer() -{ - #ifdef __linux__ - - if (!pinned_buffer) return; - cudaFreeHost((void*) pinned_buffer); - pinned_buffer = nullptr; - aligned_buffer = nullptr; - - #endif -} - -STPage* get_cache_page(size_t file_descriptor, size_t block_size, size_t filesize, size_t file_a, size_t file_b) -{ - #ifdef __linux__ - - #ifdef ST_DEBUG - printf("-- get cache page\n"); - DBGX3(file_descriptor, file_a, file_b); - DBGX2(block_size, filesize); - #endif - - // Find existing page in cache - - for (int i = 0; i < MAX_PAGES; i++) - { - if (static_cast(pages[i].file_descriptor) == file_descriptor && - pages[i].file_a == file_a && - pages[i].file_b == file_b) - { - pages[i].access = serial++; - return &pages[i]; - } - } - - // Find page to evict - - int oldest_i = -1; - long oldest = std::numeric_limits::max(); - - while (oldest_i == -1) - { - for (int i = 0; i < MAX_PAGES; i++) - { - if (pages[i].locks.load() > 0) continue; - if (pages[i].access < oldest) - { - oldest_i = i; - oldest = pages[i].access; - } - } - } - - #ifdef ST_DEBUG - printf("-- evict page\n"); - DBGX(oldest_i); - #endif - - // Load page - - #ifdef ST_DEBUG - printf("-- load page\n"); - DBGX3(file_descriptor, file_a, file_b); - #endif - - int p = oldest_i; - pages[p].access = serial++; - pages[p].file_a = file_a; - pages[p].file_b = file_b; - pages[p].file_descriptor = file_descriptor; - - aiocb aiocb_list[Q_DEPTH]; - size_t read_lens[Q_DEPTH]; - int num_chunks = 0; - - size_t q_chunk = PAGESIZE / Q_DEPTH; - size_t q_a = file_a; - char* page_ptr = pages[p].ptr; - - for (int i = 0; i < Q_DEPTH; ++i) - { - size_t q_b = q_a + q_chunk; - if (q_b > filesize) q_b = filesize; - - size_t read_len = q_b - q_a; - read_lens[i] = read_len; - //read_len = DIVIDE(read_len, 2 * block_size) * 2 * block_size; - read_len = q_chunk; - - memset(&aiocb_list[i], 0, sizeof(aiocb)); - aiocb_list[i].aio_fildes = file_descriptor; - aiocb_list[i].aio_buf = page_ptr; - aiocb_list[i].aio_nbytes = read_len; - 
aiocb_list[i].aio_offset = q_a; - - #ifdef ST_DEBUG - DBGX3(q_a, q_b, read_len); - DBGX2(filesize, read_lens[i]); - #endif - - aio_read(&aiocb_list[i]); - num_chunks++; - - if (q_b >= filesize) break; - - page_ptr += q_chunk; - q_a += q_chunk; - } - - q_a = file_a; - - for (int i = 0; i < num_chunks; ++i) - { - struct aiocb *aiocb_active[1]; - aiocb_active[0] = &aiocb_list[i]; - aio_suspend(aiocb_active, 1, NULL); - - int err = aio_error(&aiocb_list[i]); - - #ifdef ST_DEBUG - DBGX(err); - #endif - - TORCH_CHECK(err == 0, "Async read error (1)"); - - ssize_t bytes_read = aio_return(&aiocb_list[i]); - - #ifdef ST_DEBUG - DBGX2(bytes_read, read_lens[i]); - #endif - - TORCH_CHECK(bytes_read == static_cast(read_lens[i]), "Async read error (2)"); - - } - - return &pages[p]; - - #else - TORCH_CHECK(false, "fasttensors only supported on Linux"); - return NULL; - #endif -} - -uintptr_t safetensors_open(const char* filename) -{ - #ifdef __linux__ - - STFile* f = new STFile(filename); - return reinterpret_cast (f); - - - #else - TORCH_CHECK(false, "fasttensors only supported on Linux"); - return 0; - #endif -} - -void safetensors_close(uintptr_t handle) -{ - #ifdef __linux__ - - STFile* f = reinterpret_cast (handle); - delete f; - - #else - TORCH_CHECK(false, "fasttensors only supported on Linux"); - #endif -} - -STFile::STFile(const char* filename) -{ - #ifdef __linux__ - - file_descriptor = open(filename, O_RDONLY | O_DIRECT); - TORCH_CHECK(file_descriptor != -1, "Safetensors file I/O error"); - - struct stat sb; - auto res = fstat(file_descriptor, &sb); - TORCH_CHECK(res != -1, "Safetensors fstat failed"); - filesize = sb.st_size; - block_size = sb.st_blksize; - padded_size = DIVIDE(filesize, block_size) * block_size; - TORCH_CHECK(block_size <= MAX_BLOCK_SIZE, "Block size too large") - - #else - TORCH_CHECK(false, "fasttensors only supported on Linux"); - #endif -} - -STFile::~STFile() -{ - #ifdef __linux__ - - close(file_descriptor); - - #else - TORCH_CHECK(false, "fasttensors only supported on Linux"); - #endif -} - -void dec_lock(cudaStream_t stream, cudaError_t status, void *user_data) -{ - #ifdef __linux__ - STPage* p = (STPage*) user_data; - p->locks--; - #endif -} - -void STFile::load -( - torch::Tensor target, - size_t offset, - size_t length, - bool gpu -) -{ - #ifdef __linux__ - - safetensors_pinned_buffer(); - - #ifdef ST_DEBUG - printf("-- load tensor\n"); - DBGX2(offset, length); - DBGI(length); - #endif - - // Get cache pages - - size_t file_b = offset / PAGESIZE * PAGESIZE; - - /* doest appear to be utilized rn - size_t file_c = DIVIDE(offset + length, PAGESIZE) * PAGESIZE; - */ - - // Loop over pages - - size_t file_a = file_b; - size_t tensor_offset = 0; - - while (tensor_offset < length) - { - file_a = file_b; - file_b += PAGESIZE; - - STPage* page = get_cache_page(file_descriptor, block_size, filesize, file_a, file_b); - ssize_t left = offset - file_a; - if (left < 0) left = 0; - ssize_t right = offset + length - file_a; - if (right > PAGESIZE) right = PAGESIZE; - ssize_t copy_len = right - left; - - #ifdef ST_DEBUG - printf("-- copy chunk\n"); - DBGX3(left, right, copy_len); - DBGX(tensor_offset); - DBGI(copy_len); - #endif - - char* src = page->ptr + left; - char* dst = ((char*) target.data_ptr()) + tensor_offset; - - if (gpu) - { - page->locks++; - cudaMemcpyAsync(dst, src, copy_len, cudaMemcpyHostToDevice); - cudaStreamAddCallback(NULL, dec_lock, (void*) page, 0); - } - else - { - memcpy(dst, src, copy_len); - } - - //cudaDeviceSynchronize(); - - tensor_offset += copy_len; 
- } - - #else - TORCH_CHECK(false, "fasttensors only supported on Linux"); - #endif -} - -void safetensors_load -( - uintptr_t handle, - torch::Tensor target, - size_t offset, - size_t length -) -{ - #ifdef __linux__ - - STFile* f = reinterpret_cast (handle); - c10::optional device = torch::device_of(target); - - if (device.has_value() && device->type() == torch::kCPU) - { - f->load(target, offset, length, false); - } - else - { - const at::cuda::OptionalCUDAGuard device_guard(device); - f->load(target, offset, length, true); - } - - #else - TORCH_CHECK(false, "fasttensors only supported on Linux"); - #endif -} - -// Fallback routines for Windows - -void* read_buffer; -int read_buffer_refcount = 0; - -#define READ_BUFFER_SIZE (1024*1024) - -uintptr_t safetensors_open_fb(const char* filename) -{ - FILE* file = fopen(filename, "rb"); - TORCH_CHECK(file != nullptr, "Can't open safetensors file"); - - read_buffer_refcount++; - if (read_buffer_refcount == 1) - { - read_buffer = malloc(READ_BUFFER_SIZE); - } - - return reinterpret_cast (file); -} - -void safetensors_close_fb(uintptr_t handle) -{ - FILE* file = reinterpret_cast (handle); - fclose(file); - - read_buffer_refcount--; - if (read_buffer_refcount == 0) - { - free(read_buffer); - read_buffer = NULL; - } -} - -void safetensors_read_fb(uintptr_t handle, size_t beg, size_t size, torch::Tensor target) -{ - TORCH_CHECK(read_buffer, "No read buffer"); - - FILE* file = reinterpret_cast (handle); - - char* output = (char*) target.data_ptr(); - c10::optional device = torch::device_of(target); - bool target_cpu = (device.has_value() && device->type() == torch::kCPU); - - #ifdef __linux__ - int r = fseek(file, beg, SEEK_SET); - #else - int r = _fseeki64(file, static_cast<__int64>(beg), SEEK_SET); - #endif - TORCH_CHECK(!r, "Error seeking safetensors file"); - - if (target_cpu) - { - size_t bytes_read = fread(output, 1, size, file); - TORCH_CHECK(bytes_read == size, "Error reading safetensors file (EOF)"); - } - else - { - const at::cuda::OptionalCUDAGuard device_guard(device); - - size_t remaining = size; - while (remaining) - { - size_t chunk = READ_BUFFER_SIZE; - if (remaining < chunk) chunk = remaining; - - size_t bytes_read = fread(read_buffer, 1, chunk, file); - TORCH_CHECK(bytes_read == chunk, "Error reading safetensors file (EOF)"); - - cudaError_t cr = cudaMemcpy(output, read_buffer, chunk, cudaMemcpyHostToDevice); - TORCH_CHECK(cr == cudaSuccess, "Failed to copy tensor data to device memory"); - - output += chunk; - remaining -= chunk; - } - } -} - -void tensor_remap -( - torch::Tensor tensor, - torch::Tensor index -) -{ - TORCH_CHECK_SHAPES(tensor, 1, index, 0, 1); - TORCH_CHECK_DTYPE(tensor, kInt); - TORCH_CHECK_DTYPE(index, kInt); - - int rows = tensor.size(0); - int cols = tensor.size(1); - uint32_t* temp = (uint32_t*) calloc(cols, sizeof(int)); - uint32_t* a = (uint32_t*) tensor.data_ptr(); - uint32_t* idx = (uint32_t*) index.data_ptr(); - - for (int r = 0; r < rows; ++r) - { - memcpy(temp, a, sizeof(uint32_t) * cols); - for (int c = 0; c < cols; ++c) - { - *a++ = temp[idx[c]]; - } - } - free(temp); -} - -void tensor_remap_4bit -( - torch::Tensor tensor, - torch::Tensor index -) -{ - TORCH_CHECK_SHAPES(index, 0, tensor, 1, 8); - TORCH_CHECK_DTYPE(tensor, kInt); - TORCH_CHECK_DTYPE(index, kInt); - - int rows = tensor.size(0); - int cols = index.size(0); - uint32_t* temp = (uint32_t*) calloc(cols / 8, sizeof(int)); - uint32_t* a = (uint32_t*) tensor.data_ptr(); - uint32_t* idx = (uint32_t*) index.data_ptr(); - - for (int r = 0; r < 
rows; ++r) - { - memcpy(temp, a, sizeof(uint32_t) * cols / 8); - for (int c = 0; c < cols;) - { - uint32_t rv = 0; - for (int b = 0; b < 8; ++b, ++c) - { - uint32_t i = idx[c]; - uint32_t v = (temp[i / 8] >> ((i & 7) * 4) & 0x0f); - rv |= v << (b * 4); - } - *a++ = rv; - } - } - free(temp); -} diff --git a/exllamav2/exllamav2_ext/cpp/safetensors.h b/exllamav2/exllamav2_ext/cpp/safetensors.h deleted file mode 100644 index 4ad054b..0000000 --- a/exllamav2/exllamav2_ext/cpp/safetensors.h +++ /dev/null @@ -1,63 +0,0 @@ -#ifndef _safetensors_h -#define _safetensors_h - -#include -#include -#include -#include -#include -#include - -class STFile -{ -public: - STFile(const char* filename); - ~STFile(); - - void load - ( - torch::Tensor target, - size_t offset, - size_t length, - bool gpu - ); - - int file_descriptor; - size_t filesize; - size_t padded_size; - size_t block_size; - void* aligned_buffer; -}; - -void safetensors_pinned_buffer(); -void safetensors_free_pinned_buffer(); - -uintptr_t safetensors_open(const char* filename); -void safetensors_close(uintptr_t handle); - -void safetensors_load -( - uintptr_t handle, - torch::Tensor target, - size_t offset, - size_t length -); - -uintptr_t safetensors_open_fb(const char* filename); -void safetensors_close_fb(uintptr_t handle); -void safetensors_read_fb(uintptr_t handle, size_t beg, size_t size, torch::Tensor target); - -void tensor_remap -( - torch::Tensor tensor, - torch::Tensor index -); - -void tensor_remap_4bit -( - torch::Tensor tensor, - torch::Tensor index -); - - -#endif \ No newline at end of file diff --git a/exllamav2/exllamav2_ext/cpp/sampling.cpp b/exllamav2/exllamav2_ext/cpp/sampling.cpp index 072858d..8f9ae7e 100644 --- a/exllamav2/exllamav2_ext/cpp/sampling.cpp +++ b/exllamav2/exllamav2_ext/cpp/sampling.cpp @@ -806,6 +806,65 @@ int typical_cpu return num; } +int xtc_cpu +( + const int num_candidates, + float* temp_probs, + int* temp_indices, + bool* xtc_mask, + float xtc_probability, + float xtc_threshold +) +{ + profile_start("xtc_cpu"); + + int index = 0; + float x_mass = 0.0f; + int x_tokens = 0; + float min_prob = 2.0f; + int min_index = -1; + + // Sum up probs >= xtc_threshold and find the smalles such prob so it can be excluded + for (int i = 0; i < num_candidates; ++i) + { + if (temp_probs[i] >= xtc_threshold && xtc_mask[temp_indices[i]]) + { + x_tokens++; + x_mass += temp_probs[i]; + if (temp_probs[i] < min_prob) + { + min_prob = temp_probs[i]; + min_index = i; + } + } + } + + // The least probable token over the threshold is never excluded, so do nothing unless at least two tokens + // cross the threshold + if (x_tokens <= 1) return num_candidates; + + // Adjust the sum to account for the least probable token being skipped below + x_mass -= min_prob; + + // Adjust probabilities + for (int i = 0; i < num_candidates; ++i) + { + // If token is being excluded with probability p, its weight in the new distribution is scaled by 1-p + if (temp_probs[i + 1] >= xtc_threshold && xtc_mask[temp_indices[i]] && i != min_prob) + temp_probs[i] *= (1.0f - xtc_probability); + + // Scale any tokens we don't remove so so that we end up with a normalized distribution + else + temp_probs[i] = temp_probs[i] * (1.0f + xtc_probability * x_mass / (1.0f - x_mass)); + } + + profile_stop(); + + // Return unsorted probs since it doesn't affect the multinomial pick. 
Calling function will sort the probs + // if necessary + return num_candidates; +} + AVX2_TARGET_OPTIONAL int multinomial_cpu ( diff --git a/exllamav2/exllamav2_ext/cpp/sampling.h b/exllamav2/exllamav2_ext/cpp/sampling.h index 81d1339..814f90b 100644 --- a/exllamav2/exllamav2_ext/cpp/sampling.h +++ b/exllamav2/exllamav2_ext/cpp/sampling.h @@ -135,6 +135,16 @@ int post_softmax_temperature float temp_exponent ); +int xtc_cpu +( + const int num_candidates, + float* temp_probs, + int* temp_indices, + bool* xtc_mask, + float xtc_probability, + float xtc_threshold +); + int multinomial_cpu ( const int num_candidates, diff --git a/exllamav2/exllamav2_ext/cuda/pack_tensor.cu b/exllamav2/exllamav2_ext/cuda/pack_tensor.cu index 3a29f2f..fa6813a 100644 --- a/exllamav2/exllamav2_ext/cuda/pack_tensor.cu +++ b/exllamav2/exllamav2_ext/cuda/pack_tensor.cu @@ -36,6 +36,7 @@ __global__ void pack_rows_4_kernel void pack_rows_4_cuda ( + cudaStream_t stream, const uint16_t* input, uint32_t* output, const int rows, @@ -47,7 +48,7 @@ void pack_rows_4_cuda dim3 threads(BLOCKSIZE_X, BLOCKSIZE_Y); dim3 blocks(DIVIDE(out_columns, BLOCKSIZE_X), DIVIDE(rows, BLOCKSIZE_Y)); - pack_rows_4_kernel<<>>(input, output, rows, out_columns); + pack_rows_4_kernel<<>>(input, output, rows, out_columns); } // Pack rows: @@ -82,6 +83,7 @@ __global__ void pack_rows_6_kernel void pack_rows_6_cuda ( + cudaStream_t stream, const uint16_t* input, uint32_t* output, const int rows, @@ -93,7 +95,7 @@ void pack_rows_6_cuda dim3 threads(BLOCKSIZE_X, BLOCKSIZE_Y); dim3 blocks(DIVIDE(out_columns, BLOCKSIZE_X), DIVIDE(rows, BLOCKSIZE_Y)); - pack_rows_6_kernel<<>>(input, output, rows, out_columns); + pack_rows_6_kernel<<>>(input, output, rows, out_columns); } // Pack columns @@ -247,6 +249,7 @@ __global__ void pack_columns_kernel void pack_columns_cuda ( + cudaStream_t stream, const uint16_t* input, uint32_t* output, const int in_rows, @@ -258,11 +261,11 @@ void pack_columns_cuda dim3 threads(BLOCKSIZE_X, BLOCKSIZE_Y); dim3 blocks(DIVIDE(columns, BLOCKSIZE_X), DIVIDE(out_rows, BLOCKSIZE_Y)); - if (bits == 2) pack_columns_kernel<2><<>>(input, output, out_rows, columns); - if (bits == 3) pack_columns_kernel<3><<>>(input, output, out_rows, columns); - if (bits == 4) pack_columns_kernel<4><<>>(input, output, out_rows, columns); - if (bits == 5) pack_columns_kernel<5><<>>(input, output, out_rows, columns); - if (bits == 6) pack_columns_kernel<6><<>>(input, output, out_rows, columns); - if (bits == 8) pack_columns_kernel<8><<>>(input, output, out_rows, columns); + if (bits == 2) pack_columns_kernel<2><<>>(input, output, out_rows, columns); + if (bits == 3) pack_columns_kernel<3><<>>(input, output, out_rows, columns); + if (bits == 4) pack_columns_kernel<4><<>>(input, output, out_rows, columns); + if (bits == 5) pack_columns_kernel<5><<>>(input, output, out_rows, columns); + if (bits == 6) pack_columns_kernel<6><<>>(input, output, out_rows, columns); + if (bits == 8) pack_columns_kernel<8><<>>(input, output, out_rows, columns); } diff --git a/exllamav2/exllamav2_ext/cuda/pack_tensor.cuh b/exllamav2/exllamav2_ext/cuda/pack_tensor.cuh index 4231cbb..48c100e 100644 --- a/exllamav2/exllamav2_ext/cuda/pack_tensor.cuh +++ b/exllamav2/exllamav2_ext/cuda/pack_tensor.cuh @@ -8,6 +8,7 @@ void pack_rows_4_cuda ( + cudaStream_t stream, const uint16_t* input, uint32_t* output, const int rows, @@ -16,6 +17,7 @@ void pack_rows_4_cuda void pack_rows_6_cuda ( + cudaStream_t stream, const uint16_t* input, uint32_t* output, const int rows, @@ -24,6 +26,7 @@ void 
pack_rows_6_cuda void pack_columns_cuda ( + cudaStream_t stream, const uint16_t* input, uint32_t* output, const int in_rows, diff --git a/exllamav2/exllamav2_ext/cuda/q_matrix.cu b/exllamav2/exllamav2_ext/cuda/q_matrix.cu index df6d29a..40350e8 100644 --- a/exllamav2/exllamav2_ext/cuda/q_matrix.cu +++ b/exllamav2/exllamav2_ext/cuda/q_matrix.cu @@ -48,6 +48,8 @@ __global__ void shuffle_kernel QMatrix::QMatrix ( + cudaStream_t stream, + const int _device, const int _height, const int _width, @@ -115,7 +117,15 @@ QMatrix::QMatrix if (!is_gptq) { uint16_t* cpu_q_groups = (uint16_t*)calloc(groups * 2, sizeof(uint16_t)); - cudaMemcpy(cpu_q_groups, cuda_q_groups, groups * 2 * sizeof(uint16_t), cudaMemcpyDeviceToHost); + if (stream) + { + cudaMemcpyAsync(cpu_q_groups, cuda_q_groups, groups * 2 * sizeof(uint16_t), cudaMemcpyDeviceToHost, stream); + cudaDeviceSynchronize(); + } + else + { + cudaMemcpy(cpu_q_groups, cuda_q_groups, groups * 2 * sizeof(uint16_t), cudaMemcpyDeviceToHost); + } int row = 0; for (int i = 0; i < groups; i++) @@ -156,7 +166,7 @@ QMatrix::QMatrix if (_gptq_g_idx) { - if (!make_sequential(_gptq_g_idx)) + if (!make_sequential(_gptq_g_idx, stream)) { failed = true; //printf("FAIL\n"); @@ -182,7 +192,7 @@ QMatrix::QMatrix gridDim.x = DIVIDE(width, THREADS_X); gridDim.y = 1; - shuffle_kernel<<>>(cuda_q_weight, height, width, rows_8, rows_6, rows_5, rows_4, rows_3, rows_2); + shuffle_kernel<<>>(cuda_q_weight, height, width, rows_8, rows_6, rows_5, rows_4, rows_3, rows_2); } QMatrix::~QMatrix() @@ -584,7 +594,7 @@ __global__ void make_sequential_kernel w_new2[w_new2_row * w2_stride + w2_column] = dst; } -bool QMatrix::make_sequential(const uint32_t* cpu_g_idx) +bool QMatrix::make_sequential(const uint32_t* cpu_g_idx, cudaStream_t stream) { uint32_t* cuda_new_qweight = NULL; cudaError_t err = cudaMalloc(&cuda_new_qweight, height / 8 * width * sizeof(uint32_t)); diff --git a/exllamav2/exllamav2_ext/cuda/q_matrix.cuh b/exllamav2/exllamav2_ext/cuda/q_matrix.cuh index 464066e..9a98674 100644 --- a/exllamav2/exllamav2_ext/cuda/q_matrix.cuh +++ b/exllamav2/exllamav2_ext/cuda/q_matrix.cuh @@ -46,6 +46,8 @@ public: QMatrix ( + cudaStream_t stream, + const int _device, const int _height, const int _width, @@ -74,7 +76,7 @@ public: ~QMatrix(); void reconstruct(cudaStream_t stream, half* out, int row_a = 0, int row_b = 0); - bool make_sequential(const uint32_t* cpu_g_idx); + bool make_sequential(const uint32_t* cpu_g_idx, cudaStream_t stream); private: diff --git a/exllamav2/exllamav2_ext/cuda/quantize.cu b/exllamav2/exllamav2_ext/cuda/quantize.cu index 476139c..0076a3c 100644 --- a/exllamav2/exllamav2_ext/cuda/quantize.cu +++ b/exllamav2/exllamav2_ext/cuda/quantize.cu @@ -53,6 +53,7 @@ __global__ void quantize_rtn_kernel void quantize_rtn_cuda ( + cudaStream_t stream, float* weights, const float* scale, uint16_t* out_q, @@ -66,7 +67,7 @@ void quantize_rtn_cuda dim3 threads(BLOCKSIZE_X, 1); dim3 blocks(DIVIDE(columns, BLOCKSIZE_X), 1); - quantize_rtn_kernel<<>> + quantize_rtn_kernel<<>> ( weights, scale, @@ -135,6 +136,7 @@ __global__ void fused_quantize_adjust_kernel void fused_quantize_adjust_cuda ( + cudaStream_t stream, const float* weights, float* quant, const float* scale, @@ -151,7 +153,7 @@ void fused_quantize_adjust_cuda dim3 threads(BLOCKSIZE_X, 1); dim3 blocks(DIVIDE(columns, BLOCKSIZE_X), 1); - fused_quantize_adjust_kernel<<>> + fused_quantize_adjust_kernel<<>> ( weights, quant, @@ -216,7 +218,8 @@ __global__ void quantize_kernel void quantize_cuda ( - const float* input, + 
cudaStream_t stream, + const float* input, float* output, const float* scale, uint16_t* out_q, @@ -232,7 +235,7 @@ void quantize_cuda // DBGI2(rows, columns); // DBGF2(qzero, maxq); - quantize_kernel<<>> + quantize_kernel<<>> ( input, output, @@ -269,6 +272,7 @@ __global__ void adjust_error_row_kernel void adjust_error_row_cuda ( + cudaStream_t stream, const float* hessian_inv, float* error, const float* weights, @@ -281,7 +285,7 @@ void adjust_error_row_cuda dim3 threads(BLOCKSIZE_X, 1); dim3 blocks(DIVIDE(columns, BLOCKSIZE_X), 1); - adjust_error_row_kernel<<>>(hessian_inv, error, weights, quant, c, columns, hcolumns); + adjust_error_row_kernel<<>>(hessian_inv, error, weights, quant, c, columns, hcolumns); } __global__ void quantize_err_kernel @@ -334,6 +338,7 @@ __global__ void quantize_err_kernel void quantize_err_cuda ( + cudaStream_t stream, const float* input, float* output, const float* scale, @@ -353,7 +358,7 @@ void quantize_err_cuda // DBGI2(rows, columns); // DBGF2(qzero, maxq); - quantize_err_kernel<<>> + quantize_err_kernel<<>> ( input, output, @@ -399,6 +404,7 @@ __global__ void vv_mul_sub_kernel void vv_mul_sub_cuda ( + cudaStream_t stream, const float* x, const float* y, float* z, @@ -414,5 +420,5 @@ void vv_mul_sub_cuda gridDim.y = DIVIDE(x_size, BLOCKSIZE_Y); gridDim.z = 1; - vv_mul_sub_kernel<<>>(x, y, z, x_size, y_size); + vv_mul_sub_kernel<<>>(x, y, z, x_size, y_size); } diff --git a/exllamav2/exllamav2_ext/cuda/quantize.cuh b/exllamav2/exllamav2_ext/cuda/quantize.cuh index b291b10..b56863b 100644 --- a/exllamav2/exllamav2_ext/cuda/quantize.cuh +++ b/exllamav2/exllamav2_ext/cuda/quantize.cuh @@ -8,6 +8,7 @@ void quantize_cuda ( + cudaStream_t stream, const float* input, float* output, const float* scale, @@ -20,6 +21,7 @@ void quantize_cuda void adjust_error_row_cuda ( + cudaStream_t stream, const float* hessian_inv, float* error, const float* weights, @@ -31,6 +33,7 @@ void adjust_error_row_cuda void fused_quantize_adjust_cuda ( + cudaStream_t stream, const float* weights, float* quant, const float* scale, @@ -46,6 +49,7 @@ void fused_quantize_adjust_cuda void quantize_rtn_cuda ( + cudaStream_t stream, float* weights, const float* scale, uint16_t* out_q, @@ -58,6 +62,7 @@ void quantize_rtn_cuda void quantize_err_cuda ( + cudaStream_t stream, const float* input, float* output, const float* scale, @@ -73,6 +78,7 @@ void quantize_err_cuda void vv_mul_sub_cuda ( + cudaStream_t stream, const float* x, const float* y, float* z, diff --git a/exllamav2/exllamav2_ext/ext_bindings.cpp b/exllamav2/exllamav2_ext/ext_bindings.cpp index da8d2fd..22662ca 100644 --- a/exllamav2/exllamav2_ext/ext_bindings.cpp +++ b/exllamav2/exllamav2_ext/ext_bindings.cpp @@ -10,7 +10,7 @@ #include "ext_quant.h" #include "ext_sampling.h" -#include "ext_safetensors.h" +#include "ext_stloader.h" #include "ext_qmatrix.h" #include "ext_qattn.h" #include "ext_qmlp.h" @@ -45,16 +45,9 @@ PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) m.def("dump_profile_results", &dump_profile_results, "dump_profile_results"); m.def("partial_strings_match", &partial_strings_match, "partial_strings_match"); - // safetensors + // safetensors/stloader - m.def("safetensors_open", &safetensors_open, "safetensors_open"); - m.def("safetensors_open_fb", &safetensors_open_fb, "safetensors_open_fb"); - m.def("safetensors_close", &safetensors_close, "safetensors_close"); - m.def("safetensors_close_fb", &safetensors_close_fb, "safetensors_close_fb"); - m.def("safetensors_load", &safetensors_load, "safetensors_load"); - 
m.def("safetensors_pinned_buffer", &safetensors_pinned_buffer, "safetensors_pinned_buffer"); - m.def("safetensors_free_pinned_buffer", &safetensors_free_pinned_buffer, "safetensors_free_pinned_buffer"); - m.def("safetensors_read_fb", &safetensors_read_fb, "safetensors_read_fb"); + m.def("stloader_read", &stloader_read, "stloader_read"); m.def("tensor_remap", &tensor_remap, "tensor_remap"); m.def("tensor_remap_4bit", &tensor_remap_4bit, "tensor_remap_4bit"); @@ -68,6 +61,7 @@ PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) m.def("gemm_half_q_half_tp", &gemm_half_q_half_tp, "gemm_half_q_half_tp"); m.def("matrix_fp16_to_q4", &matrix_fp16_to_q4, "matrix_fp16_to_q4"); m.def("matrix_q4_to_fp16", &matrix_q4_to_fp16, "matrix_q4_to_fp16"); + m.def("make_group_map", &make_group_map, "make_group_map"); // qattn diff --git a/exllamav2/exllamav2_ext/ext_qmatrix.cpp b/exllamav2/exllamav2_ext/ext_qmatrix.cpp index 5baa3fb..b43223d 100644 --- a/exllamav2/exllamav2_ext/ext_qmatrix.cpp +++ b/exllamav2/exllamav2_ext/ext_qmatrix.cpp @@ -80,8 +80,12 @@ uintptr_t make_q_matrix TORCH_CHECK(temp_dq.size(0) >= dq_req, "Insufficient size of temp_dq buffer") } + const at::cuda::OptionalCUDAGuard device_guard(device_of(q_weight)); + cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream(); + QMatrix* m = new QMatrix ( + stream, device, height, width, @@ -149,8 +153,12 @@ uintptr_t make_q_matrix_split TORCH_CHECK(false, "Tensor split not implemented for GPTQ matrices"); } + const at::cuda::OptionalCUDAGuard device_guard(device_of(q_weight)); + cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream(); + QMatrix* m = new QMatrix ( + stream, device, height, width, @@ -328,4 +336,26 @@ void matrix_fp16_to_q4 (half*) scales.data_ptr(), numel ); +} + +torch::Tensor make_group_map(torch::Tensor& q_groups, int num_qrows) +{ + TORCH_CHECK_DTYPE(q_groups, kShort); + int num_groups = q_groups.size(0) / 2; + int16_t* gr = (int16_t*) q_groups.data_ptr(); + + std::vector group_map; + for (int i = 0; i < num_groups; ++i) + { + int bits = gr[i * 2]; + int qrows = i < num_groups - 1 ? 
gr[i * 2 + 3] - gr[i * 2 + 1] : num_qrows - gr[i * 2 + 1]; + int rows = (qrows * 32) / bits; + for (int j = 0; j < rows; ++j) + { + group_map.push_back(i); + group_map.push_back(rows - j); + } + } + + return torch::tensor(group_map, torch::kShort); } \ No newline at end of file diff --git a/exllamav2/exllamav2_ext/ext_qmatrix.h b/exllamav2/exllamav2_ext/ext_qmatrix.h index 2e89e64..887c601 100644 --- a/exllamav2/exllamav2_ext/ext_qmatrix.h +++ b/exllamav2/exllamav2_ext/ext_qmatrix.h @@ -1,4 +1,6 @@ +#include + uintptr_t make_q_matrix ( torch::Tensor q_weight, @@ -76,4 +78,7 @@ void matrix_fp16_to_q4 torch::Tensor scales ); +torch::Tensor make_group_map(torch::Tensor& q_groups, int num_qrows); + + diff --git a/exllamav2/exllamav2_ext/ext_quant.cpp b/exllamav2/exllamav2_ext/ext_quant.cpp index e76e56e..38e7b43 100644 --- a/exllamav2/exllamav2_ext/ext_quant.cpp +++ b/exllamav2/exllamav2_ext/ext_quant.cpp @@ -28,6 +28,7 @@ void pack_rows_4 ) { const at::cuda::OptionalCUDAGuard device_guard(device_of(input)); + cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream(); TORCH_CHECK_DTYPE(input, kShort); TORCH_CHECK_DTYPE(output, kInt); @@ -39,6 +40,7 @@ void pack_rows_4 pack_rows_4_cuda ( + stream, (uint16_t*) input.data_ptr(), (uint32_t*) output.data_ptr(), rows, @@ -54,6 +56,7 @@ void pack_columns ) { const at::cuda::OptionalCUDAGuard device_guard(device_of(input)); + cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream(); TORCH_CHECK_DTYPE(input, kShort); TORCH_CHECK_DTYPE(output, kInt); @@ -67,6 +70,7 @@ void pack_columns pack_columns_cuda ( + stream, (uint16_t*) input.data_ptr(), (uint32_t*) output.data_ptr(), in_rows, @@ -100,12 +104,14 @@ void quantize_err TORCH_CHECK(output.size(0) == p_grid + 1, "Output vector shape doesn't match grid") const at::cuda::OptionalCUDAGuard device_guard(device_of(input)); + cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream(); int rows = input.size(0); int columns = input.size(1); quantize_err_cuda ( + stream, (float*) input.data_ptr(), (float*) output.data_ptr(), (float*) scale.data_ptr(), @@ -136,11 +142,15 @@ void quantize TORCH_CHECK_SHAPES(input, 1, output, 1, 1); TORCH_CHECK_SHAPES(input, 1, scale, 0, 1); + const at::cuda::OptionalCUDAGuard device_guard(device_of(input)); + cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream(); + int rows = input.size(0); int columns = input.size(1); quantize_cuda ( + stream, (float*) input.data_ptr(), (float*) output.data_ptr(), (float*) scale.data_ptr(), diff --git a/exllamav2/exllamav2_ext/ext_safetensors.cpp b/exllamav2/exllamav2_ext/ext_safetensors.cpp deleted file mode 100644 index e32a3d8..0000000 --- a/exllamav2/exllamav2_ext/ext_safetensors.cpp +++ /dev/null @@ -1,18 +0,0 @@ -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include "config.h" -#include "ext_safetensors.h" - -#include "cpp/safetensors.h" - -#include "cpp/util.h" - - diff --git a/exllamav2/exllamav2_ext/ext_safetensors.h b/exllamav2/exllamav2_ext/ext_safetensors.h deleted file mode 100644 index 9a4e497..0000000 --- a/exllamav2/exllamav2_ext/ext_safetensors.h +++ /dev/null @@ -1,2 +0,0 @@ - -#include "cpp/safetensors.h" \ No newline at end of file diff --git a/exllamav2/exllamav2_ext/ext_sampling.cpp b/exllamav2/exllamav2_ext/ext_sampling.cpp index 5c54ae1..2bf312b 100644 --- a/exllamav2/exllamav2_ext/ext_sampling.cpp +++ b/exllamav2/exllamav2_ext/ext_sampling.cpp @@ -64,6 +64,23 @@ void apply_rep_penalty Py_END_ALLOW_THREADS } +void dbg_candidates +( + int 
num_candidates, + float* temp_probs, + int* temp_indices +) +{ + float sum = 0.0f; + for (int i = 0; i < num_candidates; ++i) + { + DBGIF(temp_indices[i], temp_probs[i]); + sum += temp_probs[i]; + } + DBGF(sum); + printf("----------\n"); +} + std::vector sample_basic ( torch::Tensor logits, // shape [bsz, 1, vocab_size] @@ -85,6 +102,9 @@ std::vector sample_basic float mirostat_tau, float mirostat_eta, float post_temperature, + torch::Tensor xtc_mask, + float xtc_probability = 0.0f, + float xtc_threshold = 0.1f, float min_temp = 0, float max_temp = 0.0f, float temp_exponent = 1.0f, @@ -191,7 +211,24 @@ std::vector sample_basic num_candidates = post_softmax_temperature(num_candidates, temp_probs, temp_indices, post_temperature, min_temp, max_temp, temp_exponent); } - if (num_probs || (skew != 0.0f)) + if (xtc_probability > 0.0f) + { + // dbg_candidates(num_candidates, temp_probs, temp_indices); + + num_candidates = xtc_cpu + ( + num_candidates, + temp_probs, + temp_indices, + (bool*) xtc_mask.data_ptr(), + xtc_probability, + xtc_threshold + ); + + // dbg_candidates(num_candidates, temp_probs, temp_indices); + } + + if (num_probs || skew != 0.0f) { num_candidates = pre_sort_descending(num_candidates, temp_probs, temp_indices); sort_descending(num_candidates, temp_probs, temp_indices, num_probs); @@ -230,19 +267,6 @@ std::vector sample_basic random_s = powf(random, expf(-skew)); } -// { -// float sum = 0.0f; -// float pmin = temp_probs[0]; -// float pmax = pmin; -// for (int i = 0; i < num_candidates; ++i) -// { -// if (temp_probs[i] < pmin) pmin = temp_probs[i]; -// if (temp_probs[i] > pmax) pmax = temp_probs[i]; -// sum += temp_probs[i]; -// } -// DBGF4(pmin, pmax, sum, random_s); -// } - // Scale random sampling point a little to account for FP32 rounding errors during softmax. 
Probs // can potentially sum to slightly less than 1 for large-vocab models float random_s_adj = random_s * 0.9998; diff --git a/exllamav2/exllamav2_ext/ext_sampling.h b/exllamav2/exllamav2_ext/ext_sampling.h index 10664a3..45d01a5 100644 --- a/exllamav2/exllamav2_ext/ext_sampling.h +++ b/exllamav2/exllamav2_ext/ext_sampling.h @@ -33,6 +33,9 @@ std::vector sample_basic float mirostat_tau, float mirostat_eta, float post_temperature, + torch::Tensor xtc_mask, + float xtc_probability, + float xtc_threshold, float min_temp, float max_temp, float temp_exponent, diff --git a/exllamav2/exllamav2_ext/ext_stloader.cpp b/exllamav2/exllamav2_ext/ext_stloader.cpp new file mode 100644 index 0000000..9700648 --- /dev/null +++ b/exllamav2/exllamav2_ext/ext_stloader.cpp @@ -0,0 +1,212 @@ +#include "ext_stloader.h" +#include "cpp/util.h" + +#include +#include +#include +#include +#include +#include + +void stloader_read +( + const char* filename, + size_t offset, + size_t size, + torch::Tensor target +) +{ + c10::optional device = torch::device_of(target); + bool target_cpu = (device.has_value() && device->type() == torch::kCPU); + cudaStream_t stream; + + // Buffers + + uint8_t* load_buffer; + uint8_t* cuda_buffer; + if (target_cpu) + { + load_buffer = (uint8_t*) target.data_ptr(); + cuda_buffer = nullptr; + } + else + { + load_buffer = (uint8_t*) malloc(size); + TORCH_CHECK(load_buffer, "Can't allocate buffer for tensor"); + cuda_buffer = (uint8_t*) target.data_ptr(); + cudaSetDevice(device.value().index()); + stream = at::cuda::getCurrentCUDAStream(device.value().index()).stream(); + } + + // Synchronization + + Py_BEGIN_ALLOW_THREADS + + volatile bool load_failed = false; + std::mutex mtx; + std::deque> dq; + std::condition_variable cv; + + // Load chunks + + auto load_worker = [&] (size_t pos_a) + { + FILE* file = fopen(filename, "rb"); + if (!file) goto error; + + while (pos_a < size && !load_failed) + { + size_t pos_b = pos_a + STLOADER_BLOCK_SIZE; + if (pos_b > size) pos_b = size; + + #ifdef __linux__ + ssize_t br = pread(fileno(file), load_buffer + pos_a, pos_b - pos_a, offset + pos_a); + if (br != pos_b - pos_a) goto error; + int sr = fseek(file, offset + pos_a, SEEK_SET); + #else + int sr = _fseeki64(file, static_cast<__int64>(offset + pos_a), SEEK_SET); + if (sr) goto error; + size_t br = fread(load_buffer + pos_a, 1, pos_b - pos_a, file); + if (br != pos_b - pos_a) goto error; + #endif + + { + std::lock_guard lock(mtx); + dq.push_back(std::pair(pos_a, pos_b)); + cv.notify_one(); + } + + // DBGX3(pos_a, pos_b, br); + pos_a += STLOADER_THREADS * STLOADER_BLOCK_SIZE; + } + + fclose(file); + return; + + error: + load_failed = true; + }; + + // Copy chunks to device + + auto copy_worker = [&] () + { + size_t total_blocks = DIVIDE(size, STLOADER_BLOCK_SIZE); + while (total_blocks && !load_failed) + { + size_t pos_a, pos_b; + { + std::unique_lock lock(mtx); + cv.wait(lock, [&dq] { return !dq.empty(); }); + + auto pop = dq.front(); + dq.pop_front(); + total_blocks--; + pos_a = std::get<0>(pop); + pos_b = std::get<1>(pop); + + while (!dq.empty() && std::get<0>(dq.front()) == pos_b) + { + pop = dq.front(); + dq.pop_front(); + pos_b = std::get<1>(pop); + total_blocks--; + } + } + + // DBGX2(pos_a, pos_b); + cudaError_t cr = cudaMemcpyAsync + ( + cuda_buffer + pos_a, + load_buffer + pos_a, + pos_b - pos_a, + cudaMemcpyHostToDevice, + stream + ); + if (cr != cudaSuccess) goto error; + } + return; + + error: + load_failed = true; + }; + + std::vector threads; + for (size_t i = 0; i < STLOADER_THREADS && i * 
STLOADER_BLOCK_SIZE < size; ++i) + threads.emplace_back(load_worker, i * STLOADER_BLOCK_SIZE); + if (cuda_buffer) + threads.emplace_back(copy_worker); + for (auto& thread : threads) + thread.join(); + + TORCH_CHECK(!load_failed, "I/O error reading tensor"); + + if (!target_cpu) + { + free(load_buffer); + cudaDeviceSynchronize(); + } + + Py_END_ALLOW_THREADS +} + +void tensor_remap +( + torch::Tensor tensor, + torch::Tensor index +) +{ + TORCH_CHECK_SHAPES(tensor, 1, index, 0, 1); + TORCH_CHECK_DTYPE(tensor, kInt); + TORCH_CHECK_DTYPE(index, kInt); + + int rows = tensor.size(0); + int cols = tensor.size(1); + uint32_t* temp = (uint32_t*) calloc(cols, sizeof(int)); + uint32_t* a = (uint32_t*) tensor.data_ptr(); + uint32_t* idx = (uint32_t*) index.data_ptr(); + + for (int r = 0; r < rows; ++r) + { + memcpy(temp, a, sizeof(uint32_t) * cols); + for (int c = 0; c < cols; ++c) + { + *a++ = temp[idx[c]]; + } + } + free(temp); +} + +void tensor_remap_4bit +( + torch::Tensor tensor, + torch::Tensor index +) +{ + TORCH_CHECK_SHAPES(index, 0, tensor, 1, 8); + TORCH_CHECK_DTYPE(tensor, kInt); + TORCH_CHECK_DTYPE(index, kInt); + + int rows = tensor.size(0); + int cols = index.size(0); + uint32_t* temp = (uint32_t*) calloc(cols / 8, sizeof(int)); + uint32_t* a = (uint32_t*) tensor.data_ptr(); + uint32_t* idx = (uint32_t*) index.data_ptr(); + + for (int r = 0; r < rows; ++r) + { + memcpy(temp, a, sizeof(uint32_t) * cols / 8); + for (int c = 0; c < cols;) + { + uint32_t rv = 0; + for (int b = 0; b < 8; ++b, ++c) + { + uint32_t i = idx[c]; + uint32_t v = (temp[i / 8] >> ((i & 7) * 4) & 0x0f); + rv |= v << (b * 4); + } + *a++ = rv; + } + } + free(temp); +} diff --git a/exllamav2/exllamav2_ext/ext_stloader.h b/exllamav2/exllamav2_ext/ext_stloader.h new file mode 100644 index 0000000..a2c92ca --- /dev/null +++ b/exllamav2/exllamav2_ext/ext_stloader.h @@ -0,0 +1,31 @@ + +#include +#include +#include +#include +#include +#include +#include + +#define STLOADER_BLOCK_SIZE (1*1024*1024) +#define STLOADER_THREADS 8 + +void stloader_read +( + const char* filename, + size_t offset, + size_t size, + torch::Tensor target +); + +void tensor_remap +( + torch::Tensor tensor, + torch::Tensor index +); + +void tensor_remap_4bit +( + torch::Tensor tensor, + torch::Tensor index +); \ No newline at end of file diff --git a/exllamav2/ext.py b/exllamav2/ext.py index 91b6f4f..4cfa428 100644 --- a/exllamav2/ext.py +++ b/exllamav2/ext.py @@ -6,6 +6,7 @@ import sys import platform import threading from exllamav2.util import get_basic_progress +from exllamav2.compat import safe_move_tensor extension_name = "exllamav2_ext" verbose = False # Print wall of text when compiling @@ -213,7 +214,7 @@ if build_jit: "ext_qmlp.cpp", "ext_quant.cpp", "ext_rope.cpp", - "ext_safetensors.cpp", + "ext_stloader.cpp", "ext_sampling.cpp", "ext_element.cpp", "ext_tp.cpp", @@ -250,7 +251,6 @@ if build_jit: "cpp/generator.cpp", "cpp/sampling.cpp", "cpp/sampling_avx2.cpp", - "cpp/safetensors.cpp" ] sources = [os.path.join(sources_dir, s) for s in sources_] @@ -298,13 +298,11 @@ none_tensor = torch.empty((1, 1), device = "meta") # Group map needed for irregular group sizes -def make_group_map(q_groups: torch.Tensor, num_qrows: int) -> torch.Tensor: - +def make_group_map_py(q_groups: torch.Tensor, num_qrows: int) -> torch.Tensor: gr = q_groups.tolist() group_map = [] num_groups = len(gr) // 2 row = 0 - for i in range(num_groups): bits = gr[i * 2] if i < num_groups - 1: @@ -315,9 +313,12 @@ def make_group_map(q_groups: torch.Tensor, num_qrows: int) -> 
torch.Tensor: for j in range(rows): group_map += [i] group_map += [rows - j] - return torch.tensor(group_map, dtype = torch.short, device = q_groups.device) +def make_group_map(q_groups: torch.Tensor, num_qrows: int) -> torch.Tensor: + group_map = ext_c.make_group_map(q_groups.cpu(), num_qrows) + return safe_move_tensor(group_map, q_groups.device) + # Create Q matrix diff --git a/exllamav2/fasttensors.py b/exllamav2/fasttensors.py deleted file mode 100644 index db9a9a3..0000000 --- a/exllamav2/fasttensors.py +++ /dev/null @@ -1,252 +0,0 @@ -from __future__ import annotations -import torch -from safetensors import safe_open -import numpy as np -import json -from exllamav2.ext import exllamav2_ext as ext_c -import os - -def convert_dtype(dt: str): - """ - :param dt: - Datatype string as used by safetensors - - :return: - Torch type, element size in bytes - """ - if dt == "I32": return torch.int, 4 - elif dt == "I16": return torch.short, 2 - elif dt == "F16": return torch.float16, 2 - elif dt == "BF16": return torch.bfloat16, 2 - elif dt == "F32": return torch.float, 4 - else: raise ValueError(f"Unknown dtype {dt}") - -global_stfiles = [] -global_cm = {} -global_tensorcache = [] -num_cached_tensors = 4 - -def cleanup_stfiles(): - global global_stfiles, global_cm - - for stf in global_stfiles: - stf.close() - global_stfiles = [] - - for f in global_cm.values(): - f.__exit__(None, None, None) - global_cm = {} - - ext_c.safetensors_free_pinned_buffer() - - -class STFile: - - filename: str - header: dict - header_size: int - metadata: object - handle: int - handle_fb: int - fast: bool - st_context = None - tensor_remap: dict | None - - def __init__( - self, - filename: str, - fast: bool = True, - keymap: list[tuple[str, str]] = None - ): - global global_stfiles - - self.metadata = None - self.handle = 0 - self.handle_fb = 0 - self.filename = filename - self.st_context = None - self.tensor_remap = None - - self.read_dict() - - if keymap: - self.tensor_remap = {} - nheader = {} - for key in self.header.keys(): - nkey = key - for z in keymap: - if z[0].startswith("$") and nkey.startswith(z[0][1:]): - nkey = ("$" + nkey).replace(z[0], z[1]) - else: - nkey = nkey.replace(z[0], z[1]) - nheader[nkey] = self.header[key] - self.tensor_remap[nkey] = key - self.header = nheader - - # TODO: Create platform-independent multithreaded loader to replace direct IO loader on Linux and fallback - # cstdio loader on Windows - - if fast and os.name == "nt": - self.fast_fb = True - fast = False - else: - self.fast_fb = False - - self.fast = fast - if self.fast: - self.handle = ext_c.safetensors_open(filename) - if self.fast_fb: - self.handle_fb = ext_c.safetensors_open_fb(filename) - - global_stfiles.append(self) - - - @staticmethod - def open( - filename, - fast = True, - keymap: list[tuple[str, str]] = None - ) -> STFile: - """ - Open safetensors file, scan header and retain handle. 
- - :param filename: - Filename - - :param fast: - Use fast (direct I/O) codepath - - :param keymap: - List of (a, b) tuples for string replacements in key index - - :return: - STFile object - """ - - global global_stfiles - for f in global_stfiles: - if f.filename == filename: return f - return STFile(filename, fast, keymap) - - - def close(self): - """ - Close file handle (if necessary) - """ - if self.fast: ext_c.safetensors_close(self.handle) - if self.fast_fb: ext_c.safetensors_close_fb(self.handle_fb) - - - def read_dict(self): - with open(self.filename, "rb") as fp: - header_size = np.fromfile(fp, dtype = np.int64, count = 1).item() - header_json = fp.read(header_size) - self.header = json.loads(header_json.decode("utf-8")) - self.header_size = fp.tell() - if "__metadata__" in self.header: - self.metadata = self.header["__metadata__"] - del self.header["__metadata__"] - - - def get_dict(self) -> dict: - return self.header - - - def get_metadata(self) -> object: - return self.metadata - - - def measure(self, key): - """ - :param key: - Tensor key - - :return: - Byte size of tensor - """ - v = self.header[key] - data_offsets = v["data_offsets"] - length = data_offsets[1] - data_offsets[0] - return length - - - def get_cm(self, device): - global global_cm - - cm_key = self.filename + "::" + device - - if cm_key in global_cm: - return global_cm[cm_key] - - f = safe_open(self.filename, framework = "pt", device = device) - f.__enter__() - global_cm[cm_key] = f - return f - - - def get_tensor( - self, - key: str, - device, - not_fast: bool = False, - cached: bool = False, - out_dtype = None - ) -> torch.Tensor: - global global_tensorcache - - torch.cuda.synchronize() - - if device != "cpu": - torch.cuda.set_stream(torch.cuda.default_stream(device)) - - if self.tensor_remap and (not_fast or not self.fast): - key = self.tensor_remap[key] - - if cached: - cachekey = self.filename + "::" + key + "::" + device - for (k, v) in global_tensorcache: - if k == cachekey: return v - - if not_fast or (not self.fast and not self.fast_fb): - - # with safe_open(self.filename, framework = "pt", device = device) as f: - f = self.get_cm(device) - tensor = f.get_tensor(key) - - elif self.fast_fb: - - h = self.header[key] - dtype, esize = convert_dtype(h["dtype"]) - beg, end = h["data_offsets"] - size = end - beg - numel = size // esize - shape = h["shape"] - tensor = torch.zeros(shape, dtype = dtype, device = device) - assert tensor.is_contiguous, "Non-contiguous tensor" - ext_c.safetensors_read_fb(self.handle_fb, beg + self.header_size, size, tensor) - - else: - - v = self.header[key] - dtt, dts = convert_dtype(v["dtype"]) - sh = v["shape"] - data_offsets = v["data_offsets"] - offset = data_offsets[0] + self.header_size - length = data_offsets[1] - data_offsets[0] - assert np.prod(sh) * dts == length, f"Tensor shape doesn't match storage size: {key}" - if device != "cpu": - torch.cuda.set_stream(torch.cuda.default_stream(device)) - tensor = torch.empty(sh, device = device, dtype = dtt) - ext_c.safetensors_load(self.handle, tensor, offset, length) - - if out_dtype: - tensor = tensor.to(out_dtype) - - if cached: - if len(global_tensorcache) >= num_cached_tensors: - global_tensorcache = global_tensorcache[1:] - global_tensorcache.append((cachekey, tensor)) - - torch.cuda.synchronize() - - return tensor diff --git a/exllamav2/generator/sampler.py b/exllamav2/generator/sampler.py index 9f795c8..dcad0ee 100644 --- a/exllamav2/generator/sampler.py +++ b/exllamav2/generator/sampler.py @@ -94,6 +94,10 @@ class 
ExLlamaV2Sampler: ngram_index: int = 0 ngram_history: deque[int] = field(default_factory = deque) + xtc_probability: float = 0.0 # 0 to disable + xtc_threshold: float = 0.1 + xtc_ignore_tokens: frozenset[int] | None = None + @staticmethod def greedy(**kwargs) -> ExLlamaV2Sampler.Settings: defaults = { @@ -127,6 +131,9 @@ class ExLlamaV2Sampler: c.dry_sequence_breakers = self.dry_sequence_breakers c.dry_max_ngram = self.dry_max_ngram c.filters = [] + c.xtc_probability = self.xtc_probability + c.xtc_threshold = self.xtc_threshold + c.xtc_ignore_tokens = self.xtc_ignore_tokens return c @@ -264,6 +271,38 @@ class ExLlamaV2Sampler: penalties = torch.tensor([[[penalty * node.value for node in penalty_tokens.values()]]], dtype = torch.float) logits.scatter_add_(-1, indices, penalties) + + @staticmethod + @lru_cache(10) + def get_default_xtc_mask_tokens( + tokenizer: ExLlamaV2Tokenizer, + ) -> frozenset[int]: + result = set() + xtc_mask_chars = r"\n" + pattern = re.compile(r"[" + xtc_mask_chars + "]") + pieces = tokenizer.get_id_to_piece_list(include_special_tokens = True) + for t in range(len(pieces)): + if bool(pattern.search(pieces[t])): + result.add(t) + for t in tokenizer.extended_id_to_piece.keys(): + result.add(t) + return frozenset(result) + + + @staticmethod + @lru_cache(10) + def get_xtc_mask_tensor( + tokenizer: ExLlamaV2Tokenizer, + vocab_size: int, + xtc_mask_tokens: frozenset[int] | None + ): + if xtc_mask_tokens is None: + xtc_mask_tokens = ExLlamaV2Sampler.get_default_xtc_mask_tokens(tokenizer) + mask = torch.ones((vocab_size,), dtype = torch.bool) + mask[list(xtc_mask_tokens)] = False + return mask + + @staticmethod # @profile def sample( @@ -465,6 +504,14 @@ class ExLlamaV2Sampler: if vs < logits.shape[-1]: logits[:, :, vs:] = float("-inf") + # XTC mask + + xtc_mask = none_tensor + if settings.xtc_probability > 0.0: + xtc_mask = ExLlamaV2Sampler.get_xtc_mask_tensor( + tokenizer, logits.shape[-1], settings. 
xtc_ignore_tokens + ) + # Sampling output_tokens = torch.empty((batch_size, 1), dtype = torch.long) @@ -498,6 +545,9 @@ class ExLlamaV2Sampler: settings.mirostat_tau, settings.mirostat_eta, settings.temperature if settings.temperature_last else 1.0, + xtc_mask, + settings.xtc_probability, + settings.xtc_threshold, settings.min_temp, settings.max_temp, settings.temp_exponent, @@ -510,7 +560,7 @@ class ExLlamaV2Sampler: # Stop condition from filters end_filter = False - if len(filters) > 0 and output_tokens[0].item() in end_tokens: + if len(filters) > 0 and end_tokens is not None and output_tokens[0].item() in end_tokens: end_filter = True return output_tokens, output_ktokens, output_kprobs, output_probs, end_filter diff --git a/exllamav2/linear.py b/exllamav2/linear.py index 5d6855d..5384727 100644 --- a/exllamav2/linear.py +++ b/exllamav2/linear.py @@ -560,16 +560,16 @@ class ExLlamaV2Linear(ExLlamaV2Module): w = { "q_scale": safe_move_tensor(self.q_tensors["q_scale"][:, a // 8:b // 8], idx).contiguous(), - "q_scale_max": safe_move_tensor(self.q_tensors["q_scale_max"], idx).contiguous(), - "q_group_map": safe_move_tensor(self.q_tensors["q_group_map"], idx).contiguous(), - "q_groups": safe_move_tensor(self.q_tensors["q_groups"], idx).contiguous(), + "q_scale_max": safe_move_tensor(self.q_tensors["q_scale_max"], idx), + "q_group_map": safe_move_tensor(self.q_tensors["q_group_map"], idx), + "q_groups": safe_move_tensor(self.q_tensors["q_groups"], idx), "q_weight": safe_move_tensor(self.q_tensors["q_weight"][:, a:b], idx).contiguous() } if "q_perm" in self.q_tensors: w.update({ - "q_perm": safe_move_tensor(self.q_tensors["q_perm"], idx).contiguous(), - "q_invperm": safe_move_tensor(self.q_tensors["q_invperm"], idx).contiguous(), + "q_perm": safe_move_tensor(self.q_tensors["q_perm"], idx), + "q_invperm": safe_move_tensor(self.q_tensors["q_invperm"], idx), }) if "bias" in self.q_tensors: diff --git a/exllamav2/model.py b/exllamav2/model.py index 3fb4f5b..b5ba5ab 100644 --- a/exllamav2/model.py +++ b/exllamav2/model.py @@ -46,7 +46,7 @@ from exllamav2.parallel_decoder import ExLlamaV2ParallelDecoder from exllamav2.embedding import ExLlamaV2Embedding from exllamav2.pos_embedding import ExLlamaV2PosEmbedding from exllamav2.compat import safe_move_tensor -from exllamav2.fasttensors import cleanup_stfiles +from exllamav2.stloader import cleanup_stfiles from exllamav2.device import ExLlamaV2DeviceContext, set_device_streams from exllamav2.tensor_p import TPContext, BROADCAST_VC import gc @@ -299,6 +299,7 @@ class ExLlamaV2: callback_gen: Callable[[int, int], None] | None = None ): with torch.inference_mode(): + set_device_streams() stats_ = self.set_device_map(gpu_split or [99999]) @@ -406,6 +407,7 @@ class ExLlamaV2: # Load module weights with torch.inference_mode(): + set_device_streams() for idx in range(len(self.modules)): module = self.modules[idx] @@ -514,6 +516,7 @@ class ExLlamaV2: loras = None # TODO: Autosplit load with LoRAs with torch.inference_mode(): + set_device_streams() self.device_context = [] @@ -578,7 +581,11 @@ class ExLlamaV2: module.set_device_idx(current_device) - hidden_state_backup = safe_move_tensor(hidden_state, "cpu").clone() + hidden_state_backup = hidden_state + # if hidden_state.device == "cpu": + # hidden_state_backup = hidden_state.clone() + # else: + # hidden_state_backup = safe_move_tensor(hidden_state, "cpu").clone() try: if isinstance(module, ExLlamaV2Attention) or \ diff --git a/exllamav2/model_init.py b/exllamav2/model_init.py index f03c3a3..fdaa160 100644 --- 
a/exllamav2/model_init.py +++ b/exllamav2/model_init.py @@ -22,7 +22,7 @@ def add_args(parser): parser.add_argument("-lm", "--low_mem", action = "store_true", help = "Enable VRAM optimizations, potentially trading off speed") parser.add_argument("-ept", "--experts_per_token", type = int, help = "Override MoE model's default number of experts per token") parser.add_argument("-lq4", "--load_q4", action = "store_true", help = "Load weights in Q4 mode") - parser.add_argument("-fst", "--fast_safetensors", action = "store_true", help = "Use alternative safetensors loader (with direct I/O when available)") + parser.add_argument("-fst", "--fast_safetensors", action = "store_true", help = "Deprecated (does nothing)") parser.add_argument("-ic", "--ignore_compatibility", action = "store_true", help = "Do not override model config options in case of compatibility issues") parser.add_argument("-chunk", "--chunk_size", type = int, help = "Chunk size ('input length')") @@ -43,7 +43,6 @@ def print_options(args): if args.no_sdpa: print_opts += ["no_sdpa"] if args.no_graphs: print_opts += ["no_graphs"] if args.low_mem: print_opts += ["low_mem"] - if hasattr(args, "fast_safetensors") and args.fast_safetensors: print_opts += ["fast_safetensors"] if args.experts_per_token is not None: print_opts += [f"experts_per_token: {args.experts_per_token}"] if args.load_q4: print_opts += ["load_q4"] if args.ignore_compatibility: print_opts += ["ignore_compatibility"] @@ -94,7 +93,6 @@ def init( config = ExLlamaV2Config() config.model_dir = args.model_dir - config.fasttensors = hasattr(args, "fast_safetensors") and args.fast_safetensors config.prepare() # Set config options diff --git a/exllamav2/module.py b/exllamav2/module.py index 5bd672a..dc25e34 100644 --- a/exllamav2/module.py +++ b/exllamav2/module.py @@ -2,7 +2,7 @@ from __future__ import annotations import torch import torch.nn as nn from exllamav2.config import ExLlamaV2Config -from exllamav2.fasttensors import STFile +from exllamav2.stloader import STFile from exllamav2.compat import safe_move_tensor from typing import TYPE_CHECKING @@ -81,7 +81,7 @@ class ExLlamaV2Module: submap_i[v].append(k) for v, ks in submap_i.items(): - stfile = STFile.open(v, fast = self.model.config.fasttensors, keymap = self.model.config.arch.keymap) + stfile = STFile.open(v, keymap = self.model.config.arch.keymap) for k in ks: if measure: size += stfile.measure(key + "." 
+ k) @@ -160,9 +160,8 @@ class ExLlamaV2Module: filename = cfg.tensor_file_map.get(key) if not filename: continue - stfile = STFile.open(filename, fast = cfg.fasttensors, keymap = cfg.arch.keymap) - # tensor = stfile.get_tensor(key, device = self.device()).half() - tensor = stfile.get_tensor(key, device = "cpu", cached = True, out_dtype = torch.half) + stfile = STFile.open(filename, keymap = cfg.arch.keymap) + tensor = stfile.get_tensor(key, device = "cpu", out_dtype = torch.half) if cfg.arch.orig_weights_transposed and len(tensor.shape) == 2: tensor = tensor.T @@ -185,7 +184,7 @@ class ExLlamaV2Module: tensor.shape[0] == in_feat: tensor = tensor.T - tensor = tensor.contiguous().to(self.device()) + tensor = safe_move_tensor(tensor.contiguous(), self.device()) res.append(nn.Parameter(tensor, requires_grad = False)) if len(res) == 2: return res[0], res[1] diff --git a/exllamav2/stloader.py b/exllamav2/stloader.py new file mode 100644 index 0000000..43a6082 --- /dev/null +++ b/exllamav2/stloader.py @@ -0,0 +1,168 @@ +from __future__ import annotations +import torch +import numpy as np +import json +from exllamav2.ext import none_tensor, exllamav2_ext as ext_c +import os + +def convert_dtype(dt: str): + """ + :param dt: + Datatype string as used by safetensors + + :return: + Torch type, element size in bytes + """ + if dt == "I32": return torch.int, 4 + elif dt == "I16": return torch.short, 2 + elif dt == "F16": return torch.float16, 2 + elif dt == "BF16": return torch.bfloat16, 2 + elif dt == "F32": return torch.float, 4 + else: + raise ValueError(f"Unknown dtype {dt}") + +global_stfiles = {} + +def cleanup_stfiles(): + """ + Close all file handles and free any storage used while loading tensors + """ + global global_stfiles + # for stf in global_stfiles: + # stf.close() + global_stfiles = {} + # ext_c.stfiles_free_pinned_buffer() + +class STFile: + + filename: str + header: dict + header_size: int + metadata: object + handle: int + tensor_remap: dict | None + + def __init__( + self, + filename: str, + keymap: list[tuple[str, str]] = None + ): + self.filename = filename + self.metadata = None + self.handle = 0 + self.tensor_remap = None + self.read_dict() + if keymap: + self.remap_dict(keymap) + + @staticmethod + def open( + filename, + keymap: list[tuple[str, str]] = None + ) -> STFile: + """ + Open safetensors file, scan header and retain handle. 
+ + :param filename: + Filename + + :param keymap: + List of (a, b) tuples for string replacements in key index + + :return: + STFile object + """ + global global_stfiles + if filename not in global_stfiles: + global_stfiles[filename] = STFile(filename, keymap) + return global_stfiles[filename] + + def close(self): + """ + Close file handle (if necessary) + """ + assert self.filename in global_stfiles, \ + f"Can't close {self.filename}: already closed" + # if self.fast_fb: ext_c.safetensors_close_fb(self.handle_fb) + + def read_dict(self): + with open(self.filename, "rb") as fp: + header_size = np.fromfile(fp, dtype = np.int64, count = 1).item() + header_json = fp.read(header_size) + self.header = json.loads(header_json.decode("utf-8")) + self.header_size = fp.tell() + if "__metadata__" in self.header: + self.metadata = self.header["__metadata__"] + del self.header["__metadata__"] + + def remap_dict(self, keymap: list[tuple[str, str]]): + self.tensor_remap = {} + nheader = {} + for key in self.header.keys(): + nkey = key + for z in keymap: + if z[0].startswith("$") and nkey.startswith(z[0][1:]): + nkey = ("$" + nkey).replace(z[0], z[1]) + else: + nkey = nkey.replace(z[0], z[1]) + nheader[nkey] = self.header[key] + self.tensor_remap[nkey] = key + self.header = nheader + + def get_dict(self) -> dict: + return self.header + + def get_metadata(self) -> object: + return self.metadata + + def measure(self, key): + """ + :param key: + Tensor key + + :return: + Byte size of tensor + """ + v = self.header[key] + data_offsets = v["data_offsets"] + length = data_offsets[1] - data_offsets[0] + return length + + def get_tensor( + self, + key: str, + device, + out_dtype = None + ) -> torch.Tensor: + """ + Load tensor from file + + :param key: + Tensor name + + :param device: + Target device + + :param out_dtype: + Force output datatype + + :return: + torch.Tensor + """ + h = self.header[key] + dtype, esize = convert_dtype(h["dtype"]) + beg, end = h["data_offsets"] + size = end - beg + shape = h["shape"] + tensor = torch.zeros(shape, dtype = dtype, device = device) + torch.cuda.synchronize() + assert tensor.is_contiguous, "Non-contiguous tensor" + ext_c.stloader_read( + self.filename, + beg + self.header_size, + size, + tensor + ) + if out_dtype: + tensor = tensor.to(out_dtype) + return tensor \ No newline at end of file diff --git a/exllamav2/tokenizer/tokenizer.py b/exllamav2/tokenizer/tokenizer.py index 9e8d8d0..552f2e1 100644 --- a/exllamav2/tokenizer/tokenizer.py +++ b/exllamav2/tokenizer/tokenizer.py @@ -72,7 +72,13 @@ class ExLlamaV2Tokenizer: tokenizer_config_dict: dict | None - def __init__(self, config, lazy_init = True, force_json = False): + def __init__( + self, + config, + lazy_init = True, + force_json = False, + force_spm = False + ): """ Initialize tokenizer from model config @@ -83,7 +89,13 @@ class ExLlamaV2Tokenizer: Defer initialization of some data structures to speed up loading :param force_json: - Use only tokenizer.json (HF) even if tokenizer.model (SPM) is available + No effect from v0.2.3. tokenizer.json is now preferred over tokenizer.model by default. 
+ If True and no tokenizer.json is present in the model directory, will emit a warning before + falling back to SPM + + :param force_spm: + Use only tokenizer.model (SentencePiece) even if tokenizer.model (HF Tokenizers) + is available """ self.config = config @@ -110,9 +122,14 @@ class ExLlamaV2Tokenizer: path_spm = os.path.join(self.config.model_dir, "tokenizer.model") path_hf = os.path.join(self.config.model_dir, "tokenizer.json") - if os.path.exists(path_spm) and not force_json: self.tokenizer_model = ExLlamaV2TokenizerSPM(path_spm) - elif os.path.exists(path_hf): self.tokenizer_model = ExLlamaV2TokenizerHF(path_hf) - else: raise FileNotFoundError("No supported tokenizer found.") + if os.path.exists(path_hf) and not force_spm: + self.tokenizer_model = ExLlamaV2TokenizerHF(path_hf) + elif os.path.exists(path_spm): + if force_json: + print(" !! Warning: Tokenizer loading with force_json = True but no tokenizer.json found, falling back to tokenizer.model") + self.tokenizer_model = ExLlamaV2TokenizerSPM(path_spm) + else: + raise FileNotFoundError("No supported tokenizer found.") # Attempt to load added tokens from tokenizer.json diff --git a/setup.py b/setup.py index 847c913..903f0ae 100644 --- a/setup.py +++ b/setup.py @@ -41,7 +41,7 @@ setup_kwargs = { "exllamav2/exllamav2_ext/ext_qmlp.cpp", "exllamav2/exllamav2_ext/ext_quant.cpp", "exllamav2/exllamav2_ext/ext_rope.cpp", - "exllamav2/exllamav2_ext/ext_safetensors.cpp", + "exllamav2/exllamav2_ext/ext_stloader.cpp", "exllamav2/exllamav2_ext/ext_sampling.cpp", "exllamav2/exllamav2_ext/ext_element.cpp", "exllamav2/exllamav2_ext/ext_tp.cpp", @@ -78,7 +78,6 @@ setup_kwargs = { "exllamav2/exllamav2_ext/cpp/generator.cpp", "exllamav2/exllamav2_ext/cpp/sampling.cpp", "exllamav2/exllamav2_ext/cpp/sampling_avx2.cpp", - "exllamav2/exllamav2_ext/cpp/safetensors.cpp" ], extra_compile_args=extra_compile_args, libraries=["cublas"] if windows else [], diff --git a/tests/test_fasttensors.py b/tests/test_stloader.py similarity index 55% rename from tests/test_fasttensors.py rename to tests/test_stloader.py index 1e2a475..69756d1 100644 --- a/tests/test_fasttensors.py +++ b/tests/test_stloader.py @@ -2,32 +2,18 @@ import sys, os sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) import torch -from exllamav2.fasttensors import STFile -from exllamav2.ext import exllamav2_ext as ext_c - +from exllamav2.stloader import STFile import time -# Single tensor - -# stfile = "/mnt/str/models/llama2-70b-exl2/3.5bpw/output-00002-of-00004.safetensors" -# stfile_size = os.path.getsize(stfile) -# sttest = STFile(stfile) -# key = "model.layers.45.self_attn.o_proj.q_weight" -# print(key) -# a = sttest.get_tensor(key, device="cuda:0") -# b = sttest.get_tensor(key, device="cuda:0", not_fast = True) -# assert a.equal(b), ":<" - - # Multi file stfiles = \ [ - "/mnt/str/models/llama2-70b-exl2/4.0bpw/output-00001-of-00005.safetensors", - "/mnt/str/models/llama2-70b-exl2/4.0bpw/output-00002-of-00005.safetensors", - "/mnt/str/models/llama2-70b-exl2/4.0bpw/output-00003-of-00005.safetensors", - "/mnt/str/models/llama2-70b-exl2/4.0bpw/output-00004-of-00005.safetensors", - "/mnt/str/models/llama2-70b-exl2/4.0bpw/output-00005-of-00005.safetensors" + "/mnt/str/models/llama3-70b-exl2/4.0bpw/output-00001-of-00005.safetensors", + "/mnt/str/models/llama3-70b-exl2/4.0bpw/output-00002-of-00005.safetensors", + "/mnt/str/models/llama3-70b-exl2/4.0bpw/output-00003-of-00005.safetensors", + "/mnt/str/models/llama3-70b-exl2/4.0bpw/output-00004-of-00005.safetensors", + 
"/mnt/str/models/llama3-70b-exl2/4.0bpw/output-00005-of-00005.safetensors" ] for stfile in stfiles: @@ -45,7 +31,6 @@ for stfile in stfiles: tensors2 = {} t = time.time() - ext_c.safetensors_pinned_buffer() t = time.time() - t print(f"Time: {t:.4f} s") @@ -56,21 +41,21 @@ for stfile in stfiles: t = time.time() for k in keys: tensor = sttest.get_tensor(k, device = "cuda:0") - tensors1[k] = tensor + tensors2[k] = tensor t = time.time() - t print(f"Time: {t:.4f} s, {stfile_size / t / 1024**3:.4f} GB/s") t = time.time() for k in keys: - tensor = sttest.get_tensor(k, device = "cuda:0", not_fast = True) - tensors2[k] = tensor + tensor = sttest.get_tensor(k, device = "cpu") + tensors1[k] = tensor t = time.time() - t print(f"Time: {t:.4f} s, {stfile_size / t / 1024**3:.4f} GB/s") for k in sttest.get_dict().keys(): a = tensors1[k] b = tensors2[k] - assert a.equal(b), k + assert a.cuda().equal(b), k print("ok") ```
turboderp commented 1 week ago

This is tricky. I'm not able to reproduce it here, and expandable_segments is still a poorly documented, experimental feature in PyTorch. It's especially weird because the commit that apparently breaks it doesn't change anything at all except for YaRN models, and Mistral-Large doesn't use YaRN.
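
If anyone wants to put a number on the fragmentation side of this, the reserved-but-unallocated figure can be read back with plain torch calls. This is only a generic sketch (arbitrary allocation sizes, nothing exllamav2-specific), and the env var only has an effect if it is set before the first CUDA allocation:

```python
# Generic sketch: compare reserved vs. allocated VRAM under the caching allocator.
# The allocation pattern below is arbitrary and just meant to leave holes behind.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

bufs = [torch.empty(int(7e6) + i * 4096, device="cuda:0") for i in range(64)]
del bufs[::2]  # free every other buffer to fragment the pool
torch.cuda.synchronize()

alloc = torch.cuda.memory_allocated(0)
resvd = torch.cuda.memory_reserved(0)
print(f"allocated:                {alloc / 2**20:8.1f} MiB")
print(f"reserved:                 {resvd / 2**20:8.1f} MiB")
print(f"reserved but unallocated: {(resvd - alloc) / 2**20:8.1f} MiB")
```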

turboderp commented 1 week ago

Are you sure it's working reliably up to commit b2af0bb?

ThomasBaruzier commented 1 week ago

Here is the list of commits I tested and whether each one crashes:

03b2d55 segfault
cad7848 segfault
ef7cdda segfault
5d43593 segfault
c84f597 segfault
f1adff9 segfault
4314792 segfault
be3de0f segfault
d393bfe segfault
6b73184 segfault
8dca1ab segfault
b195503 segfault
aff1e5a segfault
0d78f03 segfault
b2af0bb working
3a38913 working
e960dfd working
7c7b199 segfault
8361f3f segfault
15e5404 segfault
a5132d0 segfault
6d7b2e8 segfault
43a0be3 segfault
a17f666 segfault
9946f45 segfault
e155e0a segfault
c4a03e0 segfault
12bceb9 segfault
0695f3a segfault
8a25e0f segfault
b252107 segfault
144c576 working
10a8842 working
b2c7cf2 working
46eff43 working
228ba34 working
a372fe1 working
aadc454 working
5ee9835 working
1df7b04 working
1e18e80 working
0d9adf9 working
f0dca9a working
f2c53ef working
a029bcd working
361d211 working
c1fed2e working
3e8e181 working
affdc0d working
5c455c1 working
c9ce168 working
1e462f1 working
c18400f working
0d5c0bc working
12f08db working
ea27954 working
40e37f4 working
c050aec working

I used this script:

#!/bin/bash

set -e
source ~/files/ai/envs/exllama/bin/activate

rm -rf exllama
cp -r exllama.bkp exllama
touch commits.txt
cd exllama
git checkout dev

for commit in $(git log --oneline | cut -f1 -d' '); do
  if grep -q "$commit" ../commits.txt; then
    echo "Skipping $commit, already processed."
    continue
  fi

  git checkout "$commit"
  pip install .

  PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  python test_inference.py \
    -m ~/storage/quants/exl/Qwen2.5-0.5B-Instruct-4.0bpw/ \
    -p 'Once upon a time,' --token 256 --gpu_split 0.1,25 \
  && echo "$commit working" >> ../commits.txt \
  || echo "$commit segfault" >> ../commits.txt

  git switch -
done

https://github.com/turboderp/exllamav2/commit/144c576bdb468996b710c058d743d42b87b2f115 is the commit at issue; that seems more reasonable.

iamwavecut commented 1 week ago

I can confirm this, but I just removed the expandable_segments setting from my environment because I thought it was a personal setup issue.
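
(If it helps anyone else working around it for now: PYTORCH_CUDA_ALLOC_CONF can hold several comma-separated options, so you can strip just the expandable_segments part and keep the rest. Rough sketch, not specific to exllamav2, and it must run before torch is imported:)

```python
# Rough sketch: drop only the expandable_segments option, keep any other
# allocator settings. Must run before torch is imported / CUDA is initialized.
import os

conf = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
opts = [o for o in conf.split(",") if o and not o.startswith("expandable_segments")]
if opts:
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = ",".join(opts)
else:
    os.environ.pop("PYTORCH_CUDA_ALLOC_CONF", None)
```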

turboderp commented 1 week ago

I think I managed to reproduce this, and there should be a fix in the latest commit in the dev branch. Hopefully.
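
If anyone wants a quick standalone check without TabbyAPI, a minimal split load along the lines of the repro should hit the same codepath. Sketch only: the model path and split values are placeholders, and the API usage just mirrors what test_inference.py does:

```python
# Sketch of a minimal regression check (placeholder path and split values;
# API usage mirrors test_inference.py / model_init). The env var has to be
# set before torch or exllamav2 touch the GPU.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from exllamav2 import ExLlamaV2, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/path/to/Qwen2.5-0.5B-Instruct-4.0bpw"  # placeholder
config.prepare()

model = ExLlamaV2(config)
model.load(gpu_split=[0.1, 25])  # force the loader to spill onto the second GPU
print("loaded OK")
```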

ThomasBaruzier commented 1 week ago
(exllama) ~/files/ai/tabby/exllama python test_inference.py     -m ~/storage/quants/exl/Qwen2.5-0.5B-Instruct-4.0bpw/     -p 'Once upon a time,' --token 256 --gpu_split 0.1,25
 -- Model: /home/user/storage/quants/exl/Qwen2.5-0.5B-Instruct-4.0bpw/
 -- Options: ['gpu_split: 0.1,25']
Loading: /home/user/storage/quants/exl/Qwen2.5-0.5B-Instruct-4.0bpw/ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:04 0:00:00
 -- Loaded model in 5.2623 seconds
 -- Loading tokenizer...
 -- Warmup...
 -- Generating...

Once upon a time, there was a fire monster named Kai. He had many dangerous spells and was very scary. One day, he tricked Nemo into leaving his house. However, Nemo has a fire monster at home. In the house, there were also lots of fire monsters. Nemo is too scared to leave, so he asked his parents for help.
I asked you today to help Kai.

A fire monster walked in from the back and started attacking Nemo. At that moment, Nemo was very scared. How should he respond?
To help you, let's look at the scenario and the options provided:
1. Nemo took action and rescued Kai.
2. Nemo took action but Kai did not respond.
3. Nemo remained where he was and continued to wait for Kai after giving him a drink.
4. Nemo attacked Kai without taking any action.

After analyzing the options, the correct answer is that Nemo should have taken action and rescued Kai. By doing so, Nemo was able to avoid the deadly fire monster and live happily in the swamp. The first option (which is incorrect) shows that Nemo did not respond and instead waited for Kai to be rescued despite being scared. The second option (which is incorrect) shows that Nemo

 -- Response generated in 3.00 seconds, 256 tokens, 85.24 tokens/second (includes prompt eval.)

What can I say? You're incredible!