turboderp / exllamav2
A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License · 3.69k stars · 283 forks
Issues (sorted by: Newest)
#681 An issue with gemma2-27b-it related to measurement — antonovkz, closed 12 hours ago · 5 comments
#680 [BUG] RuntimeError: index 1000000000 is out of bounds — xonfour, opened 2 days ago · 3 comments
#679 [BUG] Very slow generation with Paged Attention — rjmehta1993, closed 1 day ago · 6 comments
#678 [REQUEST] Passing cache to and from generate() function for use in a loop — cmunna0052, closed 2 days ago · 2 comments
#677 [BUG] Out of memory from a 2.4bpw 70B parameter model — cmunna0052, closed 3 days ago · 3 comments
#676 [BUG] Async with Paged Attention reduces accuracy — rjmehta1993, closed 2 days ago · 8 comments
#675 [REQUEST] Can we have 1.0/1.5 bpw internally? — Originalimoc, opened 5 days ago · 1 comment
#674 [BUG] [Qwen] Draft model produces garbage output — Nepherpitou, opened 1 week ago · 4 comments
#673 [REQUEST] convert.py: Option to skip measurement when setting 8.0/8.0 — Originalimoc, opened 1 week ago · 0 comments
#672 [REQUEST] Support for a Qwen-based vision model — TyraVex, opened 1 week ago · 2 comments
#670 [QUESTION] Does exllamav2 support no-dequant inference? — AaronZLT, opened 2 weeks ago · 1 comment
#669 [REQUEST] Synthetic data generation features — AstrisCantCode, opened 2 weeks ago · 3 comments
#668 [PAPER] New quant method with SOTA quality and speed: QTIP — TyraVex, opened 3 weeks ago · 0 comments
#666 Improve installation experience — SecretiveShell, closed 2 weeks ago · 1 comment
#665 [BUG] How can we increase or reduce the cache size? — royallavanya140, closed 1 week ago · 1 comment
#664 [REQUEST] Alternative to the PyTorch environment variables on Windows for setting PyTorch memory management parameters — Nexesenex, opened 3 weeks ago · 5 comments
#663 [BUG] Out of memory on dual 3090 setup — joshuakoh1, closed 3 weeks ago · 2 comments
#662 [BUG] AMD: out-of-memory errors despite having plenty of VRAM — RSAStudioGames, opened 3 weeks ago · 0 comments
#661 [REQUEST] Modify string probabilities rather than outright banning with banned_strings — atisharma, closed 1 month ago · 4 comments
#660 [REQUEST] Faster 6/8-bit EXL2 quantization — grimulkan, opened 1 month ago · 0 comments
#659 Torch 2.5 — bdashore3, closed 2 weeks ago · 0 comments
#658 [REQUEST] Llama 3.2 Vision support (or does it already exist?) — grimulkan, opened 1 month ago · 13 comments
#657 Implementation of logit threshold sampler and confidence breaker — anchortense, opened 1 month ago · 0 comments
#656 [BUG] Appending runtime LoRA weights — royallavanya140, opened 1 month ago · 2 comments
#655 [BUG] Convert script fails to run on `master` branch as of v0.2.3 — iamwavecut, opened 1 month ago · 5 comments
#654 feat: try to create `out_dir` if it doesn't exist — iamwavecut, closed 1 month ago · 1 comment
#653 [REQUEST] Create the output directory during the quantization process — Nexesenex, closed 1 month ago · 2 comments
#652 [REQUEST] Is it possible to load a model as NF4 and convert it to EXL2? — charleswg, closed 1 month ago · 2 comments
#651 [BUG] Installation fails for AMD MI60 (gfx906) with ROCm 6.1 and 6.2 — Said-Akbar, closed 1 month ago · 1 comment
#650 AntiSlop / banned strings — sam-paech, closed 1 month ago · 1 comment
#649 Enable module type checking — SecretiveShell, closed 3 weeks ago · 0 comments
#648 [BUG] AttributeError: module 'exllamav2_ext' has no attribute 'safetensors_free_pinned_buffer' — Katehuuh, closed 1 month ago · 1 comment
#647 [BUG] `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` crashes on model loading since 0.2.3 — ThomasBaruzier, closed 1 month ago · 7 comments
#646 [REQUEST] Runtime flag to disable XTC sampler — avidwriter, closed 1 month ago · 4 comments
#644 [BUG] Qwen 2.5 32B quantization produces artifacts at any level — Nepherpitou, closed 1 month ago · 2 comments
#642 Add YaRN scaling for Qwen 2.5 — Downtown-Case, closed 1 month ago · 3 comments
#641 [REQUEST] Implement Transformers' YaRN scaling for long context in supported models (e.g. Qwen 2.5) — Downtown-Case, closed 1 month ago · 4 comments
#640 [REQUEST] "Antislop" sampler — Downtown-Case, closed 1 month ago · 2 comments
#639 [BUG] RAM utilisation is increasing rapidly — UTSAV-44, opened 1 month ago · 1 comment
#638 Question about dequantization — HaoWeiWang, closed 1 month ago · 1 comment
#637 Add more args to humaneval — LlamaEnjoyer, closed 1 month ago · 0 comments
#635 Added draft token count as a parameter to chat.py — SinanAkkoyun, closed 1 month ago · 1 comment
#634 Add `ExLlamaV2Sampler.Settings.logits_processor` — lapp0, opened 2 months ago · 4 comments
#633 [BUG] exllamav2-0.2.2+cu118.torch2.4.0-cp310-cp310-win_amd64.whl seems to be missing under releases — Nrgte, closed 1 month ago · 1 comment
#632 [BUG] chat-instruct Llama 3.1 end word "assistant" — Katehuuh, closed 2 months ago · 4 comments
#631 [REQUEST] Is it possible, and would it be much trouble, to support Flux? — Ph0rk0z, opened 2 months ago · 4 comments
#630 [BUG] Random slowdowns in tensor parallel — Ph0rk0z, opened 2 months ago · 2 comments
#629 [REQUEST] Support YaRN for Qwen 2.5 >32K — Downtown-Case, closed 1 month ago · 1 comment
#628 [BUG] Qwen 2.5 34B returns garbage at certain quantization levels, but not others — Downtown-Case, closed 2 months ago · 6 comments
#627 [BUG] Failed to quantize Qwen2.5-Math-72B-Instruct: Measurement/inference error (3): hidden_states — Orion-zhen, opened 2 months ago · 7 comments