turboderp / exllamav2
A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License · 3.56k stars · 273 forks

Issues
#252  Using HF Safetensors (cdreetz, closed 9 months ago, 7 comments)
#251  Add openchat prompt format (eramax, closed 9 months ago, 0 comments)
#250  How to implement the backend for dynamic batching? (tanklandry, closed 3 months ago, 1 comment)
#249  Can load GPTQ models fine, but get a traceback when running inference on them (userbox020, closed 3 months ago, 1 comment)
#248  Any examples of long inputs on a RoPE-scaled model? (sreeprasannar, closed 3 months ago, 1 comment)
#246  Quantizing goliath120b @ 3bpw: calibration perplexity (quant): 2745.1239 (alexconstant9108, closed 9 months ago, 8 comments)
#245  Mistral fails/garbage at context > 8192, transformers works fine (matatonic, closed 9 months ago, 2 comments)
#244  Feature request: EAGLE (vt404v2, closed 6 months ago, 1 comment)
#243  Batched flash attention (fahadh4ilyas, closed 2 months ago, 3 comments)
#242  example_batchprocessing.py (Kerushii, closed 7 months ago, 0 comments)
#241  feat: frequency and presence penalty (AlpinDale, closed 9 months ago, 0 comments)
#240  Batched generation with flash attention (fahadh4ilyas, closed 9 months ago, 2 comments)
#239  feat: add top-A sampling (AlpinDale, closed 9 months ago, 0 comments)
#238  Return token probabilities in generator.stream() (ivsanro1, closed 8 months ago, 3 comments)
#237  Add flash attention support for batches with different sequence lengths (fahadh4ilyas, closed 9 months ago, 1 comment)
#236  What is "an improved exllamav2 quant method"? (yamosin, closed 9 months ago, 2 comments)
#235  Attempted to quant a custom MoE model, Plap-8x13b, and get an error (NiriProject, closed 3 months ago, 1 comment)
#234  TypeError: ExLlamaV2Tokenizer.encode() got an unexpected keyword argument 'return_offsets' (Rajmehta123, closed 9 months ago, 1 comment)
#233  [ROCM] [GFX1030] no output (IMbackK, closed 3 months ago, 2 comments)
#232  Batched generation returns different answers even with the same input (fahadh4ilyas, closed 3 months ago, 15 comments)
#231  Fix encoder in MMLU benchmark (dvdtoth, closed 10 months ago, 0 comments)
#230  How to use gpu_split in the inference.py example (irthomasthomas, closed 3 months ago, 1 comment)
#229  Installation instructions "Method 1" dysfunctional (takosalad, closed 2 weeks ago, 2 comments)
#228  Merge experimental (turboderp, closed 10 months ago, 0 comments)
#227  Merge changes from master (turboderp, closed 10 months ago, 0 comments)
#226  cache.clone() is not creating a copy of the cache (hidoba, closed 10 months ago, 1 comment)
#225  CPU offloading (bibidentuhanoi, closed 10 months ago, 2 comments)
#224  Stop conditions and exclude prompt for Base generator (SinanAkkoyun, closed 2 months ago, 2 comments)
#223  Mixtral (nivibilla, closed 10 months ago, 13 comments)
#222  Some GPTQ models can not be loaded anymore (sammyf, closed 10 months ago, 3 comments)
#221  Support DragonFox style "BaNnbAnN" (Kerushii, closed 10 months ago, 1 comment)
#220  Is seed actually used? (richardburleigh, closed 10 months ago, 1 comment)
#219  DeepSeek: ValueError: bytes must be in range(0, 256) (SinanAkkoyun, closed 9 months ago, 2 comments)
#218  Fixed multi file and wildcard args (SinanAkkoyun, closed 9 months ago, 13 comments)
#217  Add QuiP quant support (waters222, opened 10 months ago, 3 comments)
#216  About YiTokenizer errors (redwoodzero0, closed 9 months ago, 2 comments)
#215  ExLlamaV2Cache_8bit does not work with multiple_caches.py example (lopuhin, closed 10 months ago, 2 comments)
#213  Error quantizing models on recent commit (brucethemoose, closed 10 months ago, 4 comments)
#212  How to clear/reset the cache so the model doesn't remember the earlier response? (Rajmehta123, closed 10 months ago, 2 comments)
#211  (Oobabooga) Can't load GPTQ models anymore with ExLlama-V2 0.0.10 (Daviljoe193, closed 10 months ago, 2 comments)
#209  Allow padding data instead of concatenating when generating calibration dataset (ivsanro1, closed 2 weeks ago, 1 comment)
#206  Adding return_lowest_perplexity (ziadloo, opened 10 months ago, 0 comments)
#204  Added draft model rope scale to chat example (SinanAkkoyun, closed 10 months ago, 0 comments)
#202  Difference between gemm_half_q_half_gptq_kernel and gemm_half_q_half_kernel (frankxyy, closed 10 months ago, 0 comments)
#201  Quantization error "Warning: Applied additional damping" and "Hessian" (yamosin, closed 10 months ago, 2 comments)
#200  When generating a batch of different prompt sizes, the shorter prompts tend to suffer (ziadloo, closed 3 months ago, 7 comments)
#199  Flash attention does nothing (Tedy50, closed 5 months ago, 5 comments)
#198  Support for no_repeat_ngram_size (anujnayyar1, closed 2 weeks ago, 1 comment)
#196  Support GPT2 tokenizer for CausalLM 72b (CyberTimon, closed 10 months ago, 6 comments)
#195  Fix Unicode errors when loading files (bdashore3, closed 10 months ago, 0 comments)