Closed: vikotse closed this 2 weeks ago
If you don't have flash-attn installed, ExLlama will still work, falling back to xformers if available, otherwise PyTorch matmul attention.
The dynamic generator requires flash-attn 2.5.7+ to use paged attention, but there is a fallback mode you can use if you add `paged = False` when creating the generator. This only works at `max_batch_size = 1`.
I'm facing the same issue, using an Nvidia V100: `RuntimeError: FlashAttention only supports Ampere GPUs or newer.`
Please advise on how to resolve this issue:
```python
import sys, os

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer, Timer
from exllamav2.generator import ExLlamaV2DynamicGenerator

model_dir = "/dbfs/tmp/llama_3_8B_inst_exl2_6bpw"
config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len = 32768, lazy = True)
model.load_autosplit(cache, progress = True)

print("Loading tokenizer...")
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(
    model = model,
    cache = cache,
    tokenizer = tokenizer,
    paged = False,
    max_batch_size = 1,
)

max_new_tokens = 250

generator.warmup()

prompt = "Once upon a time,"

with Timer() as t_single:
    output = generator.generate(prompt = prompt, max_new_tokens = max_new_tokens, add_bos = True)

print("-----------------------------------------------------------------------------------")
print("- Single completion")
print("-----------------------------------------------------------------------------------")
print(output)
print()

prompts = [
    "Once upon a time,",
    "The secret to success is",
    "There's no such thing as",
    "Here's why you should adopt a cat:",
]

with Timer() as t_batched:
    outputs = generator.generate(prompt = prompts, max_new_tokens = max_new_tokens, add_bos = True)

for idx, output in enumerate(outputs):
    print("-----------------------------------------------------------------------------------")
    print(f"- Batched completion #{idx + 1}")
    print("-----------------------------------------------------------------------------------")
    print(output)
    print()

print("-----------------------------------------------------------------------------------")
print(f"speed, bsz 1: {max_new_tokens / t_single.interval:.2f} tokens/second")
print(f"speed, bsz {len(prompts)}: {max_new_tokens * len(prompts) / t_batched.interval:.2f} tokens/second")
```
@bjohn22 That error is generated by flash-attn itself, not by ExLlama, so I would assume you have flash-attn installed. ExLlama will attempt to use the library if it's present, whether or not you're using the unpaged fallback mode.
So I would just uninstall flash-attn if you can't use it anyway; then the fallback mode should work. Or set `config.no_flash_attn = True` to tell ExLlama to ignore it, before calling `model.load_autosplit()`.
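Applied to the reproduction script above, that means setting the flag after constructing the config but before loading the model. A minimal sketch (reusing the model path from the report; treat the exact attribute placement as described in this comment, not as a full working script):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

model_dir = "/dbfs/tmp/llama_3_8B_inst_exl2_6bpw"
config = ExLlamaV2Config(model_dir)
config.no_flash_attn = True  # ignore an installed flash-attn even if it's importable

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len = 32768, lazy = True)
model.load_autosplit(cache, progress = True)  # flag must already be set at this point
```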
If you have some GPUs that support it and some that don't, you could use the `CUDA_VISIBLE_DEVICES` environment variable to expose only the compatible GPUs to the process.
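For example, assuming the Ampere-or-newer card is device 0 (the index is an assumption; check `nvidia-smi` for your machine), you could launch the script like this:

```shell
# Expose only GPU 0 to the process; all other devices become invisible to CUDA
export CUDA_VISIBLE_DEVICES=0

# Any CUDA program started from this shell now sees a single GPU.
# Quick sanity check that the variable is set (prints "0"):
python -c "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"
```

CUDA renumbers the visible devices starting from 0, so the exposed GPU appears as `cuda:0` inside the process regardless of its physical index.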
Please reopen if that solution isn't sufficient.
How can I use the exl2 model without flash attention?