turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Can I not use flash attention? Because the model needs to be deployed to nvidia T4 #480

Closed vikotse closed 2 weeks ago

vikotse commented 4 weeks ago

How can I use the exl2 model without flash attention?

turboderp commented 4 weeks ago

If you don't have flash-attn installed, ExLlama will still work, falling back to xformers if available, or otherwise to PyTorch matmul attention.

The dynamic generator requires flash-attn 2.5.7+ to use paged attention, but there is a fallback mode you can use if you add paged = False when creating the generator. This only works at max_batch_size = 1.
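
For reference, a minimal sketch of that fallback path (the model directory is a placeholder, and the load/setup calls follow the standard example scripts):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Placeholder path to a local exl2-quantized model
config = ExLlamaV2Config("/path/to/exl2/model")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

# Without flash-attn 2.5.7+, disable paged attention; this limits the generator to batch size 1
generator = ExLlamaV2DynamicGenerator(
    model = model,
    cache = cache,
    tokenizer = tokenizer,
    paged = False,
    max_batch_size = 1,
)
```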

bjohn22 commented 3 weeks ago

I'm facing the same issue, using an Nvidia V100: `RuntimeError: FlashAttention only supports Ampere GPUs or newer.` Please advise on how to resolve this issue:

```python
import sys, os
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer, Timer
from exllamav2.generator import ExLlamaV2DynamicGenerator

model_dir = "/dbfs/tmp/llama_3_8B_inst_exl2_6bpw"
config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len = 32768, lazy = True)
model.load_autosplit(cache, progress = True)

print("Loading tokenizer...")
tokenizer = ExLlamaV2Tokenizer(config)

# Initialize the generator with all default parameters
generator = ExLlamaV2DynamicGenerator(
    model = model,
    cache = cache,
    tokenizer = tokenizer,
    paged = False,
    max_batch_size = 1,
)

max_new_tokens = 250

# Warmup generator. The function runs a small completion job to allow all the kernels to fully
# initialize and autotune before we do any timing measurements. It can be a little slow for
# larger models and is not needed to produce correct output.
generator.warmup()

# Generate one completion, using default settings
prompt = "Once upon a time,"

with Timer() as t_single:
    output = generator.generate(prompt = prompt, max_new_tokens = max_new_tokens, add_bos = True)

print("-----------------------------------------------------------------------------------")
print("- Single completion")
print("-----------------------------------------------------------------------------------")
print(output)
print()

# Do a batched generation
prompts = [
    "Once upon a time,",
    "The secret to success is",
    "There's no such thing as",
    "Here's why you should adopt a cat:",
]

with Timer() as t_batched:
    outputs = generator.generate(prompt = prompts, max_new_tokens = max_new_tokens, add_bos = True)

for idx, output in enumerate(outputs):
    print("-----------------------------------------------------------------------------------")
    print(f"- Batched completion #{idx + 1}")
    print("-----------------------------------------------------------------------------------")
    print(output)
    print()

print("-----------------------------------------------------------------------------------")
print(f"speed, bsz 1: {max_new_tokens / t_single.interval:.2f} tokens/second")
print(f"speed, bsz {len(prompts)}: {max_new_tokens * len(prompts) / t_batched.interval:.2f} tokens/second")
```

turboderp commented 3 weeks ago

@bjohn22 That error is generated by flash-attn itself, not by ExLlama, so I would assume you have flash-attn installed. ExLlama will attempt to use the library if it's present, whether or not you're using the unpaged fallback mode.

So I would just uninstall flash-attn if you can't use it anyway; then the fallback mode should work. Alternatively, set config.no_flash_attn = True before model.load_autosplit() to tell ExLlama to ignore it.
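
For example (a sketch based on the script above, reusing its paths and load calls):

```python
config = ExLlamaV2Config("/dbfs/tmp/llama_3_8B_inst_exl2_6bpw")
config.no_flash_attn = True   # ignore an installed flash-attn the GPU can't use

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len = 32768, lazy = True)
model.load_autosplit(cache, progress = True)
```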

If you have some GPUs that support it and some that don't, you could use the CUDA_VISIBLE_DEVICES env variable to only expose the compatible GPUs to the process.
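
For instance, the variable can be set in the shell or from Python before anything initializes CUDA (a sketch; the device index 0 is an assumption about which GPU is compatible):

```python
import os

# Expose only the assumed-compatible GPU (index 0) to this process.
# This must happen before torch/exllamav2 create a CUDA context;
# equivalently, set CUDA_VISIBLE_DEVICES=0 in the shell before launching the script.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from exllamav2 import ExLlamaV2, ExLlamaV2Config  # imported after the variable is set
```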

turboderp commented 2 weeks ago

Please reopen if that solution isn't sufficient.