sampbarrow opened this issue 1 year ago
ExLlama pre-allocates the whole context, so it uses the same amount of VRAM (roughly) no matter how long your context is. Setting the max sequence length to something really short, like with -l 100, could maybe help rule out that it's an OoM error.
I mean, it shouldn't be that, especially with 70b which doesn't use a lot of memory for context thanks to GQA, but rank 64 is still high for a LoRA and when you're targeting all layers I'm guessing adapter_model.bin is a pretty big file. So maybe you are just running out of memory?
Thanks for the response.
ExLlama pre-allocates the whole context, so it uses the same amount of VRAM (roughly) no matter how long your context is. Setting the max sequence length to something really short, like with -l 100, could maybe help rule out that it's an OoM error.
Is this the --max_seq_len argument to text-generation-webui? I tried that, but the UI for the model still shows 2048 and it can't go lower. I'll just modify the code to lower that (if we're talking about the same thing) or just use exllama via the CLI. Will report back on that.
I mean, it shouldn't be that, especially with 70b which doesn't use a lot of memory for context thanks to GQA, but rank 64 is still high for a LoRA and when you're targeting all layers I'm guessing adapter_model.bin is a pretty big file. So maybe you are just running out of memory?
Here are the file sizes (in mb):
That is pretty big. You're already bordering on 40 GB for the model + LoRA. Add a gigabyte or two for Torch, and even with GQA there isn't much left for 80 layers of K/V cache.
Not that it's relevant to any bugs in ExLlama, but I'd question if that's a sane size for a LoRA in the first place. I mean, it's got 3 billion parameters (!).
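For a rough sense of why 40 GB is tight, here's some napkin math (a sketch only, assuming Llama-2-70B's published dimensions, 4-bit GPTQ weights, and an fp16 K/V cache; the overhead figures are guesses, and the LoRA comes on top of this):

# Rough VRAM budget for 70B GPTQ at full context (sketch, not a measurement).
# Llama-2-70B: 80 layers, 64 query heads, 8 K/V heads (GQA), head dim 128.
GiB = 1024**3

weights_4bit = 70e9 * 0.5                              # ~4 bits/weight, plus a bit for scales/zeros
layers, kv_heads, head_dim, ctx = 80, 8, 128, 4096
kv_cache = layers * 2 * kv_heads * head_dim * ctx * 2  # K and V, fp16 (2 bytes each)

print(f"weights  ~{weights_4bit / GiB:.1f} GiB")       # ~32.6 GiB
print(f"KV cache ~{kv_cache / GiB:.2f} GiB")           # ~1.25 GiB at 4096 ctx, thanks to GQA
# Add a GiB or two of Torch/CUDA overhead and activation buffers, plus the LoRA tensors,
# and a ~40 GiB budget gets uncomfortably close to full.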
Same error with -l 100
(exllama) root@bb9b8f1170dd:/workspace/exllama# python example_chatbot.py -l 100 -ld ../text-generation-webui/loras/checkpoint-90/ -d ../text-generation-webui/models/TheBloke_Llama-2-70B-GPTQ/
-- Sequence length: 100
-- Temperature: 0.95
-- Top-K: 20
-- Top-P: 0.65
-- Min-P: 0.00
-- Repetition penalty: 1.15
-- Beams: 1 x 1
-- Tokenizer: ../text-generation-webui/models/TheBloke_Llama-2-70B-GPTQ/tokenizer.model
-- Model config: ../text-generation-webui/models/TheBloke_Llama-2-70B-GPTQ/config.json
-- Model: ../text-generation-webui/models/TheBloke_Llama-2-70B-GPTQ/gptq_model-4bit--1g.safetensors
-- Sequence length: 100
-- Tuning:
-- --matmul_recons_thd: 8
-- --fused_mlp_thd: 2
-- --sdp_thd: 8
-- Options: []
-- Groupsize (inferred): None
-- Act-order (inferred): no
!! Model has empty group index (discarded)
-- LoRA config: ../text-generation-webui/loras/checkpoint-90/adapter_config.json
-- Loading LoRA: ../text-generation-webui/loras/checkpoint-90/adapter_model.bin
Chatbort: Hello, User
User: hello
Chatbort:Traceback (most recent call last):
File "/workspace/exllama/example_chatbot.py", line 199, in <module>
gen_token = generator.beam_search()
^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/exllama/generator.py", line 487, in beam_search
if self.settings.beams == 1 and self.settings.beam_length == 1: return self.gen_single_token()
^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/exllama/generator.py", line 326, in gen_single_token
logits = self.model.forward(self.sequence[:, -1:], self.cache, lora = self.lora)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/exllama/model.py", line 924, in forward
r = self._forward(input_ids[:, chunk_begin : chunk_end],
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/exllama/model.py", line 1005, in _forward
hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/exllama/model.py", line 495, in forward
self.self_attn.fused(hidden_states, cache, buffer, self.input_layernorm, lora)
File "/workspace/exllama/model.py", line 375, in fused
key_states = self.repeat_kv(key_states, self.config.num_key_value_groups)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/exllama/model.py", line 312, in repeat_kv
return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I completely forgot to mention: I implemented your suggestion from this post, since this was trained with bf16. Possibly related?
https://github.com/turboderp/exllama/issues/170#issuecomment-1643668780
Not that it's relevant to any bugs in ExLlama, but I'd question if that's a sane size for a LoRA in the first place. I mean, it's got 3 billion parameters (!).
Honestly I have no idea what I'm doing, I just followed the qlora repo / guanaco defaults. Unfortunately I just spent a bunch of time training this so I'm hoping I don't have to start over. Worst case scenario I can merge them all into the base model but I was hoping to test some of these checkpoints individually before paying for all that disk space on this cloud server.
It still seems strange that exllama would OOM at 100 tokens of context when I can run transformers at 3000-4000, though. Maybe I'll try a server with more VRAM. I'd just hate to move 200gb of files to find out the issue wasn't actually OOM related.
Hmm, I remember doing some napkin math when someone asked if 70B would fit in 40GB, and my estimate was that it would probably just squeeze into 40GB (single-card) at not-quite-full context, and probably wouldn't fit on two cards of exactly 40GB (e.g. 24+16) with the overhead of an extra card factored in. So running 70B at all is a really tight squeeze already.
My understanding is that ExLlama will keep a loaded LoRA in VRAM separately from the base model weights, and the LoRA weights are read as needed when the generator is triggered, which allows you to swap out LoRAs as often as you'd like without having to reload the entire model. I haven't looked, but Transformers might just plaster the LoRA weights on top of the model weights in VRAM, which would leave that memory open for context instead.
Can't really think of a solution that isn't annoying; you'd either want a little more VRAM (like literally 2GB more), or a smaller LoRA, or the LoRA premerged onto the 70B weights, or a way to irreversibly merge the weights in memory with ExLlama.
Is the 70B GPTQ quant you're using a group-sized one? Going to an ungrouped model from a 128 group-size one would save around 1.4GB. If it's already an ungrouped quant then that's already as small as ExLlama currently supports, though.
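To make the separate-vs-merged distinction above concrete, here's the general LoRA math in a toy sketch (plain PyTorch, not ExLlama's actual kernels; dimensions and scales are illustrative):

# Sketch: the two ways a LoRA can be applied (generic math, not ExLlama's implementation).
import torch

d_in, d_out, r, alpha = 8192, 8192, 64, 16
W = torch.randn(d_out, d_in) / d_in**0.5   # frozen base weight (quantized in practice)
A = torch.randn(r, d_in) * 0.01            # LoRA down-projection
B = torch.randn(d_out, r) * 0.01           # LoRA up-projection
x = torch.randn(1, d_in)

# Kept separate (how the thread describes ExLlama): extra matmuls on every forward pass,
# but the adapter can be swapped out without touching the base weights.
y_separate = x @ W.T + (alpha / r) * (x @ A.T) @ B.T

# Merged (what you get after folding the adapter into the model): one matmul, no per-token
# overhead, but the base weights are permanently altered.
W_merged = W + (alpha / r) * (B @ A)
y_merged = x @ W_merged.T

print((y_separate - y_merged).abs().max())  # ~0, up to float accumulation error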
Edit: disregard the below; I'm actually still getting the same error on inference after loading.
Well, I went ahead and loaded it on an A100 just to see, and now I have the same problem with exllama that I have with transformers, where the lora will sit there for 15-45 minutes or so when loading. So I guess it was a memory issue, but there's something else going on with these files that I really don't understand. They're a bit big, but I've loaded larger ones in 10 seconds on much weaker GPUs, so I have no idea what's going on at this point.
I would just merge them, but my understanding is that in order to do that I'll need an even bigger server, because bnb won't allow you to merge/save if you load in 4bit.
Is that an A100 40GB or 80GB? I think you can probably safely rule out OOMs if it's 80GB.
Ah wait I misunderstood, never mind.
Also, yeah, merging a LoRA is a bit of a pain, since afaik you need to merge the weights onto the full-sized fp16 model, then save it, then run the merged model through GPTQ-for-LLaMA/AutoGPTQ so ExLlama can load it, and that all takes a lot of disk space and patience for something as large as 70B.
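For anyone who does go down the merge road, the first step looks roughly like this (a sketch with PEFT; paths are placeholders, and loading 70B in fp16 needs on the order of 140 GB of system RAM or offloading, which is exactly the pain being described):

# Sketch only: merge a LoRA adapter into the full-precision base model with PEFT, then save
# the merged model so it can be run through GPTQ-for-LLaMA/AutoGPTQ afterwards.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",            # full (unquantized) base weights
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
model = PeftModel.from_pretrained(base, "loras/checkpoint-90")  # adapter directory (placeholder)
merged = model.merge_and_unload()           # folds the low-rank deltas into the base weights
merged.save_pretrained("llama-2-70b-merged")  # then quantize this output with AutoGPTQ etc.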
Is it possible they're taking so long to load because of the datatype? If Torch doesn't have an efficient bfloat16->float16 function, it might end up in some super-inefficient fallback routine. Maybe try replacing the tensor = tensor.to(torch.float16) with tensor = zeros_like(tensor, dtype = torch.float16). It obviously won't work but might reveal why loading takes so long.
If that is why it's slow, it might be only be slow because the tensor is still in system RAM at that point. Doing the conversion after moving the tensor to the target device might enable some faster CUDA code for the conversion.
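A quick way to test that theory in isolation (just a sketch using a stand-in tensor rather than the actual LoRA weights; requires a CUDA device):

# Sketch: time a bfloat16 -> float16 conversion in system RAM versus after moving to the GPU.
# If the CPU-side cast is the bottleneck, converting after .to("cuda") should be much faster.
import time
import torch

t = torch.randn(8192, 8192, dtype=torch.bfloat16)  # stand-in for one large LoRA tensor

start = time.time()
cpu_half = t.to(torch.float16)                     # cast while still in system RAM
print("convert on CPU :", time.time() - start)

start = time.time()
gpu_half = t.to("cuda").to(torch.float16)          # move first, cast with CUDA kernels
torch.cuda.synchronize()
print("convert on GPU :", time.time() - start)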
@EyeDeck
Is that an A100 40GB or 80GB? I think you can probably safely rule out OOMs if it's 80GB.
Oh yeah sorry it's 80, I was on 48 before so I only set this server up to see if more VRAM would do it.
@turboderp
The weird thing is when I ran exllama from the CLI on the A6000 the lora loaded in a few seconds, but then of course I got that error on inference. On the A100 the lora was slow to load but much faster than it was with transformers, maybe 5-10 mins vs up to an hour.
Anyway, I just tested again, A100 80GB, same illegal memory error, so I don't think it's OOM that's causing that issue at least.
I might just retrain in fp16, but I was under the impression that transformers supported bf16 in at least a somewhat optimized way. I couldn't find anyone else complaining about loras taking this long to load, which I'd think someone would report if it were due to the bf16, because it's pretty unusable.
Also, I just tried tensor = zeros_like(tensor, dtype = torch.float16) and it just hangs like before, we'll see how long it takes (edited - I originally used numpy.zeros_like, which produced an error, but then I realized you probably meant torch).
When I load a lora for LLaMA 2 70B, I get the same error too (RuntimeError: CUDA error: an illegal memory access was encountered), but I can use a lora with no issue with the 13B model. I'm using 2x RTX A5000 (48GB VRAM total), and I tested on an A10 48G and got the same error.
I can test with A100 80G but I need to wait for availability.
I was on 48 before
Ah whoops, I apologize, I saw A6000 + turboderp's comment about 40GB and forgot that's the one that's basically a 3090 Ti except with 48GB of GDDR6 (non-X). It's hard to keep all of Nvidia's card numbering schemes straight... In that case, I would expect 70B to fit, no problem, even with a large LoRA. Or at least, that's definitely not an unreasonable expectation (while loading 70B in 40GB at all would be sketchy).
Well, if it works with 7b and 13b it's most likely related to GQA. Everything up until that 70b release has assumed that the number of heads is the same for all of the attention projection layers. Does it still crash with --no_fused_attn?
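For reference, this is roughly what the repeat_kv call in the traceback is doing (a sketch modeled on the common Hugging Face implementation, not ExLlama's exact code): with GQA, 70B has 64 query heads but only 8 K/V heads, so the K/V tensors have to be expanded 8x, whereas 7B/13B have matching head counts and never hit this path.

# Sketch of repeat_kv for grouped-query attention (GQA); n_rep == 1 for 7B/13B, 8 for 70B.
import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    expanded = hidden_states[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, seq_len, head_dim)
    return expanded.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)

k = torch.randn(1, 8, 100, 128)   # 8 K/V heads, 100 tokens, head dim 128
print(repeat_kv(k, 8).shape)      # torch.Size([1, 64, 100, 128]), matching the 64 query heads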
I thought a GPTQ model could only use a lora file trained in GPTQ mode, for example with alpaca_lora_4bit or AutoGPTQ...
From what you're saying, is it also possible to use qlora, or to just load the original model with bnb 4bit and peft? Will that also work? Can I directly use the weights generated by qlora?
Well, if it works with 7b and 13b it's most likely related to GQA. Everything up until that 70b release has assumed that the number of heads is the same for all of the attention projection layers. Does it still crash with --no_fused_attn?
It seems to work fine when using no_fused_attn, but something is wrong with my training script, so I'm not sure my lora file is correct. I will test it more tomorrow.
Can I directly use the weights generated by qlora?
If the weights are saved in float16, then yes, it doesn't have to match the model. And it should be possible to convert bfloat16 to float16 on the fly; I just haven't enabled that by default, since it's a very recent thing that people started saving LoRAs in bfloat16 and I haven't had an opportunity to test the conversion yet. It's quite a shift in accuracy, so it's not a given that results are going to be good.
Overall, though, the LoRA is just a bunch of low-rank matrices that are applied in parallel to the linear layers of the model. They could technically be in any format: GPTQ, float, bfloat, GGML, whatever, as long as the matmul functions are there to support the datatype. Currently ExLlama only supports float16 for the LoRA weights, though it will automatically convert float32. You could add other types to automatically convert by modifying lora.py:
if tensor.dtype == torch.float16:
    pass
elif tensor.dtype == torch.float32:
    tensor = tensor.to(torch.float16)
elif tensor.dtype == torch.bfloat16:
    tensor = tensor.to(torch.float16)
else: raise ValueError(f" ## Error: unsupported tensor dtype in {self.lora_path}")
Or just replace the whole thing with tensor = tensor.to(torch.float16) if you want to support all datatypes, though that of course has the potential to give poor results if too much precision is lost in the conversion.
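In other words, the blanket version could be wrapped up like this (a sketch; to_half is a hypothetical helper, not a function that exists in lora.py):

import torch

def to_half(tensor: torch.Tensor, lora_path: str) -> torch.Tensor:
    """Convert any floating-point LoRA tensor to float16; reject non-float dtypes outright."""
    if tensor.dtype == torch.float16:
        return tensor
    if not tensor.dtype.is_floating_point:
        raise ValueError(f" ## Error: unsupported tensor dtype in {lora_path}")
    # Note: bfloat16 -> float16 trades exponent range for mantissa bits, so quality may suffer.
    return tensor.to(torch.float16)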
Well, if it works with 7b and 13b it's most likely related to GQA. Everything up until that 70b release has assumed that the number of heads is the same for all of the attention projection layers. Does it still crash with --no_fused_attn?
Thanks, disabling fused attention fixed my issue!
Well, it's not exactly a fix, cause it should really work with fused attn, but I'll get to that. What I need though is an example 70b LoRA I can test on.
Well, it's not exactly a fix, cause it should really work with fused attn, but I'll get to that. What I need though is an example 70b LoRA I can test on.
adapter_config.zip (a very, very small lora for testing)
I get "Unsupported tensor Dtype" when loading the guanaco lora above and llama-70b using exllama_hf. I will try without fused attention.
Unsupported tensor Dtype
Have you updated ExLlama to the latest version? I only added bfloat16 very recently, probably hasn't made it into the library yet.
I just saw... I loaded it afterwards and got the CUDA assert. Then I turned off fused attention and it loads, but the generation feels really slow.
ok.. nvm.. I think it was just my p2p link to my server lagging the fugg out. Gets 7 it/s with the lora loaded.
A LoRA does add some overhead, especially when it's targeting all layers with rank-64 adapters.
I really would caution everyone training these adapters not to crank up the rank thinking more is automatically better. At this scale you've essentially got an 800-million-parameter model to train, and you can just not do that in a few hours on a couple of A100s, especially when both the forward and backward passes also have to run through the frozen weights of a 70-billion-parameter model.
The original Alpaca LoRA used rank 8 on just two layers. That's why it converges so quickly.
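For anyone wondering where that parameter count comes from, here's the arithmetic (a sketch using Llama-2-70B's published dimensions and the rank-64, all-modules LoRA config from earlier in the thread):

# Back-of-envelope parameter count for a rank-64 LoRA targeting all seven projections
# of every layer in Llama-2-70B (80 layers, hidden 8192, intermediate 28672, 8 K/V heads).
hidden, inter, kv_dim, layers, r = 8192, 28672, 8 * 128, 80, 64

shapes = {
    "q_proj":    (hidden, hidden),
    "k_proj":    (hidden, kv_dim),   # GQA: K/V projections are only 1024 wide
    "v_proj":    (hidden, kv_dim),
    "o_proj":    (hidden, hidden),
    "gate_proj": (hidden, inter),
    "up_proj":   (hidden, inter),
    "down_proj": (inter, hidden),
}
# Each LoRA pair adds A (in_features x r) plus B (r x out_features) parameters.
per_layer = sum(r * (i + o) for i, o in shapes.values())
total = per_layer * layers
print(f"{total / 1e6:.0f}M LoRA parameters")  # ~828M: ~1.7 GB in fp16, ~3.3 GB if saved as fp32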
It's choking at 3k context, down to 3-4 it/s even. The merged copy shouldn't have this problem, but it's 128g with no act order and I wanted to try the original first.
I trained at 128 and 256 ranks on 50k items and it wasn't bad, but that was on a 13b. Maybe it's better to train a reasonable rank against all four of the attention projection layers and not literally everything. The OG lora targets just 2.
I really would caution everyone training these adapters not to crank up the rank thinking more is automatically better. At this scale you've essentially got an 800-million-parameter model to train, and you can just not do that in a few hours on a couple of A100s, especially when both the forward and backward passes also have to run through the frozen weights of a 70-billion-parameter model.
It's worth noting that the reason most of the QLoRA adapters are rank-64 is that it's the default rank in the official QLoRA repo. That's why the Guanaco adapter is like that, for instance; it was deliberately created using the official script as-is, for authenticity with the original training.
It's choking at 3k context, down to 3-4 it/s even. The merged copy shouldn't have this problem, but it's 128g with no act order and I wanted to try the original first.
Actually Guanaco is available in many GPTQ versions, including 32g with act order as TheBloke recently started doing multiple GPTQs per model. Storing each version in a different branch.
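If it helps, a specific GPTQ variant can be pulled directly from one of those branches with huggingface_hub (a sketch; the repo id is the one used earlier in this thread, and the branch/revision name is only an example, so check the model card for the real names):

# Sketch: download one GPTQ variant, stored as a git branch ("revision"), from the Hub.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="TheBloke/Llama-2-70B-GPTQ",      # repo referenced earlier in this thread
    revision="gptq-4bit-32g-actorder_True",   # example branch name; check the model card
    local_dir="models/Llama-2-70B-GPTQ-32g",
)
print(path)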
Hi, may I ask a slightly off-topic question?
Has anyone here compared alpaca_lora_4bit/autogptq and qlora?
I remember qlora being half as fast as alpaca_lora_4bit a month ago.
Why does it seem most of you are using qlora? @EliEron @sampbarrow
I thought it might work better since it trains on all layers so I gave it a shot. The results are very good but I haven't done a 1:1 comparison. My alpaca_lora_4bit trainings were on 33b.
With alpaca_lora_4bit, setting target_modules = ["q_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"] and handling the error in model.py lets you train all layers as well. With the latest branch, using finetune.py to train 70B, inference is broken while the training works.
I also tried autogptq before; it is slower than alpaca_lora_4bit, but it can fully update the model rather than just a lora. And with gradient_checkpointing = True, at 2048 ctx with 65B GPTQ, the memory cost is almost the same as lora training (even lower when r is big). I know a lora uses low-rank matrices and much less VRAM than a full update, but gradient checkpointing seems to work better.
Then there's qlora. I guess there is nothing self-implemented in qlora? Like, it uses bnb 4bit and peft to load the model and then calls the train() function provided by transformers, which is the slowest but most stable approach because peft and transformers are maintained by HF. Thanks for the explanation by @turboderp. Now I know that qlora adapters also work with GPTQ models.
Is my understanding of these projects correct?
What's driving me crazy now is that there is no 'best' solution. I don't know which one I should keep using...
alpaca_lora_4bit is the fastest but not maintained in a timely manner. autogptq is slower but supports more training methods, and is maintained even less regularly. qlora is the slowest but has nothing self-implemented or hacky, so it may get more timely maintenance. It may also have the lowest accuracy when combined with the GPTQ model. And qlora needs the original model file.
Sorry again for the off-topic question.
Qlora has mildly better perplexity and that probably carries over to training. But as you say, the same modules can be targeted and trained faster. You still sort of need the full weights to merge so there is no way to escape it. Qlora is simply hyped up and easier to use.
Actually Guanaco is available in many GPTQ versions, including 32g with act order as TheBloke recently started doing multiple GPTQs per model. Storing each version in a different branch.
Did not see that when I downloaded it. I will look for a 128g act order version. I saw some censorship when using the lora vs the merged model for some reason. I'm not sure how the merged never gave disclaimers while the lora did; they should be identical.
For those who struggle with this error in text-generation-webui and couldn't figure out from this thread how to switch off fused_attn (like me): switch to the ExLlama_HF model loader and uncomment the following line in modules/exllama_hf.py:
config.fused_attn = False
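And if you're driving ExLlama directly from Python rather than through the webui, the same switch lives on the config object (a minimal sketch, assuming you're running from an exllama checkout; the model paths are placeholders):

# Sketch: disabling fused attention when using ExLlama as a library; this is the same flag
# the webui line above toggles.
from model import ExLlama, ExLlamaConfig

config = ExLlamaConfig("/path/to/model/config.json")              # placeholder path
config.model_path = "/path/to/model/gptq_model-4bit.safetensors"  # placeholder path
config.fused_attn = False       # work around the GQA + LoRA illegal memory access
model = ExLlama(config)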
THANK YOU. Everyone here is going on and on and on about how they were messing with fused attention, but nobody said WHERE they were messing with it. OMG.
(Same command and traceback as posted above, ending in: RuntimeError: CUDA error: an illegal memory access was encountered.)
Can this be solved without disabling fused attention? I've got the same error even though I'm only using a 7B model with a rank-32 LoRA.
cc: @turboderp
Getting this on inference when I have a lora loaded (loading the lora itself doesn't produce any errors).
Using text-generation-webui.
File "/home/user/text-generation-webui/modules/models.py", line 309, in clear_torch_cache torch.cuda.empty_cache() File "/home/user/.local/lib/python3.10/site-packages/torch/cuda/memory.py", line 133, in empty_cache torch._C._cuda_emptyCache() RuntimeError: CUDA error: an illegal memory access was encountered Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
I just trained this with qlora. Unfortunately I can't use the Transformers loader because it takes between 15-45 minutes to load a lora (not exaggerating, I just waited 45 minutes for the last one to load before giving up), and I can't find any reports of the same issue. So I'm trying to load this with exllama on top of a GPTQ version of llama-2-70b. I'm not even sure if that's possible, but previous loras I've trained with other libraries have worked fine on llama 1 GPTQ.
I don't think I'm out of VRAM, this is failing on a context size of maybe 20 tokens and I'm on an A6000. Single GPU nothing fancy. I can go up to at least 3000 tokens context with transformers, when I am patient enough to wait the half hour or whatever it takes to load. No problems once it loads.
Possibly relevant args from my qlora training:
--lora_r 64 \
--lora_alpha 16 \
--lora_modules all \
--double_quant \
--quant_type nf4 \
--bf16 \
--bits 4 \
--lora_dropout 0.1
My adapter_config.json if it's relevant:
{ "auto_mapping": null, "base_model_name_or_path": "meta-llama/Llama-2-70b-hf", "bias": "none", "fan_in_fan_out": false, "inference_mode": true, "init_lora_weights": true, "layers_pattern": null, "layers_to_transform": null, "lora_alpha": 16.0, "lora_dropout": 0.1, "modules_to_save": null, "peft_type": "LORA", "r": 64, "revision": null, "target_modules": [ "v_proj", "gate_proj", "k_proj", "down_proj", "up_proj", "o_proj", "q_proj" ], "task_type": "CAUSAL_LM"
This is the file structure of the lora I have, not sure if relevant either: