Same here. I am LoRA fine-tuning Qwen2-7B with a 15k context length on an L20 (48GB) and getting OOM.
Same here. I'm trying to use multiple A100s (80GB) to LoRA fine-tune with a 32k context length and keep getting OOM.
So any solutions yet?
I used the LongLoRA training method to save memory by adding the parameter "shift_attn: true". The principle of the method is described here: https://hkaift.com/hk/%E9%95%B7%E6%96%87%E6%9C%AC%E4%B8%AD%E5%BE%AE%E8%AA%BF%E5%A4%A7%E5%9E%8B%E8%AA%9E%E8%A8%80%E6%A8%A1%E5%9E%8B%E7%9A%84%E8%A7%A3%E6%B1%BA%E6%96%B9%E6%A1%88-longlora/
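For what it's worth, here is a minimal sketch of where that flag sits in a LLaMA-Factory LoRA config. Everything except `shift_attn` is an illustrative placeholder, not a recommendation; check the docs of your installed version for the exact argument names.

```yaml
# Illustrative LoRA SFT config fragment; only shift_attn is the point here,
# the remaining keys/values are placeholder examples.
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
shift_attn: true              # LongLoRA shift-short-attention to reduce memory
cutoff_len: 16384
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
bf16: true
```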
Thank you for your suggestion, but is there any way to do it with full fine-tuning (not LoRA)?
I don't know; even LongLoRA currently only supports the Llama series: https://github.com/hiyouga/LLaMA-Factory/issues/4071#issuecomment-2152793097
Good news! How does that work: full fine-tuning plus the parameter "shift_attn: true"? Or did you just replace the 7B with Qwen2-1.5B?
I think I was wrong about some things; the log also shows that LongLoRA is not supported. I can fine-tune with 25k tokens total using Qwen2-1.5B on 8xH100 with DeepSpeed.
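For reference, this is a sketch of the kind of DeepSpeed config typically used for that setup (a ZeRO-3 configuration with CPU offload); the values are illustrative, not necessarily what was actually run:

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}
```

In LLaMA-Factory it would be referenced from the training YAML via something like a `deepspeed: <path-to-this-json>` entry (the path is a placeholder).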
Well, hoping to find a way to spread long sequences across multiple nodes, I tried multi-node training, but it only seemed to add data parallelism; a single GPU would still OOM.
That is totally correct. Have you tried training with quantization?
I haven't tried quantization at all; maybe I can.
How can we get the admin/mod to pay attention to this issue, assign someone to it, offer advice, and start fixing it? 😄
What do you mean by training with quantization? Like QLoRA + FSDP? I tried a 32k context using 8xA100 but still get OOM for a 70B model.
I mean, can we full fine-tune with quantization? It seems like the quantization-bit option only applies to LoRA.
DeepSpeed-Ulysses may help, but it looks like LLaMA-Factory doesn't support it yet. Same here: #5207
Hi, do you use Llama 3.1? I found a dependency conflict: Llama 3.1 needs transformers==4.43.2 while LongLoRA needs transformers<=4.42.4.
Yes, I had this problem too. I solved it by creating a new conda environment and installing the latest version of LLaMA-Factory.
Thanks for your reply. I also solved it by modifying the requirement check. And I have another question: I am now doing continued pre-training on a PubMed corpus based on Llama3.1-8B with cutoff_length=12000 and LongLoRA; is that supposed to be better than cutoff_length=2048, as in this issue? https://github.com/hiyouga/LLaMA-Factory/issues/4657
I don't quite understand what "supposed to be better than cutoff_length=2048" means. Actually, I'm a beginner, but I think it depends on what you're trying to do: if you want longer context, cutoff_length=12000 is better. As for the issue you're referencing, pre-training automatically segments the corpus for you instead of truncating it, while SFT truncates at the cutoff.
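A toy sketch of the difference I mean (purely illustrative, not LLaMA-Factory's actual preprocessing code):

```python
# Toy illustration of "segmenting" (pre-training packing) vs. truncation (SFT).

def pack_for_pretraining(token_ids, block_size):
    """Concatenate the corpus and split it into consecutive fixed-size blocks."""
    return [token_ids[i:i + block_size]
            for i in range(0, len(token_ids) - block_size + 1, block_size)]

def truncate_for_sft(example_ids, cutoff_len):
    """Keep only the first cutoff_len tokens of one example; the tail is dropped."""
    return example_ids[:cutoff_len]

corpus = list(range(30_000))          # pretend this is one long tokenized document
blocks = pack_for_pretraining(corpus, block_size=12_000)
print(len(blocks), len(blocks[0]))    # 2 blocks of 12,000 tokens
                                      # (the 6,000-token remainder is dropped in this toy version)

one_example = list(range(30_000))
print(len(truncate_for_sft(one_example, cutoff_len=12_000)))  # 12,000 tokens, the rest is lost
```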
Any update?
try --enable_liger_kernel
and --use_unsloth_gc
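In a YAML config these would look roughly like the fragment below. The argument names are assumed from the CLI flags above; verify them against the docs of your installed LLaMA-Factory version.

```yaml
# Illustrative fragment only; confirm exact argument names for your version.
enable_liger_kernel: true   # fused Triton kernels that reduce activation memory
use_unsloth_gc: true        # unsloth-style gradient checkpointing
flash_attn: fa2             # commonly combined with FlashAttention-2 for long contexts
```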
It seems that this PR could solve the problem. Is there any plan for when this PR will be merged?
@hiyouga Can --use_unsloth_gc work in all situations, including QLoRA+FSDP, ds_zero3, and ds_zero3_cpu_offload?
@mces89 yep, it supports almost all settings
Reminder
System Info
llamafactory version: 0.8.3.dev0
Reproduction
Expected behavior
I have 8xH100 (PCIe or SXM, both are okay). I want to fully fine-tune (at least) a 7B model on my dataset. My dataset has a very long context length (60k tokens for input and output). How can I do this? Training keeps running out of memory.
If I reduce the context length to fit the model, for example Qwen2-7B with around a 32k max length, it still gets an OOM error. It only works when I scale down to Qwen2-1.5B with a cutoff_len of 26000. It seems like the model size (7B vs. 1.5B) and the value of cutoff_len determine the VRAM used on a single GPU (and currently 80GB on the H100 is the cap; even the H100 NVL's 94GB won't help much).
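For context, here is a rough back-of-envelope estimate of why I think full fine-tuning hits the 80GB cap (assuming bf16 training with fp32 AdamW master weights and moments; numbers are approximate):

```python
# Rough per-parameter memory for full fine-tuning with AdamW (mixed precision):
#   bf16 weights (2 B) + bf16 grads (2 B) + fp32 master weights (4 B)
#   + fp32 Adam m and v (4 B + 4 B) ~= 16 bytes/param, before activations.
params_7b = 7e9
state_bytes = params_7b * 16
print(f"~{state_bytes / 1e9:.0f} GB of model/optimizer state")          # ~112 GB

# Even sharded across 8 GPUs with ZeRO-3, that is ~14 GB/GPU, and activation
# memory grows with sequence length, so 60k-token samples can still push a
# single GPU past 80 GB without sequence parallelism or aggressive
# checkpointing/offload.
print(f"~{state_bytes / 8 / 1e9:.0f} GB per GPU with ZeRO-3 sharding")   # ~14 GB
```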
Is there any solution for handling a long context length and a long cutoff_len? Multi-node training (16xH100 or so) is also an option, but I do not think it will help in this case.
Thank you!
Others
No response