hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0
33.79k stars · 4.16k forks

Full-finetuning Long Context, Big Cutoff Length LLM #5024

Closed hieuhthh closed 1 month ago

hieuhthh commented 3 months ago

Reminder

System Info

Reproduction

### model
model_name_or_path: Qwen/Qwen2-7B
template: qwen

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: my_data
cutoff_len: 120000
overwrite_cache: true
preprocessing_num_workers: 64
max_new_tokens: 60000

### output
output_dir: saves/qwen2-7b/full/sft
logging_steps: 10
save_steps: 1000
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 5.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.01
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 1000

### log
report_to: wandb

Expected behavior

I have 8xH100 (PCIe or SXM, either is fine). I want to fully fine-tune (at least) a 7B model on my dataset, which has a very long context (60k tokens for input and output). How can I do this? With the config above, it runs out of memory.

If I reduce the context length to fit the model, for example Qwen2-7B with its roughly 32k context window, I still get an OOM error. It only works when I drop to Qwen2-1.5B with a cutoff_len of 26000. It seems like both the model size (7B vs. 1.5B) and the value of cutoff_len drive the VRAM used on a single GPU. (And currently 80GB per H100 is the cap; even the 94GB H100 NVL won't help much.)

Is there any solution for handling a long context and a long cutoff_len? Multi-node training (16xH100 or so) is also an option, but I don't think it will help in this case.

Thank you!

Others

No response

hrz394943230 commented 3 months ago

Same here. I am LoRA fine-tuning Qwen2-7B with a 15k context length on an L20 (48GB) and getting OOM.

mces89 commented 2 months ago

Same here. I'm trying to use multiple A100s (80GB) to LoRA fine-tune with a 32k context length and keep getting OOM.

hieuhthh commented 2 months ago

So any solutions yet?

zifengdexiatian commented 2 months ago

I used the LongLoRA training method to save memory by adding the parameter "shift_attn: true". The principle of the method is described here: https://hkaift.com/hk/%E9%95%B7%E6%96%87%E6%9C%AC%E4%B8%AD%E5%BE%AE%E8%AA%BF%E5%A4%A7%E5%9E%8B%E8%AA%9E%E8%A8%80%E6%A8%A1%E5%9E%8B%E7%9A%84%E8%A7%A3%E6%B1%BA%E6%96%B9%E6%A1%88-longlora/
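For reference, a minimal sketch of where that flag would sit in a config like the one above (this assumes a LoRA run, since shift_attn comes from LongLoRA; the other values are just copied from the original recipe):

### method
stage: sft
do_train: true
finetuning_type: lora
shift_attn: true  # LongLoRA shifted sparse attention, reduces attention memory on long sequences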

hieuhthh commented 2 months ago

Thank you for your suggestion, but is there any way to do it with full fine-tuning (not LoRA)?

zifengdexiatian commented 2 months ago

I don't know; even LongLoRA currently only supports the Llama series: https://github.com/hiyouga/LLaMA-Factory/issues/4071#issuecomment-2152793097

zifengdexiatian commented 2 months ago

Good news! How does that work? Full fine-tuning with the parameter "shift_attn: true"? Or did you just replace the 7B with Qwen2-1.5B?

hieuhthh commented 2 months ago

I think I was wrong about some things; the log also shows that LongLoRA is not supported. I can fine-tune with 25k total tokens using Qwen2-1.5B on 8xH100 with DeepSpeed.

zifengdexiatian commented 2 months ago

Well, I was hoping to find a way to spread a long context across multiple nodes. I tried multi-node training, but it only seemed to parallelize the data; a single GPU would still OOM.

hieuhthh commented 2 months ago

That is totally correct. Have you tried training with quantization?

zifengdexiatian commented 2 months ago

I haven't tried quantization at all; maybe I can.

hieuhthh commented 2 months ago

How can we get the admin/mod to pay attention to this issue, assign someone to it, offer advice, and start fixing it? 😄

mces89 commented 2 months ago

What do you mean by training with quantization? Like QLoRA + FSDP? I tried a 32k context using 8xA100 but still got OOM for a 70B model.

hieuhthh commented 2 months ago

I mean, can we do full fine-tuning with quantization? It seems like the quantization_bit option only applies to LoRA.
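As a point of comparison, a minimal sketch of the quantized route (this assumes, as discussed above, that quantization_bit is only honored with finetuning_type: lora, i.e. QLoRA rather than full fine-tuning):

### method
stage: sft
do_train: true
finetuning_type: lora
quantization_bit: 4  # quantize the frozen base weights to 4-bit; only the LoRA adapters are trained in higher precision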

zhoushaoxiang commented 2 months ago

DeepSpeed-Ulysses may help, but it looks like LLaMA-Factory doesn't support it yet. Same issue here: #5207

ZJL0111 commented 2 months ago

> I used the LongLoRA training method to save memory by adding the parameter "shift_attn: true". The principle of the method is described here: https://hkaift.com/hk/%E9%95%B7%E6%96%87%E6%9C%AC%E4%B8%AD%E5%BE%AE%E8%AA%BF%E5%A4%A7%E5%9E%8B%E8%AA%9E%E8%A8%80%E6%A8%A1%E5%9E%8B%E7%9A%84%E8%A7%A3%E6%B1%BA%E6%96%B9%E6%A1%88-longlora/

Hi, do you use Llama 3.1? I found a dependency conflict: Llama 3.1 needs transformers==4.43.2 while LongLoRA needs transformers<=4.42.4.

zifengdexiatian commented 2 months ago

Yes, I had this problem too. I solved it by creating a new conda environment and installing the latest version of LLaMA-Factory.

ZJL0111 commented 2 months ago

Thanks for your reply. I also solved it by modifying the requirement check. I have another question: I'm now doing continued pretraining on a PubMed corpus based on Llama-3.1-8B with cutoff_len=12000 and LongLoRA. Is that supposed to be better than, say, cutoff_len=2048, as in this issue: https://github.com/hiyouga/LLaMA-Factory/issues/4657?

zifengdexiatian commented 2 months ago

I don't quite understand what "supposed to be better than cutoff_len=2048" means. Actually, I'm a beginner, but I think it depends on what you're trying to do: if you want a longer context, cutoff_len=12000 is better. As for the issue you're referencing, pre-training automatically segments the corpus instead of truncating it, while SFT truncates at the cutoff.

hieuhthh commented 1 month ago

Any update?

hiyouga commented 1 month ago

try --enable_liger_kernel and --use_unsloth_gc
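In config form, these would presumably be the additions to the YAML above (the key names mirror the flags; the exact placement in the file is my assumption):

enable_liger_kernel: true  # fused Liger kernels, reduce activation memory on long sequences
use_unsloth_gc: true       # Unsloth's memory-efficient gradient checkpointing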

yetionyo commented 1 month ago

It seems that this PR can solve the problem. Any plan on when to merge this PR?

https://github.com/hiyouga/LLaMA-Factory/pull/4733

mces89 commented 1 month ago

@hiyouga Can --use_unsloth_gc work in all situations, including QLoRA + FSDP, ds_zero3, and ds_zero3_cpu_offload?

hiyouga commented 1 month ago

@mces89 yep, it supports almost all settings