hiyouga / LLaMA-Factory

A WebUI for Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

SFT Mixtral-8x22B-Instruct OOM error #4071

Closed campio97 closed 1 month ago

campio97 commented 1 month ago

Hello, I am trying to fine-tune the Mixtral-8x22B-Instruct model but I keep getting an OOM error. I am using 3x A100 GPUs for a total of 240 GB of VRAM, with QLoRA 4-bit. After the first fine-tuning step it runs out of memory.

My dataset consists of about 2,000 records, all of them quite long texts; in some cases I think a single record corresponds to about 30,000 tokens.
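
(For context, this is roughly how I measured the record lengths; a minimal standalone sketch, assuming the dataset is a JSON list of records with a "text" field — the file name is just a placeholder:)

# Rough token-length check for the dataset (not LLaMA-Factory's preprocessing).
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/workspace/models/mixtral8x22")

with open("data/massime_tag_sentenze.json") as f:  # placeholder path
    records = json.load(f)

lengths = sorted(len(tokenizer(r["text"])["input_ids"]) for r in records)
print("records:", len(lengths))
print("median tokens:", lengths[len(lengths) // 2])
print("max tokens:", lengths[-1])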

Here is my "accelerate" configuration:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: true # offload may affect training speed
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1 # the number of nodes
num_processes: 3 # the number of GPUs in all nodes
rdzv_backend: static 
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
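
(While debugging I also check per-GPU free memory from a separate shell; a minimal sketch, not part of LLaMA-Factory. torch.cuda.mem_get_info reports driver-level numbers, so it also reflects memory held by the running training process:)

# Print free/total memory for each visible GPU to sanity-check how much
# room the sharded 4-bit weights actually leave for activations.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"cuda:{i}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")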

Here is my LLaMA-Factory configuration:

### model
model_name_or_path: /workspace/models/mixtral8x22
quantization_bit: 4

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: q_proj,v_proj,k_proj,o_proj
freeze_trainable_layers: 8

### QLORA
lora_dropout: 0.1
lora_rank: 8
lora_alpha: 32

### ddp
ddp_timeout: 180000000

### dataset
dataset: massime_tag_sentenze
template: mistral
cutoff_len: 64000
max_samples: 3000
overwrite_cache: true
preprocessing_num_workers: 8

### output
output_dir: saves/mixtral8x22/lora/sft
logging_steps: 10
save_steps: 100
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1

### eval
val_size: 0.1
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500

Error:

  File "/workspace/LLaMA-Factory/src/train.py", line 14, in <module>
    main()
  File "/workspace/LLaMA-Factory/src/train.py", line 5, in main
    run_exp()
  File "/workspace/LLaMA-Factory/src/llamafactory/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/workspace/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 73, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2216, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3250, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2121, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 289, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 319, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 364.00 MiB. GPU 0 has a total capacity of 79.15 GiB of which 50.12 MiB is free. Process 893681 has 78.98 GiB memory in use. Of the allocated memory 75.62 GiB is allocated by PyTorch, and 2.70 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
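
(The message suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True; as far as I understand it has to be set before the CUDA allocator initializes, e.g. in the environment or at the top of the entry script as sketched below, and it only mitigates fragmentation rather than freeing real VRAM:)

# Sketch: enable expandable segments before anything touches CUDA.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported only after the allocator option is set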

What am I doing wrong? Thank you in advance for your help.

AlexYoung757 commented 1 month ago

Modify “cutoff_len”, e.g. cutoff_len: 4096.

campio97 commented 1 month ago

Modify “cutoff_len”, e.g. cutoff_len: 4096.

Why? Does cutoff_len cut my records to the maximum set length? I need a length far greater than 4096.
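
(If I understand correctly, it would do something like the following, which would throw away most of each record — a rough sketch assuming standard Hugging Face truncation, not LLaMA-Factory's actual code:)

# Sketch of the effect of cutoff_len: sequences longer than the limit are cut.
from transformers import AutoTokenizer

cutoff_len = 4096
tokenizer = AutoTokenizer.from_pretrained("/workspace/models/mixtral8x22")

very_long_text = "some text " * 20000  # stand-in for a ~30k-token record
encoded = tokenizer(very_long_text, truncation=True, max_length=cutoff_len)
print(len(encoded["input_ids"]))  # at most 4096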

hiyouga commented 1 month ago

A large cutoff_len needs much more VRAM; it cannot fit into 3x A100 GPUs.
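
(A rough back-of-envelope with assumed Mixtral-8x22B dimensions — 56 layers, hidden size 6144, fp16/bf16 activations — and assuming gradient checkpointing keeps roughly one hidden-state tensor per layer; activations are per-GPU and are not sharded by FSDP:)

# Back-of-envelope: activation memory saved for the backward pass, per sequence.
num_layers = 56      # Mixtral-8x22B, assumed
hidden_size = 6144   # Mixtral-8x22B, assumed
bytes_per_value = 2  # fp16 / bf16

def checkpointed_activations_gib(seq_len: int) -> float:
    # one [seq_len, hidden_size] tensor kept per checkpointed layer
    return num_layers * seq_len * hidden_size * bytes_per_value / 2**30

for seq_len in (4096, 32000, 64000):
    print(f"seq_len {seq_len:>6}: ~{checkpointed_activations_gib(seq_len):.1f} GiB")

At cutoff_len 64000 that is on the order of 40 GiB of saved activations per GPU, before counting the quantized weights, LoRA/optimizer state and the recompute peak, which is roughly consistent with the OOM during the first backward pass.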

campio97 commented 1 month ago

I noticed that the issue is not so much how many GPUs I use, because the computation is parallelized and a single GPU has 80 GB of VRAM, which is not enough with a long context. So I tried to activate LongLoRA with S^2-Attn, but the logs show a message that it is not supported for this model (Mixtral-8x22B-Instruct). Is it correct that it is not supported, or is it a bug?

hiyouga commented 1 month ago

S^2-Attn only supports LLaMA models for now.

campio97 commented 1 month ago

Thank you