hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Training on 8 GPUs, but only 1 GPU is actually used #3343

Closed bird-9 closed 4 months ago

bird-9 commented 4 months ago

Reminder

Reproduction

Hardware

A100[40G] * 8

Training command

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
    --config_file examples/accelerate/fsdp_config.yaml \
    src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path /opt/models/Qwen-72B-Chat-Int4 \
    --dataset sgpt \
    --dataset_dir data \
    --template qwen \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir saves/Qwen-72B-Chat-Int4/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --quantization_bit 4 \
    --plot_loss \
    --fp16 \
    --ddp_find_unused_parameters false \
    --upcast_layernorm true

Config file (examples/accelerate/fsdp_config.yaml)

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1 # the number of nodes
num_processes: 8 # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
gpu_ids: all

Expected behavior

No response

System Info

[INFO|configuration_utils.py:724] 2024-04-19 13:59:20,332 >> loading configuration file /opt/models/Qwen-72B-Chat-Int4/config.json
[INFO|configuration_utils.py:724] 2024-04-19 13:59:20,337 >> loading configuration file /opt/models/Qwen-72B-Chat-Int4/config.json
[INFO|configuration_utils.py:789] 2024-04-19 13:59:20,340 >> Model config QWenConfig {
  "_name_or_path": "/opt/models/Qwen-72B-Chat-Int4",
  "architectures": [
    "QWenLMHeadModel"
  ],
  "attn_dropout_prob": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_qwen.QWenConfig",
    "AutoModelForCausalLM": "modeling_qwen.QWenLMHeadModel"
  },
  "bf16": false,
  "emb_dropout_prob": 0.0,
  "fp16": true,
  "fp32": false,
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 49152,
  "kv_channels": 128,
  "layer_norm_epsilon": 1e-06,
  "max_position_embeddings": 32768,
  "model_type": "qwen",
  "no_bias": true,
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "onnx_safe": null,
  "quantization_config": {
    "bits": 4,
    "damp_percent": 0.01,
    "desc_act": false,
    "group_size": 128,
    "model_file_base_name": "model",
    "model_name_or_path": null,
    "quant_method": "gptq",
    "static_groups": false,
    "sym": true,
    "true_sequential": true
  },
  "rope_theta": 1000000,
  "rotary_emb_base": 1000000,
  "rotary_pct": 1.0,
  "scale_attn_weights": true,
  "seq_length": 32768,
  "softmax_in_fp32": false,
  "tie_word_embeddings": false,
  "tokenizer_class": "QWenTokenizer",
  "transformers_version": "4.39.3",
  "use_cache": true,
  "use_cache_kernel": false,
  "use_cache_quantization": false,
  "use_dynamic_ntk": false,
  "use_flash_attn": "auto",
  "use_logn_attn": false,
  "vocab_size": 152064
}

04/19/2024 13:59:20 - INFO - llmtuner.model.patcher - Loading 4-bit GPTQ-quantized model.
CUDA extension not installed.
CUDA extension not installed.
[INFO|modeling_utils.py:3280] 2024-04-19 13:59:20,507 >> loading weights file /opt/models/Qwen-72B-Chat-Int4/model.safetensors.index.json
[INFO|modeling_utils.py:1417] 2024-04-19 13:59:20,508 >> Instantiating QWenLMHeadModel model under default dtype torch.float16.
[INFO|configuration_utils.py:928] 2024-04-19 13:59:20,509 >> Generate config GenerationConfig {}

/root/miniconda3/envs/LLaMaFactory/lib/python3.10/site-packages/transformers/modeling_utils.py:4225: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
  warnings.warn(
Running tokenizer on dataset (num_proc=16): 100%|██████████| 80/80 [00:14<00:00,  5.36 examples/s]
04/19/2024 13:59:36 - INFO - llmtuner.model.patcher - Loading 4-bit GPTQ-quantized model.
CUDA extension not installed.
CUDA extension not installed.
[The tokenizer progress bar, the "Loading 4-bit GPTQ-quantized model." message, the "CUDA extension not installed." warnings, and the FutureWarning above repeat once per rank (8 processes in total); output is interleaved.]
Loading checkpoint shards:  24%|██▍       | 5/21 [00:03<00:09,  1.69it/s]
Loading checkpoint shards:  95%|█████████▌| 20/21 [00:11<00:00,  1.80it/s]
Traceback (most recent call last):
  File "/opt/code/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/opt/code/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/opt/code/LLaMA-Factory/src/llmtuner/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/opt/code/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 33, in run_sft
    model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
  File "/opt/code/LLaMA-Factory/src/llmtuner/model/loader.py", line 101, in load_model
    model: "PreTrainedModel" = AutoModelForCausalLM.from_pretrained(**init_kwargs)
  File "/root/miniconda3/envs/LLaMaFactory/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
    return model_class.from_pretrained(
  File "/root/miniconda3/envs/LLaMaFactory/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3531, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/root/miniconda3/envs/LLaMaFactory/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3958, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/root/miniconda3/envs/LLaMaFactory/lib/python3.10/site-packages/transformers/modeling_utils.py", line 812, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/root/miniconda3/envs/LLaMaFactory/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 387, in set_module_tensor_to_device
    new_value = value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.32 GiB. GPU 0 has a total capacity of 39.39 GiB of which 1.71 GiB is free. Including non-PyTorch memory, this process has 37.66 GiB memory in use. Of the allocated memory 36.12 GiB is allocated by PyTorch, and 148.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.ht

Training with FSDP + LoRA, but while "Loading checkpoint shards" is running I only see GPU 0's memory growing, until it eventually runs out of memory.
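A quick way to confirm that only one device is filling up is to watch per-GPU memory from a second terminal on the training node while the shards load (a minimal sketch using nvidia-smi; the one-second interval is arbitrary):

watch -n 1 "nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv"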

Others

No response

bird-9 commented 4 months ago

Resolved. I saw in another issue that training quantized models is not supported!

hiyouga commented 4 months ago

Only unquantized models are supported, together with the quantization_bit argument.
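In other words, point --model_name_or_path at the unquantized Qwen-72B-Chat weights and keep --quantization_bit 4 so the model is quantized on the fly, instead of loading the GPTQ checkpoint. A minimal sketch of the adjusted launch, assuming the unquantized weights live at /opt/models/Qwen-72B-Chat (hypothetical path; the remaining arguments follow the original command):

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
    --config_file examples/accelerate/fsdp_config.yaml \
    src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path /opt/models/Qwen-72B-Chat \
    --dataset sgpt \
    --dataset_dir data \
    --template qwen \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir saves/Qwen-72B-Chat/lora/sft \
    --quantization_bit 4 \
    --cutoff_len 1024 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --fp16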