hiyouga / LLaMA-Factory

Unify Efficient Fine-Tuning of 100+ LLMs

Four M40 GPUs: launching training with accelerate fails with TypeError: unsupported operand type(s) for *: 'NoneType' and 'int' #4626

Closed · Micla-SHL closed this issue 2 days ago

Micla-SHL commented 2 days ago

I updated my LLaMA-Factory checkout 5 days ago. Tonight I wanted to use accelerate for multi-GPU training, following the README of an earlier version.

I first ran accelerate config.

My configuration is as follows:

compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  deepspeed_config_file: /Micla/Project/LLaMA-Factory/config/ds_config.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_config:
  dynamo_backend: INDUCTOR
  dynamo_mode: default
  dynamo_use_dynamic: false
  dynamo_use_fullgraph: true
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
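
A quick way to sanity-check a setup like this (a rough sketch, assuming a recent accelerate release and a CUDA build of PyTorch):

accelerate env        # prints the parsed default config along with environment info
python -c "import torch; print(torch.cuda.get_device_capability(0))"   # a Tesla M40 reports (5, 2), i.e. Maxwell, which has no native bf16 support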

The contents of deepspeed_config_file are:

{
  "train_batch_size": 32,
  "gradient_accumulation_steps": 8,
  "steps_per_print": 2000,
  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true
  },
  "fp16": {
    "enabled": true
  }
}

This is my launch command (I ran the WebUI training page once to get the auto-generated command, then modified the prefix):

accelerate launch /Micla/Project/LLaMA-Factory/src/train.py \
    --stage sft \
    --do_train True \
    --model_name_or_path THUDM/glm-4-9b \
    --preprocessing_num_workers 16 \
    --finetuning_type lora \
    --template default \
    --flash_attn auto \
    --dataset_dir data \
    --dataset identity,wikipedia_zh \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --packing False \
    --report_to none \
    --output_dir saves/GLM-4-9B/lora/train_2024-07-01-02-00-52 \
    --fp16 True \
    --plot_loss True \
    --ddp_timeout 180000000 \
    --include_num_input_tokens_seen True \
    --deepspeed /Micla/Project/LLaMA-Factory/config/ds_config.json \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_target all

The error:

[rank3]:   File "/home/micla/.conda/envs/llama_factory/lib/python3.11/site-packages/accelerate/accelerator.py", line 1618, in _prepare_deepspeed
[rank3]:     "train_batch_size": batch_size_per_device
[rank3]:                         ^^^^^^^^^^^^^^^^^^^^^
[rank3]: TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

I think there is something wrong with my configuration, but I cannot find it, so I am asking for help. My machine has 4x M40 GPUs; GPT-4 told me they do not support bf16. I also asked GPT about the error above, but it kept telling me to change the configuration and the error persisted, so I am asking here.
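
For completeness, the effective global batch size implied by the command line would be

per_device_train_batch_size * gradient_accumulation_steps * num_processes = 4 * 8 * 4 = 128

which does not match the train_batch_size of 32 hard-coded in ds_config.json, and that file never sets train_micro_batch_size_per_gpu at all. I am not sure whether either of these is what triggers the error.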

hiyouga commented 2 days ago

Use: https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/deepspeed/ds_z3_config.json
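
That example leaves the batch-size related fields as "auto" so they are filled in from the training arguments instead of being hard-coded. With train_micro_batch_size_per_gpu missing from the ds_config.json above, accelerate apparently ends up multiplying None when it derives train_batch_size, which is consistent with the traceback. The linked config looks roughly like this (a minimal sketch; see the file itself for the exact contents):

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto"
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}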

Micla-SHL commented 2 days ago

> Use: https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/deepspeed/ds_z3_config.json

Thanks, I will try it later.