mymusise / ChatGLM-Tuning

A fine-tuning solution based on ChatGLM-6B + LoRA

How can I train with multiple GPUs? #84

Open 1a2cjitenfei opened 1 year ago

1a2cjitenfei commented 1 year ago

By default it seems to use only the first GPU.

joan126 commented 1 year ago

deepspeed

liangwq commented 1 year ago

accelerate + deepspeed works; I tested it and got it running on 8 GPUs. I'll see about adding a PR later.
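
For reference, a hypothetical invocation of that combination; the flag values are borrowed from the torchrun commands later in this thread, and it assumes you enabled DeepSpeed when answering the `accelerate config` prompts:

```bash
accelerate config    # interactively enable multi-GPU / DeepSpeed
accelerate launch --num_processes 8 finetune.py \
    --dataset_path data/alpaca --lora_rank 8 \
    --per_device_train_batch_size 2 --fp16 --output_dir output
```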

liangwq commented 1 year ago

accelerate also works for inference; I tested that as well with no problems. I'll add it later along with the rest.

paulcx commented 1 year ago

```bash
python3 -m torch.distributed.launch --nproc_per_node 4 \
    --nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy finetune.py \
    --dataset_path data/alpaca --lora_rank 8 --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 --max_steps 10000 --save_steps 1000 \
    --save_total_limit 2 --learning_rate 2e-5 --fp16 \
    --remove_unused_columns false --logging_steps 50 --output_dir output
```

Multi-GPU training also works with PyTorch's built-in distributed training directly.

This doesn't use deepspeed, right?

liangwq commented 1 year ago

No, it uses torch DDP; two or three lines of code and it's done. Give it a try; if it doesn't work, I should put up a PR in the next day or two.
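
For reference, a minimal sketch of what those lines typically look like in a plain PyTorch training loop; the finetune.py in this repo uses the HF Trainer, which handles DDP automatically when launched with torchrun, so treat this as illustrative only:

```python
import os

import torch
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun exports RANK, WORLD_SIZE and LOCAL_RANK for every worker.
torch.distributed.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Stand-in for the real ChatGLM + LoRA model.
model = nn.Linear(16, 16)

# The "two or three lines": move the model to this worker's GPU and wrap
# it so gradients are synchronized across processes on each backward pass.
model = model.cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
```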

paulcx commented 1 year ago

> No, it uses torch DDP; two or three lines of code and it's done. Give it a try; if it doesn't work, I should put up a PR in the next day or two.

How do I try deepspeed?

liangwq commented 1 year ago

DeepSpeed parallelization works as follows:

1. Add a DeepSpeed configuration file, ds_config_zero3.json:

```json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
```

2. Fix the bug at line 266 of modeling_chatglm.py, otherwise it fails with "RuntimeError: expected mask dtype to be Bool but got Half":

```python
if not (attention_mask == 0).all():
    attention_scores.masked_fill_(attention_mask.byte(), -10000.0)
```

The added .byte() call casts attention_mask out of half precision into a mask dtype that masked_fill_ accepts.

3. Run the following command and it trains distributed:

```bash
torchrun --nproc_per_node=2 finetune.py \
    --dataset_path data/alpaca --lora_rank 8 --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 --max_steps 10000 --save_steps 1000 \
    --save_total_limit 2 --learning_rate 2e-5 --fp16 \
    --remove_unused_columns false --logging_steps 50 --output_dir output \
    --deepspeed ds_config_zero3.json
```
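
For context, a sketch of what the step-2 patch changes; the exact original line may differ across versions of modeling_chatglm.py, so treat the "before" line as an assumption:

```python
# Before (assumed): the half-precision mask goes straight into masked_fill_,
# which raises "RuntimeError: expected mask dtype to be Bool but got Half":
#     attention_scores.masked_fill_(attention_mask, -10000.0)

# After: cast the mask first so masked_fill_ receives a valid mask dtype.
if not (attention_mask == 0).all():
    attention_scores.masked_fill_(attention_mask.byte(), -10000.0)
```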

firslov commented 1 year ago

> DeepSpeed parallelization works as follows: 1. Add a DeepSpeed configuration file, ds_config_zero3.json […] 3. Run the following command and it trains distributed: torchrun … --deepspeed ds_config_zero3.json

I tested it and it doesn't run; it reports this error:

ValueError: DeepSpeed Zero-3 is not compatible with low_cpu_mem_usage=True or with passing a device_map.

magnificent1208 commented 1 year ago

> DeepSpeed parallelization works as follows: 1. Add a DeepSpeed configuration file, ds_config_zero3.json […] 3. Run the following command and it trains distributed: torchrun … --deepspeed ds_config_zero3.json

> I tested it and it doesn't run; it reports this error:
>
> ValueError: DeepSpeed Zero-3 is not compatible with low_cpu_mem_usage=True or with passing a device_map.

Same here.

UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
ValueError: DeepSpeed Zero-3 is not compatible with low_cpu_mem_usage=True or with passing a device_map.

Code:

```python
# init model
model = ChatGLMForConditionalGeneration.from_pretrained(
    "THUDM/chatglm-6b", load_in_8bit=False, trust_remote_code=True, device_map='auto')
```

Removing device_map here doesn't help either.

liangwq commented 1 year ago

Remove load_in_8bit=False and device_map='auto'. You need to get familiar with the DeepSpeed arguments; I've tested this on my end, and I can put it up.
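
Concretely, a minimal sketch of the corrected call: ZeRO-3 shards and places parameters itself, so the model has to be loaded without a device_map and without 8-bit quantization.

```python
# Repo-local module (assumption about the import path in this repo).
from modeling_chatglm import ChatGLMForConditionalGeneration

# Under ZeRO-3, DeepSpeed decides where every parameter shard lives,
# so don't pre-place the model with device_map or load_in_8bit.
model = ChatGLMForConditionalGeneration.from_pretrained(
    "THUDM/chatglm-6b", trust_remote_code=True)
```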

liangwq commented 1 year ago

"fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, 如果确实不行 "enabled": true,

firslov commented 1 year ago

"fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, 如果确实不行 "enabled": true,

都试过了,还是不行,朋友fork一个吧

liangwq commented 1 year ago

https://github.com/liangwq/Chatglm_lora_multi-gpu — a multi-GPU training version:

1. Supports multi-GPU training of ChatGLM + LoRA.
2. Can save checkpoints.
3. Resuming training from a checkpoint is not supported yet; an update will follow.

magnificent1208 commented 1 year ago

Hi, the repo owner has pushed an update; did you get this running? When I used it, it downloaded a whole pile of stuff...

liangwq commented 1 year ago

> Hi, the repo owner has pushed an update; did you get this running? When I used it, it downloaded a whole pile of stuff...

About "https://github.com/liangwq/Chatglm_lora_multi-gpu — a multi-GPU training version: 1. supports multi-GPU training of ChatGLM + LoRA; 2. can save checkpoints; 3. resuming is not supported yet" — what do you mean by "downloaded a whole pile of stuff"? That you needed to install a lot of extra Python packages, or something else? If you only use deepspeed, the only extra download is deepspeed itself. If you run into problems, paste them here or open an issue for me; I'll take a look when I have time.

magnificent1208 commented 1 year ago

(screenshot)

1. Single-GPU training and inference both worked.
2. With the multi-GPU version, it wants to re-download the model (and eventually times out): urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Read timed out. (read timeout=10.0)

liangwq commented 1 year ago

> 1. Single-GPU training and inference both worked.
> 2. With the multi-GPU version, it wants to re-download the model (and eventually times out): urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Read timed out. (read timeout=10.0)

Try a few more times: that's huggingface downloading the model, and the network is unreliable. If you have a VPN, use it; otherwise download the model in advance and change the pretrained-model loading to your own local path.

magnificent1208 commented 1 year ago

> Try a few more times: that's huggingface downloading the model, and the network is unreliable. If you have a VPN, use it; otherwise download the model in advance and change the pretrained-model loading to your own local path.

Thanks! I'm a complete beginner, but I still have a couple of questions:

1. When I trained on a single GPU, the model was definitely downloaded, but I can't find where it went.
2. Where should I specify the pretrained model path?

liangwq commented 1 year ago

> 1. When I trained on a single GPU, the model was definitely downloaded, but I can't find where it went.
> 2. Where should I specify the pretrained model path?

```python
model = ChatGLMForConditionalGeneration.from_pretrained(
    "THUDM/chatglm-6b", cache_dir='./', trust_remote_code=True)
```

Set cache_dir to your own path.
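
As a further option, an assumption on my part rather than something stated in this thread: the single-GPU run will have cached the weights under the default HF cache (usually ~/.cache/huggingface), and you can pre-download the snapshot once with huggingface_hub and load from the returned local path:

```python
from huggingface_hub import snapshot_download

# Repo-local module (assumption about the import path in this repo).
from modeling_chatglm import ChatGLMForConditionalGeneration

# Download (or reuse) the full model snapshot; returns the local directory.
local_path = snapshot_download("THUDM/chatglm-6b")

model = ChatGLMForConditionalGeneration.from_pretrained(
    local_path, trust_remote_code=True)
```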

HarderThenHarder commented 1 year ago

I added multi-GPU training based on accelerate here; take a look if you need it :)

luohuan02 commented 1 year ago

Sharing a simple DDP modification approach: https://zhuanlan.zhihu.com/p/621793987

renllll commented 1 year ago

Can the batch size be increased here? With 10 samples per batch, one step takes 5 seconds on a single GPU but three minutes on two GPUs. Why is dual-GPU so much slower than single-GPU?