Open 1a2cjitenfei opened 1 year ago
deepspeed
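accelerate + deepspeed works. I tested it on 8 GPUs and it runs. I will look into how to submit a PR for it later.
Inference with accelerate also works; I tested that too without problems. I will add both together later.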
python3 -m torch.distributed.launch --nproc_per_node 4 \
--nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy finetune.py --dataset_path data/alpaca --lora_rank 8 --per_device_train_batch_size 2 --gradient_accumulation_steps 1 --max_steps 10000 --save_steps 1000 --save_total_limit 2 --learning_rate 2e-5 --fp16 --remove_unused_columns false --logging_steps 50 --output_dir output
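You can also do multi-GPU training directly with PyTorch's built-in distributed training support.
This does not use deepspeed, right?
No. It uses torch's DDP, which only takes two or three extra lines of code. You can give it a try; if it still does not work, I should be able to put up a PR in the next day or two.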
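The thread does not show the "two or three lines" themselves. Below is a minimal sketch of what such a change usually looks like, assuming the script is launched with `torchrun`, which exports `LOCAL_RANK`, `RANK` and `WORLD_SIZE` to every worker. The helper names `read_dist_env` and `wrap_ddp` are hypothetical and are not taken from finetune.py.

```python
import os


def read_dist_env(env=None):
    """Read the rank variables that torchrun exports for each worker process."""
    env = os.environ if env is None else env
    return {
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "rank": int(env.get("RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


def wrap_ddp(model):
    """Hypothetical helper: the extra lines a single-GPU script needs for DDP."""
    # Imported lazily so read_dist_env can be used on a machine without PyTorch.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    local_rank = read_dist_env()["local_rank"]
    dist.init_process_group(backend="nccl")       # 1. join the process group
    torch.cuda.set_device(local_rank)             # 2. pin this process to one GPU
    model = model.to(torch.device("cuda", local_rank))
    return DDP(model, device_ids=[local_rank])    # 3. wrap the model
```

If the script trains through the Hugging Face `Trainer`, the `Trainer` does this wrapping itself when it is launched under `torchrun`, so the sketch is only the manual equivalent.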
How do I try deepspeed?
To parallelize with deepspeed, do the following:
1. Add a deepspeed configuration file, ds_config_zero3.json:
    {
        "fp16": {
            "enabled": "auto",
            "loss_scale": 0,
            "loss_scale_window": 1000,
            "initial_scale_power": 16,
            "hysteresis": 2,
            "min_loss_scale": 1
        },
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": "auto",
                "betas": "auto",
                "eps": "auto",
                "weight_decay": "auto"
            }
        },
        "scheduler": {
            "type": "WarmupLR",
            "params": {
                "warmup_min_lr": "auto",
                "warmup_max_lr": "auto",
                "warmup_num_steps": "auto"
            }
        },
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": true
            },
            "offload_param": {
                "device": "cpu",
                "pin_memory": true
            },
            "overlap_comm": true,
            "contiguous_gradients": true,
            "sub_group_size": 1e9,
            "reduce_bucket_size": "auto",
            "stage3_prefetch_bucket_size": "auto",
            "stage3_param_persistence_threshold": "auto",
            "stage3_max_live_parameters": 1e9,
            "stage3_max_reuse_distance": 1e9,
            "stage3_gather_16bit_weights_on_model_save": true
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "steps_per_print": 2000,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "wall_clock_breakdown": false
    }
2. Fix the bug at line 266 of modeling_chatglm.py; otherwise you get "RuntimeError: expected mask dtype to be Bool but got Half":
    if not (attention_mask == 0).all():
        attention_scores.masked_fill_(attention_mask.byte(), -10000.0)
The added .byte() casts attention_mask from half precision to an integer (uint8) mask. On PyTorch versions that only accept boolean masks, .bool() is the equivalent cast.
3. Run the following command to start distributed training:
    torchrun --nproc_per_node=2 finetune.py \
        --dataset_path data/alpaca --lora_rank 8 \
        --per_device_train_batch_size 2 --gradient_accumulation_steps 1 \
        --max_steps 10000 --save_steps 1000 --save_total_limit 2 \
        --learning_rate 2e-5 --fp16 --remove_unused_columns false \
        --logging_steps 50 --output_dir output \
        --deepspeed ds_config_zero3.json
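The `"auto"` entries in the config are placeholders that the Hugging Face Trainer's DeepSpeed integration fills in from command-line arguments such as `--learning_rate`, `--fp16` and `--per_device_train_batch_size`. As a rough sketch, the snippet below lists which fields are left to the command line; `find_auto_keys` is a hypothetical helper, and the embedded config is a trimmed copy used only for illustration.

```python
import json


def find_auto_keys(node, prefix=""):
    """Return dotted paths of every entry whose value is the string "auto"."""
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            paths.extend(find_auto_keys(value, f"{prefix}{key}."))
    elif node == "auto":
        paths.append(prefix.rstrip("."))
    return paths


# Trimmed copy of ds_config_zero3.json, shortened for illustration only.
config_text = """
{
  "fp16": {"enabled": "auto", "loss_scale": 0},
  "optimizer": {"type": "AdamW", "params": {"lr": "auto"}},
  "zero_optimization": {"stage": 3, "offload_param": {"device": "cpu"}},
  "train_micro_batch_size_per_gpu": "auto"
}
"""

config = json.loads(config_text)
auto_keys = find_auto_keys(config)
print(auto_keys)
# ['fp16.enabled', 'optimizer.params.lr', 'train_micro_batch_size_per_gpu']
```

Running the same helper on the full file shows every value that depends on the launch command rather than on the JSON itself.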
I tested this and it does not run. It fails with this error:
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
Same here.
UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
Code:
model = ChatGLMForConditionalGeneration.from_pretrained(
    "THUDM/chatglm-6b", load_in_8bit=False, trust_remote_code=True, device_map='auto')
Removing device_map here does not help either.
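The ValueError names two triggers: `low_cpu_mem_usage=True` and `device_map`. A minimal sketch of keeping them out of the load call only when ZeRO-3 is in use follows. `build_load_kwargs` is a hypothetical helper, not part of finetune.py, and it does not cover other possible causes of the same error.

```python
def build_load_kwargs(use_deepspeed_zero3):
    """Hypothetical helper: choose keyword arguments for from_pretrained.

    ZeRO-3 partitions the weights itself, so device_map (and the
    low_cpu_mem_usage=True that it implies) is left out in that case.
    """
    kwargs = {"trust_remote_code": True}
    if not use_deepspeed_zero3:
        # Single-process loading: let accelerate place the layers.
        kwargs["device_map"] = "auto"
    return kwargs


# Intended use (not executed here; needs transformers and the model code):
# model = ChatGLMForConditionalGeneration.from_pretrained(
#     "THUDM/chatglm-6b", **build_load_kwargs(use_deepspeed_zero3=True))
print(build_load_kwargs(True))   # {'trust_remote_code': True}
print(build_load_kwargs(False))  # {'trust_remote_code': True, 'device_map': 'auto'}
```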
Remove both load_in_8bit=False and device_map='auto'. It is worth getting familiar with the deepspeed parameters; it works in my own tests, and I can post my version.
"fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 },
If it still fails, change "enabled": "auto" in this block to "enabled": true.
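I tried all of that and it still does not work. Could you publish a fork?
https://github.com/liangwq/Chatglm_lora_multi-gpu is a multi-GPU training version:
1. It can train chatglm + LoRA on multiple GPUs.
2. It can save checkpoints.
3. It does not yet support resuming training from a checkpoint; that will be added later.
Hello, the author has updated the repo. Did you get it running? When I used it, it downloaded a whole pile of things...
When you say this method downloaded a pile of things, do you mean it requires installing many Python packages, or something else? If you only use deepspeed, the only extra download is deepspeed itself. If you run into problems, post them here or open an issue on my repo; I look at them whenever I have time.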
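- Training and inference both succeed on a single GPU;
- With the multi-GPU version, it asks to download the model again (and after a while it times out): urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Read timed out. (read timeout=10.0)
Run it a few more times. This is Hugging Face downloading the model over a poor network connection. If you have a VPN, use it; if not, download the model ahead of time and point the pretrained model load at your own local path.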
Thanks! I am a complete beginner. I still have a few questions:
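- I must have downloaded the model when I trained on a single GPU, but I cannot find where it is stored.
- Where should I specify the path to the pretrained model?
model = ChatGLMForConditionalGeneration.from_pretrained("THUDM/chatglm-6b", cache_dir='./', trust_remote_code=True)
Set cache_dir to your own directory.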
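For the two questions above, here is a rough sketch of where the files usually end up and how to prefer a pre-downloaded copy. It assumes a recent `huggingface_hub`, where the cache defaults to `~/.cache/huggingface/hub` and can be moved with the `HF_HOME` or `HF_HUB_CACHE` environment variables; older versions used different locations. Both function names are hypothetical.

```python
import os
from pathlib import Path


def default_hub_cache(env=None):
    """Best-effort guess of where from_pretrained stored downloaded files."""
    env = os.environ if env is None else env
    if "HF_HUB_CACHE" in env:
        return Path(env["HF_HUB_CACHE"])
    hf_home = env.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
    return Path(hf_home) / "hub"


def resolve_model_source(local_dir, hub_id="THUDM/chatglm-6b"):
    """Prefer a pre-downloaded directory; otherwise fall back to the hub id."""
    path = Path(local_dir)
    if (path / "config.json").is_file():
        return str(path)
    return hub_id


print(default_hub_cache())
# Intended use (not executed here; needs transformers and the model code):
# model = ChatGLMForConditionalGeneration.from_pretrained(
#     resolve_model_source("/path/to/chatglm-6b"), trust_remote_code=True)
```

Passing a local directory instead of the hub id avoids the repeated download that each worker process would otherwise attempt.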
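I added multi-GPU training based on accelerate here; take a look if you need it :)
Sharing a simple way to modify the code for DDP: https://zhuanlan.zhihu.com/p/621793987
Can the batch size be increased? With 10 samples per batch, one step takes 5 seconds on a single GPU but three minutes on two GPUs, so two GPUs seem much slower than one. Why is that?
By default it seems that only the first GPU is used.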
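On the batch-size question: under data-parallel training, `--per_device_train_batch_size` is the batch per GPU, so the number of samples per optimizer step grows with the number of processes. The arithmetic is sketched below with a hypothetical dataset of 10000 samples. It does not explain the three-minute step time reported above, which the thread leaves unanswered.

```python
def effective_batch_size(per_device_batch, num_gpus, grad_accum_steps=1):
    """Samples consumed per optimizer step under data-parallel training."""
    return per_device_batch * num_gpus * grad_accum_steps


def steps_per_epoch(num_samples, per_device_batch, num_gpus, grad_accum_steps=1):
    """Optimizer steps needed to see the dataset once (last partial batch kept)."""
    batch = effective_batch_size(per_device_batch, num_gpus, grad_accum_steps)
    return -(-num_samples // batch)  # ceiling division


print(effective_batch_size(2, 2))    # 4
print(steps_per_epoch(10000, 2, 1))  # 5000
print(steps_per_epoch(10000, 2, 2))  # 2500
```

If both GPUs are really in use, the same `--max_steps` therefore covers twice as many samples as the single-GPU run.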