mymusise / ChatGLM-Tuning

A fine-tuning scheme based on ChatGLM-6B + LoRA
MIT License
3.73k stars 440 forks

How to train with multiple GPUs? #84

Open 1a2cjitenfei opened 1 year ago

1a2cjitenfei commented 1 year ago

By default it seems to use only the first GPU.

joan126 commented 1 year ago

deepspeed

liangwq commented 1 year ago

accelerate + deepspeed works; I tested it and an 8-GPU run starts fine. I'll see about adding a PR later.

liangwq commented 1 year ago

accelerate also works for inference; I tested that too without problems. I'll add it along with the rest later.

paulcx commented 1 year ago

```bash
python3 -m torch.distributed.launch --nproc_per_node 4 \
    --nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy \
    finetune.py --dataset_path data/alpaca --lora_rank 8 \
    --per_device_train_batch_size 2 --gradient_accumulation_steps 1 \
    --max_steps 10000 --save_steps 1000 --save_total_limit 2 \
    --learning_rate 2e-5 --fp16 --remove_unused_columns false \
    --logging_steps 50 --output_dir output
```

You can also do multi-GPU training directly with PyTorch's built-in distributed launcher.

This doesn't use deepspeed, right?

liangwq commented 1 year ago

No, it uses torch's DDP; two or three extra lines of code and it's done. Give it a try, and if it doesn't work I should put up a PR in the next day or two.
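
For context, a minimal sketch of what those "two or three lines" could be (my reading, not necessarily liangwq's actual patch; the HF Trainer used by finetune.py wraps the model in DistributedDataParallel on its own once each process is pinned to its GPU):

```python
import os
import torch

# torchrun / torch.distributed.launch sets LOCAL_RANK for every worker process.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
# Pin this process to its own GPU so each rank trains on a different card.
torch.cuda.set_device(local_rank)
# From here, build the model and HF Trainer as usual; when the Trainer detects
# a distributed launch, it wraps the model in DistributedDataParallel itself.
```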

paulcx commented 1 year ago

> No, it uses torch's DDP; two or three extra lines of code and it's done. Give it a try, and if it doesn't work I should put up a PR in the next day or two.

How do I try deepspeed?

liangwq commented 1 year ago

To parallelize with DeepSpeed:

1. Add a DeepSpeed config file, `ds_config_zero3.json`:

```json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
```

2. Fix the bug at line 266 of `modeling_chatglm.py`, otherwise you get "RuntimeError: expected mask dtype to be Bool but got Half":

```python
if not (attention_mask == 0).all():
    attention_scores.masked_fill_(attention_mask.byte(), -10000.0)
```

The added `.byte()` converts `attention_mask` into a byte (bool-like) mask instead of a half-precision tensor.

3. Run the following command and training runs distributed:

```bash
torchrun --nproc_per_node=2 finetune.py --dataset_path data/alpaca --lora_rank 8 \
    --per_device_train_batch_size 2 --gradient_accumulation_steps 1 \
    --max_steps 10000 --save_steps 1000 --save_total_limit 2 \
    --learning_rate 2e-5 --fp16 --remove_unused_columns false \
    --logging_steps 50 --output_dir output --deepspeed ds_config_zero3.json
```
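
As a side note on the `"auto"` batch fields above: the HF Trainer fills them in from its own CLI arguments and the world size. A worked example (my addition, using the values from the torchrun command with 2 processes):

```python
# How the HF Trainer resolves the "auto" batch fields in ds_config_zero3.json
# for the command above (2 processes => world size 2).
per_device_train_batch_size = 2   # --per_device_train_batch_size
gradient_accumulation_steps = 1   # --gradient_accumulation_steps
world_size = 2                    # --nproc_per_node=2

train_micro_batch_size_per_gpu = per_device_train_batch_size            # -> 2
train_batch_size = (per_device_train_batch_size
                    * gradient_accumulation_steps * world_size)         # -> 4
print(train_micro_batch_size_per_gpu, train_batch_size)
```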

firslov commented 1 year ago

> DeepSpeed parallelization steps as above: add `ds_config_zero3.json`, apply the `modeling_chatglm.py` mask fix, and run `torchrun ... --deepspeed ds_config_zero3.json`.

I gave it a try and it wouldn't run; it failed with this error:

ValueError: DeepSpeed Zero-3 is not compatible with low_cpu_mem_usage=True or with passing a device_map.

magnificent1208 commented 1 year ago

> DeepSpeed parallelization steps as above: add `ds_config_zero3.json`, apply the `modeling_chatglm.py` mask fix, and run `torchrun ... --deepspeed ds_config_zero3.json`.

> I gave it a try and it wouldn't run; it failed with this error:
>
> ValueError: DeepSpeed Zero-3 is not compatible with low_cpu_mem_usage=True or with passing a device_map.

Same here:

UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
ValueError: DeepSpeed Zero-3 is not compatible with low_cpu_mem_usage=True or with passing a device_map.

Code:

```python
# init model
model = ChatGLMForConditionalGeneration.from_pretrained(
    "THUDM/chatglm-6b", load_in_8bit=False, trust_remote_code=True, device_map='auto')
```

Removing `device_map` here doesn't help either.

liangwq commented 1 year ago

Remove `load_in_8bit=False` and `device_map='auto'`. You need to get familiar with the DeepSpeed parameters; I've tested this on my side, it runs, and I can push it up.
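
Concretely, the load would become something like this (a sketch of liangwq's suggestion; the import assumes the repo's local `modeling_chatglm.py`, as in finetune.py):

```python
# Load without load_in_8bit and without device_map, so DeepSpeed ZeRO-3 is the
# one deciding where parameters live.
from modeling_chatglm import ChatGLMForConditionalGeneration

model = ChatGLMForConditionalGeneration.from_pretrained(
    "THUDM/chatglm-6b", trust_remote_code=True)
```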

liangwq commented 1 year ago

"fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, 如果确实不行 "enabled": true,

firslov commented 1 year ago

"fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, 如果确实不行 "enabled": true,

都试过了,还是不行,朋友fork一个吧

liangwq commented 1 year ago

https://github.com/liangwq/Chatglm_lora_multi-gpu — a multi-GPU training version:

1. Multi-GPU training of ChatGLM + LoRA.
2. Checkpoints can be saved.
3. Resuming from a checkpoint is not supported yet; that will come in a later update.

magnificent1208 commented 1 year ago

Hi, the author has updated the repo. Did you get this running? When I tried it, it went off downloading a whole pile of stuff...

liangwq commented 1 year ago

> Hi, the author has updated the repo. Did you get this running? When I tried it, it went off downloading a whole pile of stuff...

About "downloaded a whole pile of stuff": do you mean you had to install a lot of extra Python packages, or something else? If you only use deepspeed, the single extra download is deepspeed itself. If you hit any problem, paste it here or open an issue for me; I'll take a look when I have time.

magnificent1208 commented 1 year ago

[screenshot]

1. Single-GPU training and inference both worked.
2. With the multi-GPU version it insists on re-downloading the model, and after a while it times out: urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Read timed out. (read timeout=10.0)

liangwq commented 1 year ago

> 1. Single-GPU training and inference both worked.
> 2. With the multi-GPU version it insists on re-downloading the model and eventually times out (urllib3.exceptions.ReadTimeoutError).

Run it a few more times; that's huggingface downloading the model over a flaky network. If you have a VPN, use it. Otherwise download the model ahead of time and point the pretrained-model load at your own local path.
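
If you'd rather pre-download, something like this works (a sketch using `snapshot_download` from the huggingface_hub package; pass the returned path to `from_pretrained` instead of the repo id):

```python
# Pre-fetch the weights once, so multi-GPU runs load from local disk instead of
# racing the network.
from huggingface_hub import snapshot_download

local_path = snapshot_download("THUDM/chatglm-6b")
print(local_path)  # use this path in from_pretrained
```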

magnificent1208 commented 1 year ago

> Run it a few more times; that's huggingface downloading the model over a flaky network. If you have a VPN, use it. Otherwise download the model ahead of time and point the pretrained-model load at your own local path.

Thanks! I'm a total newbie, but a couple more questions:

1. When I trained on a single GPU the model was definitely downloaded, but I can't find where it went.
2. Where do I specify the path to the pretrained model?

liangwq commented 1 year ago

> 1. When I trained on a single GPU the model was definitely downloaded, but I can't find where it went.
> 2. Where do I specify the path to the pretrained model?

Set `cache_dir` to your own path:

```python
model = ChatGLMForConditionalGeneration.from_pretrained(
    "THUDM/chatglm-6b", cache_dir='./', trust_remote_code=True)
```
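
As for question 1, a quick way to find where the earlier single-GPU download landed is to print the default Hugging Face cache path (a minimal sketch; the exact directory layout depends on your transformers version):

```python
# Prints the default cache directory where from_pretrained stores downloads
# when no cache_dir is given (typically under ~/.cache/huggingface).
from transformers.utils import TRANSFORMERS_CACHE

print(TRANSFORMERS_CACHE)
```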

HarderThenHarder commented 1 year ago

I added multi-GPU training support based on accelerate here; take a look if you need it :)

luohuan02 commented 1 year ago

Sharing a simple DDP modification approach: https://zhuanlan.zhihu.com/p/621793987

renllll commented 1 year ago

Can the batch size be made larger here? With 10 samples per batch, a single GPU takes 5 seconds per step, but two GPUs take three minutes per step. Two GPUs end up much slower than one; why is that?