Multi-GPU training does not seem to saturate every card. How can I get the compute load on each card fully utilized?

jianzhnie / LLamaTuner

Easy and Efficient Finetuning LLMs. (Supported LLama, LLama2, LLama3, Qwen, Baichuan, GLM, Falcon) Efficient quantized training and deployment of large models.

https://jianzhnie.github.io/llmtech/

Apache License 2.0

557 stars, 62 forks

Multi-GPU training does not seem to saturate every card. How can I get the compute load on each card fully utilized? #66

Open RayneSun opened 1 year ago

RayneSun commented 1 year ago

I set CUDA_VISIBLE_DEVICES and device_map. When running on 2 A100s, both cards do show memory usage, but GPU utilization is always high on one card and very low on the other.

jianzhnie commented 1 year ago

Which training method are you using?

RayneSun commented 1 year ago

I'm using LoRA to train Baichuan-13B.

jianzhnie commented 1 year ago

That shouldn't happen. When I train, the cards are basically all fully utilized.

RayneSun commented 1 year ago

It looks roughly like this, somewhat like pipeline parallelism.

RayneSun commented 1 year ago

Is it because I'm not using DeepSpeed? Could you share the shell script you use to run Baichuan-13B?

jianzhnie commented 1 year ago

https://github.com/jianzhnie/Efficient-Tuning-LLMs/blob/main/train_lora.py#L169C12-L169C13

jianzhnie commented 1 year ago

Model parallelism may be getting enabled at this location. Try commenting out those two lines.
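
For readers following along: when a model is loaded with a device_map that spreads layers across several GPUs, the layers run one after another, so only one card is busy at a time, which matches the "pipeline-like" pattern described above. A common alternative for data-parallel training is to start one process per GPU and place the whole model on that process's card. The sketch below is only an illustration under stated assumptions: the helper name `per_rank_device_map` is hypothetical and is not taken from train_lora.py, and it relies on the `LOCAL_RANK` environment variable that torchrun sets for each worker.

```python
import os


def per_rank_device_map(env=None):
    """Return a device_map that puts the entire model on this process's GPU.

    Hypothetical helper for illustration; not part of train_lora.py.
    """
    env = os.environ if env is None else env
    # torchrun exports LOCAL_RANK for every worker; default to GPU 0 otherwise.
    local_rank = int(env.get("LOCAL_RANK", "0"))
    # The empty-string key stands for "the whole model".
    return {"": local_rank}


if __name__ == "__main__":
    print(per_rank_device_map())
```

The resulting dictionary could be passed as `device_map=` when loading the model in a plain data-parallel run. It does not apply when DeepSpeed ZeRO-3 is enabled, since ZeRO-3 rejects any device_map.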

RayneSun commented 1 year ago

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=1 train_lora.py \
    --model_name_or_path ../Baichuan-13B-Chat \
    --dataset_name train.json,test.json \
    --data_dir ../../data/toolbench \
    --load_from_local yes \
    --output_dir baichuan-lora \
    --max_steps 50000 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 1000 \
    --learning_rate 5e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.07 \
    --optim "adamw_torch" \
    --lr_scheduler_type "linear" \
    --model_max_length 2560 \
    --source_max_len 2048 \
    --target_max_len 512 \
    --logging_steps 5 \
    --do_train \
    --gradient_checkpointing True \
    --trust_remote_code true \
    --lora_target_modules W_pack \
    --deepspeed "ds_config_zero3_auto.json"

RayneSun commented 1 year ago

I commented out the two lines you mentioned, but when I run it, a single card still carries most of the load.

RayneSun commented 1 year ago

I also removed the device_map configuration from train_lora, because leaving it in raises an error: ValueError: DeepSpeed Zero-3 is not compatible with low_cpu_mem_usage=True or with passing a device_map. Is this related?
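
The error above says that ZeRO-3 cannot be combined with device_map or low_cpu_mem_usage=True, since ZeRO-3 partitions the parameters across processes itself. One way to avoid editing the script each time is to build the model-loading arguments conditionally. This is a minimal sketch, not the repository's code: the function names are hypothetical, and it assumes the DeepSpeed config (for example ds_config_zero3_auto.json) stores the stage under `zero_optimization.stage`, as DeepSpeed configs normally do.

```python
def is_zero3(ds_config):
    """Return True if a DeepSpeed config dict requests ZeRO stage 3."""
    return ds_config.get("zero_optimization", {}).get("stage") == 3


def model_load_kwargs(ds_config=None, local_rank=0):
    """Build keyword arguments for loading the model (hypothetical helper).

    With ZeRO-3, neither device_map nor low_cpu_mem_usage is passed.
    Otherwise the whole model is placed on this process's GPU.
    """
    kwargs = {"trust_remote_code": True}
    if ds_config is not None and is_zero3(ds_config):
        # ZeRO-3 shards the parameters itself, so leave placement to DeepSpeed.
        return kwargs
    kwargs["device_map"] = {"": local_rank}
    return kwargs


if __name__ == "__main__":
    print(model_load_kwargs({"zero_optimization": {"stage": 3}}))
    print(model_load_kwargs(None, local_rank=1))
```

In a real script the config dict would come from `json.load` on the file given to `--deepspeed`, and the result would be unpacked into the model-loading call.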

RayneSun commented 1 year ago

I think I found the problem: the launch parameter needs to be set to --nproc_per_node=2.
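
The fix works because torchrun starts exactly `--nproc_per_node` worker processes. With two GPUs visible but only one process, the second card has no worker driving it. The sketch below is an illustrative pre-launch check with hypothetical helper names. It also shows how the effective batch size changes with the process count, using the batch settings from the command earlier in this thread.

```python
def visible_gpu_count(cuda_visible_devices):
    """Count the device IDs listed in a CUDA_VISIBLE_DEVICES string."""
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def idle_gpus(cuda_visible_devices, nproc_per_node):
    """Number of visible GPUs that get no worker process (never negative)."""
    return max(visible_gpu_count(cuda_visible_devices) - nproc_per_node, 0)


def effective_batch_size(per_device_batch, grad_accum_steps, nproc):
    """Samples consumed per optimizer step across all data-parallel workers."""
    return per_device_batch * grad_accum_steps * nproc


if __name__ == "__main__":
    print(idle_gpus("0,1", 1))             # 1: one card has no worker
    print(idle_gpus("0,1", 2))             # 0: both cards have a worker
    print(effective_batch_size(4, 8, 1))   # 32
    print(effective_batch_size(4, 8, 2))   # 64
```

Moving from one process to two doubles the effective batch size from 32 to 64, so the learning rate and step count may be worth revisiting after the change.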

wgzhendong commented 1 year ago

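> I think I found the problem: the launch parameter needs to be set to --nproc_per_node=2.

Were you able to finish the full training run? I used the same training code as you, and it crashed after 200 steps.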

RayneSun commented 1 year ago

In the end I did not use DeepSpeed; with it, training was actually much slower.