lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

2-node speed is not faster than 1 node #1153

Open lmolhw5252 opened 1 year ago

lmolhw5252 commented 1 year ago

I use 1 node with 4× V100 and get 700 it/s, and 1 node with 4× P40 and get 300 it/s. But when I train on 2 nodes (the 4× V100 plus the 4× P40) with DeepSpeed, it is not faster, and I get this warning: "4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time". This is my script:

```bash
#!/bin/bash
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=enp,ens
export CXX=g++
deepspeed --hostfile hostfile \
    --master_addr p40 \
    --master_port 29600 \
    fastchat/train/train.py \
    --model_name_or_path /data2/lhw/FastChat/models/vicuna-7b \
    --data_path /data2/lhw/FastChat/playground/data/onlineque_v2.0.json \
    --fp16 True \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --deepspeed /data2/lhw/FastChat/fastchat/train/deepspeed-config.json
```
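The `--hostfile hostfile` argument points DeepSpeed at a hostfile that lists each node and how many GPU slots it offers. A hedged sketch of what that file could look like for the two nodes described above (the hostnames are assumptions for illustration; `p40` matches the `--master_addr` in the script):

```
# DeepSpeed hostfile: one "<hostname> slots=<num_gpus>" line per node.
# Hostnames are placeholders; adjust to the real node names.
v100 slots=4
p40 slots=4
```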

and deepspeed-config.json is:

```json
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "train_micro_batch_size_per_gpu": "auto"
}
```

I use train.py because flash_attn does not support the V100 and P40.

Minxiangliu commented 1 year ago

Hi @lmolhw5252, I currently have one A100 (40 GB) GPU and I am training using your recommendations. However, the run ultimately exits with return code = -9 without any error message, which makes it difficult to understand the cause. It's possible that either CPU or GPU memory is full. Do you have any suggestions on how to proceed?

Configuration:

- python: 3.10.11
- fschat: 0.2.9
- pytorch: 2.0.1 with CUDA 11.7
- deepspeed: 0.9.2
- installed CUDA version: 11.6
- NVIDIA driver: 510.73.05

Executed command:

```bash
export NCCL_IB_DISABLE=1;
export NCCL_P2P_DISABLE=1;
export NCCL_DEBUG=INFO;
export NCCL_SOCKET_IFNAME=en,eth,em,bond;
export CXX=g++;
deepspeed --num_gpus 1 --num_nodes 1 \
FastChat/fastchat/train/train_mem.py \
    --model_name_or_path /raid/minxiang83/Program/vicuna/llama-7b  \
    --data_path /raid/minxiang83/Program/vicuna/datasets/dummy.json \
    --bf16 True \
    --output_dir finetune_output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --deepspeed deepspeed.json
```

deepspeed.json:

```json
{
    "zero_optimization":{
       "stage":3,
       "offload_optimizer":{
          "device":"cpu",
          "pin_memory":true
       },
       "overlap_comm":true,
       "contiguous_gradients":true
    },
    "optimizer":{
       "type":"AdamW",
       "params":{
          "lr":"auto",
          "betas":"auto",
          "eps":"auto",
          "weight_decay":"auto"
       }
    },
    "train_micro_batch_size_per_gpu":"auto"
}
```

Output:

```
[2023-05-17 01:54:35,372] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-17 01:54:35,384] [INFO] [runner.py:541:main] cmd = /root/miniconda3/envs/vicuna/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None FastChat/fastchat/train/train_mem.py --model_name_or_path /raid/minxiang83/Program/vicuna/llama-7b --data_path /raid/minxiang83/Program/vicuna/datasets/dummy.json --bf16 True --output_dir finetune_output --num_train_epochs 3 --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --gradient_accumulation_steps 16 --evaluation_strategy no --save_strategy steps --save_steps 1200 --save_total_limit 10 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --lazy_preprocess True --deepspeed deepspeed.json
[2023-05-17 01:54:37,329] [INFO] [launch.py:222:main] 0 NCCL_DEBUG=INFO
[2023-05-17 01:54:37,329] [INFO] [launch.py:222:main] 0 NCCL_SOCKET_IFNAME=en,eth,em,bond
[2023-05-17 01:54:37,329] [INFO] [launch.py:222:main] 0 NCCL_P2P_DISABLE=1
[2023-05-17 01:54:37,329] [INFO] [launch.py:222:main] 0 NCCL_IB_DISABLE=1
[2023-05-17 01:54:37,329] [INFO] [launch.py:222:main] 0 NCCL_VERSION=2.7.8
[2023-05-17 01:54:37,329] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-17 01:54:37,329] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-17 01:54:37,329] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-17 01:54:37,329] [INFO] [launch.py:247:main] dist_world_size=1
[2023-05-17 01:54:37,329] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-05-17 01:54:39,738] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
mx-69977d7b58-zrz6r:67149:67149 [0] NCCL INFO Bootstrap : Using eth0:192.168.200.187<0>
mx-69977d7b58-zrz6r:67149:67149 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mx-69977d7b58-zrz6r:67149:67149 [0] NCCL INFO cudaDriverVersion 11060
NCCL version 2.14.3+cuda11.7
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.200.187<0>
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Using network Socket
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 00/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 01/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 02/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 03/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 04/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 05/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 06/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 07/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 08/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 09/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 10/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 11/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 12/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 13/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 14/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 15/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 16/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 17/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 18/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 19/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 20/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 21/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 22/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 23/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 24/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 25/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 26/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 27/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 28/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 29/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 30/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Channel 31/32 :    0
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Connected all rings
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO Connected all trees
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
mx-69977d7b58-zrz6r:67149:67224 [0] NCCL INFO comm 0x442e70e0 rank 0 nranks 1 cudaDev 0 busId b7000 - Init COMPLETE
[2023-05-17 01:54:46,755] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:13<00:00, 36.62s/it]
Loading data...
#train 891, #eval 19
Formatting inputs...Skip in lazy mode
Formatting inputs...Skip in lazy mode
Installed CUDA version 11.6 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.6 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.4021217823028564 seconds
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py310_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.06558895111083984 seconds
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2023-05-17 01:56:17,478] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 67149
[2023-05-17 01:56:17,479] [ERROR] [launch.py:434:sigkill_handler] ['/root/miniconda3/envs/vicuna/bin/python', '-u', 'FastChat/fastchat/train/train_mem.py', '--local_rank=0', '--model_name_or_path', '/raid/minxiang83/Program/vicuna/llama-7b', '--data_path', '/raid/minxiang83/Program/vicuna/datasets/dummy.json', '--bf16', 'True', '--output_dir', 'finetune_output', '--num_train_epochs', '3', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '2', '--gradient_accumulation_steps', '16', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '1200', '--save_total_limit', '10', '--learning_rate', '2e-5', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--lazy_preprocess', 'True', '--deepspeed', 'deepspeed.json'] exits with return code = -9
```

lmolhw5252 commented 1 year ago

You may want to check your resources, such as host memory and CPU; you could also try setting the batch size to 1.
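For concreteness, a hedged sketch of that suggestion applied to the command above. The batch-size values are illustrative rather than a verified fix, and the remaining flags (save/eval/scheduler/model_max_length, etc.) are omitted for brevity and should be kept as in the original command. A `dmesg` check is added because return code -9 is a SIGKILL, which often comes from the Linux OOM killer:

```bash
# Hedged sketch: per-device batch sizes dropped to 1, gradient accumulation
# doubled so the effective batch per optimizer step stays at 32 samples.
# Remaining flags are omitted here; keep them as in the original command.
deepspeed --num_gpus 1 --num_nodes 1 \
    FastChat/fastchat/train/train_mem.py \
    --model_name_or_path /raid/minxiang83/Program/vicuna/llama-7b \
    --data_path /raid/minxiang83/Program/vicuna/datasets/dummy.json \
    --bf16 True \
    --output_dir finetune_output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --deepspeed deepspeed.json

# Return code -9 means the launcher's child was killed with SIGKILL; if the
# Linux OOM killer is responsible, the kernel log usually records it:
dmesg -T | grep -i -E "out of memory|killed process"
```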

Minxiangliu commented 1 year ago

> You may want to check your resources, such as host memory and CPU; you could also try setting the batch size to 1.

Are you suggesting setting both per_device_train_batch_size and per_device_eval_batch_size to 1?

merrymercy commented 1 year ago

see also #1255

lmolhw5252 commented 1 year ago

@merrymercy Just go by your resource utilization; set it as high as your resources allow.

invoker-bot commented 1 year ago

I use the same deepspeed.json as you provided, but when I run `python3 -m FastChat.serve.cli --model-path /path/to/my_model`, I get `RuntimeError: 'weight' must be 2-D`. The error seems to come from `torch.embedding`. Have you run into the same problem?

Zhangxy6277 commented 1 year ago

> I use the same deepspeed.json as you provided, but when I run `python3 -m FastChat.serve.cli --model-path /path/to/my_model`, I get `RuntimeError: 'weight' must be 2-D`. The error seems to come from `torch.embedding`. Have you run into the same problem?

I have run into the same problem; have you solved it?

invoker-bot commented 1 year ago

> I use the same deepspeed.json as you provided, but when I run `python3 -m FastChat.serve.cli --model-path /path/to/my_model`, I get `RuntimeError: 'weight' must be 2-D`. The error seems to come from `torch.embedding`. Have you run into the same problem?

> I have run into the same problem; have you solved it?

You can take a look at #508; it seems to work fine.
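For anyone who hits the same error: under ZeRO stage 3 the model parameters are partitioned across ranks, and a checkpoint saved without gathering them can contain placeholder tensors that `torch.embedding` then rejects with exactly this `'weight' must be 2-D` message. A hedged sketch of the same deepspeed.json with one commonly used option that asks DeepSpeed to gather the full 16-bit weights whenever the model is saved (a general ZeRO-3 setting, not necessarily the exact fix described in #508):

```json
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "train_micro_batch_size_per_gpu": "auto"
}
```

Alternatively, DeepSpeed writes a `zero_to_fp32.py` helper into each checkpoint directory, which can consolidate the partitioned states into a full state dict after training.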