mistralai / mistral-finetune


Fail to finetune with several GPUs #71

Open banalg opened 3 weeks ago

banalg commented 3 weeks ago

Hello,

We successfully fine-tuned the Mistral-7B-Instruct-v0.3 model on a single GPU, but we ran into issues when trying to use multiple GPUs.

The successful fine-tuning with one GPU (A10, 24 GB) was achieved with the settings detailed below.

However, we have not managed to configure the setup to use more than one GPU, which limits how much we can improve training quality and how much knowledge the model can take in.

When using several GPUs, train.py seems to hang at the dist.barrier() call (line 97). We bypassed this by setting the environment variable NCCL_P2P_DISABLE=1, but then it hangs around batch = next(data_loader) (line 228).
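A small standalone check along these lines (only standard PyTorch calls, nothing from mistral-finetune; the script name is just illustrative) might help narrow down whether P2P connectivity between the two GPUs is the culprit:

# check_p2p.py - quick sanity check for GPU peer-to-peer access (illustrative
# helper, not part of mistral-finetune). Run it on the same instance, with the
# same CUDA_VISIBLE_DEVICES, before launching torchrun.
import torch

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda,
      "| NCCL:", torch.cuda.nccl.version())

n = torch.cuda.device_count()
for i in range(n):
    print(f"cuda:{i} -> {torch.cuda.get_device_name(i)}")

# If a pair reports False, NCCL cannot use direct P2P between those GPUs and
# falls back to shared memory / sockets, which is the regime where
# NCCL_P2P_DISABLE=1 tends to matter.
for i in range(n):
    for j in range(n):
        if i != j:
            print(f"P2P {i} -> {j}: {torch.cuda.can_device_access_peer(i, j)}")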

Thank you for your assistance.

Here are the details of our setup

Command used to run the training

CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc-per-node 2 --master_port $RANDOM -m train example/config_instruct_v1.yaml

The config file example/config_instruct_v1.yaml

data:
  instruct_data: "../data/instruct_request_v0.2.json"  # Fill
  data: ""  # Optionally fill with pretraining data
  eval_instruct_data: ""  # Optionally fill

# model
model_id_or_path: "../mistral_models/instruct"  # Change to downloaded path
lora:
  rank: 64

# optim
seq_len: 32768
batch_size: 1
max_steps: 300
optim:
  lr: 6.e-5
  weight_decay: 0.1
  pct_start: 0.05

# other
seed: 0
log_freq: 1
eval_freq: 100
no_eval: True
ckpt_freq: 100

save_adapters: True  # save only trained LoRA adapters. Set to `False` to merge LoRA adapter into the base model and save full fine-tuned model

run_dir: "/data/ft/finetuning_instruct_admin_1"
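As a sanity check on this config: assuming the LoRA adapters are attached to every attention and feed-forward projection of Mistral 7B (that layer list is an assumption; the dimensions are the published Mistral 7B ones), rank 64 works out exactly to the 167,772,160 trainable parameters reported in the logs below:

# Rough LoRA parameter count for Mistral 7B with rank 64 (assumes adapters on
# q/k/v/o and the three feed-forward projections of every layer).
hidden, n_layers, mlp = 4096, 32, 14336
kv_dim = 8 * 128   # 8 KV heads of head dim 128
rank = 64

# Each adapted linear of shape (d_in, d_out) adds rank * (d_in + d_out) parameters.
per_layer = sum(d_in + d_out for d_in, d_out in [
    (hidden, hidden),  # wq
    (hidden, kv_dim),  # wk
    (hidden, kv_dim),  # wv
    (hidden, hidden),  # wo
    (hidden, mlp),     # w1
    (mlp, hidden),     # w2
    (hidden, mlp),     # w3
])
print(rank * per_layer * n_layers)  # 167772160, i.e. 2.26% of 7,415,795,712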

Logs of train.py

(.venv2) root@ip-10-10-10-10:/data/ft/mistral-finetune# TORCH_LOGS="all" CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc-per-node 2 --master_port $RANDOM -m train example/config_instruct_v1.yaml
[2024-06-13 21:57:36,864] torch.distributed.run: [WARNING]
[2024-06-13 21:57:36,864] torch.distributed.run: [WARNING]
[2024-06-13 21:57:36,864] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-06-13 21:57:36,864] torch.distributed.run: [WARNING]
[2024-06-13 21:57:36,865] torch.distributed.elastic.rendezvous.static_tcp_rendezvous: [INFO] Creating TCPStore as the c10d::Store implementation
[2024-06-13 21:57:38,580] torch.distributed.nn.jit.instantiator: [INFO] Created a temporary directory at /data/tmp/tmpm3mt26aa
[2024-06-13 21:57:38,580] torch.distributed.nn.jit.instantiator: [INFO] Writing /data/tmp/tmpm3mt26aa/_remote_module_non_scriptable.py
[2024-06-13 21:57:38,581] torch.distributed.nn.jit.instantiator: [INFO] Created a temporary directory at /data/tmp/tmpt13cbb4l
[2024-06-13 21:57:38,582] torch.distributed.nn.jit.instantiator: [INFO] Writing /data/tmp/tmpt13cbb4l/_remote_module_non_scriptable.py
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='../data/instruct_request_v0.2.json', eval_instruct_data='', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='../mistral_models/instruct', run_dir='/data/ft/finetuning_instruct_admin_1', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=True, checkpoint=True, world_size=2, wandb=WandbArgs(project=None, offline=False, key=None, run_name=None), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - torch.cuda.device_count: 2
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - CUDA_VISIBLE_DEVICES: 2,3
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - local rank: 0
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - Set cuda device to 0
2024-06-13 21:57:39 (UTC) - 0:00:02 - train - INFO - Going to init comms...
[2024-06-13 21:57:39,323] torch.distributed.distributed_c10d: [INFO] Using backend config: {'cuda': 'nccl'}
2024-06-13 21:57:39 (UTC) - 0:00:02 - train - INFO - Run dir: /data/ft/finetuning_instruct_admin_1
[rank0]:[2024-06-13 21:57:39,324] torch.distributed.distributed_c10d: [INFO] Using device cuda for object collectives.
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='../data/instruct_request_v0.2.json', eval_instruct_data='', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='../mistral_models/instruct', run_dir='/data/ft/finetuning_instruct_admin_1', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=True, checkpoint=True, world_size=2, wandb=WandbArgs(project=None, offline=False, key=None, run_name=None), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - torch.cuda.device_count: 2
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - CUDA_VISIBLE_DEVICES: 2,3
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - local rank: 1
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - Set cuda device to 1
2024-06-13 21:57:39 (UTC) - 0:00:02 - train - INFO - Going to init comms...
[2024-06-13 21:57:39,555] torch.distributed.distributed_c10d: [INFO] Using backend config: {'cuda': 'nccl'}
[rank1]:[2024-06-13 21:57:39,556] torch.distributed.distributed_c10d: [INFO] Using device cuda for object collectives.
NCCL version 2.19.3+cuda12.3
2024-06-13 21:57:39 (UTC) - 0:00:02 - train - INFO - TrainArgs: {'batch_size': 1, 'checkpoint': True, 'ckpt_freq': 100, 'data': {'data': '', 'eval_instruct_data': '', 'instruct': {'dynamic_chunk_fn_call': True, 'shuffle': True}, 'instruct_data': '../data/instruct_request_v0.2.json', 'shuffle': False}, 'eval_freq': 100, 'log_freq': 1, 'lora': {'dropout': 0.0, 'enable': True, 'rank': 64, 'scaling': 2.0}, 'max_norm': 1.0, 'max_steps': 300, 'mlflow': {'experiment_name': None, 'tracking_uri': None}, 'model_id_or_path': '../mistral_models/instruct', 'no_ckpt': False, 'no_eval': True, 'num_ckpt_keep': 3, 'num_microbatches': 1, 'optim': {'lr': 6e-05, 'pct_start': 0.05, 'weight_decay': 0.1}, 'run_dir': '/data/ft/finetuning_instruct_admin_1', 'save_adapters': True, 'seed': 0, 'seq_len': 32768, 'wandb': {'key': None, 'offline': False, 'project': None, 'run_name': None}, 'world_size': 2}
2024-06-13 21:57:39 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Reloading model from ../mistral_models/instruct/consolidated.safetensors ...
2024-06-13 21:57:39 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Converting model to dtype torch.bfloat16 ...
2024-06-13 21:57:39 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Loaded model on cpu!
2024-06-13 21:57:39 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Initializing lora layers ...
2024-06-13 21:57:40 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Finished initialization!
2024-06-13 21:57:40 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Sharding model over 2 GPUs ...
2024-06-13 21:57:46 (UTC) - 0:00:09 - finetune.wrapped_model - INFO - Model sharded!
2024-06-13 21:57:46 (UTC) - 0:00:09 - finetune.wrapped_model - INFO - 167,772,160 out of 7,415,795,712 parameters are finetuned (2.26%).
2024-06-13 21:57:46 (UTC) - 0:00:10 - dataset - INFO - Loading ../data/instruct_request_v0.2.json ...
2024-06-13 21:57:46 (UTC) - 0:00:10 - dataset - INFO - ../data/instruct_request_v0.2.json loaded and tokenized.
2024-06-13 21:57:46 (UTC) - 0:00:10 - dataset - INFO - Shuffling ../data/instruct_request_v0.2.json ...
2024-06-13 21:57:46 (UTC) - 0:00:10 - dataset - INFO - Shuffling ../data/instruct_request_v0.2.json ...

NCCL logs

tail: /data/var/log/nccl_debug.log: file truncated
ip-10-10-10-10:19202:19202 [0] NCCL INFO Bootstrap : Using ens5:10.10.10.10<0>
ip-10-10-10-10:19202:19202 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
ip-10-10-10-10:19202:19202 [0] NCCL INFO cudaDriverVersion 12050
ip-10-10-10-10:19202:19202 [0] NCCL INFO NCCL version 2.19.3+cuda12.3
tail: /data/var/log/nccl_debug.log: file truncated
ip-10-10-10-10:19203:19203 [1] NCCL INFO cudaDriverVersion 12050
ip-10-10-10-10:19203:19203 [1] NCCL INFO Bootstrap : Using ens5:10.10.10.10<0>
ip-10-10-10-10:19203:19203 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
ip-10-10-10-10:19202:19217 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
ip-10-10-10-10:19202:19217 [0] NCCL INFO NET/Socket : Using [0]ens5:10.10.10.10<0> [1]br-ae5b5c3787a1:172.19.0.1<0> [2]veth9ea7c0d:fe80::b498:75ff:fe18:107d%veth9ea7c0d<0> [3]veth634837c:fe80::5cd1:75ff:fe11:9735%veth634837c<0> [4]vethbf40cbc:fe80::fc73:9aff:fefd:cbec%vethbf40cbc<0> [5]veth8ae4512:fe80::d49b:5fff:fe17:6b0c%veth8ae4512<0> [6]veth3f1c863:fe80::c464:80ff:fe3e:cd9f%veth3f1c863<0> [7]veth7c8f20a:fe80::64c0:e9ff:fec8:be06%veth7c8f20a<0>
ip-10-10-10-10:19202:19217 [0] NCCL INFO Using non-device net plugin version 0
ip-10-10-10-10:19202:19217 [0] NCCL INFO Using network Socket 2 cudaDev 1 nvmlDev 3 busId 1e0 commId 0x1e2c7b29947d4dbf - Init START 2 cudaDev 0 nvmlDev 2 busId 1d0 commId 0x1e2c7b29947d4dbf - Init START
ip-10-10-10-10:19202:19217 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC -1/-1/-1->1->0
ip-10-10-10-10:19203:19218 [1] NCCL INFO P2P Chunks
ip-10-10-10-10:19202:19217 [0] NCCL INFO Channel 01/02 : 0 1
ip-10-10-10-10:19202:19217 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
ip-10-10-10-10:19202:19217 [0] NCCL INFO P2P Chunksize set to 131072
ip-10-10-10-10:19202:19217 [0] NCCL INFO Channel 00 : 0[2] -> 1[3] via SHM/direct/direct
ip-10-10-10-10:19202:19217 [0] NCCL INFO Channel 01 : 0[2] -> 1[3] via SHM/direct/direct | 512
16-46-125:19202:19217 [0] NCCL INFO Connected all rings
ip-10-10-10-10:19202:19217 [0] NCCL INFO Connected all trees
ip-10-10-10-10:19202:19217 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ip-10-10-10-10:19202:19217 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ip-10-10-10-10:19202:19217 [0] NCCL INFO comm 0x88a2220 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId 1d0 commId 0x1e2c7b29947d4dbf - Init COMPLETE

banalg commented 3 weeks ago

It's working now. We simply stopped the instance for the night, and after restarting it in the morning, fine-tuning with all 4 GPUs worked. It "fell back into working" on its own ("tombé en marche," as we usually say in French), but I would prefer to understand why we had issues in the first place. Our instance likely came back up on a different physical server than yesterday. Could you recommend some checks to detect the hardware and software characteristics of the host that could affect multi-GPU fine-tuning?
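In the meantime, the kind of check we have in mind is a minimal NCCL smoke test like the sketch below (illustrative only; it uses nothing but torch.distributed, and the file name is made up). If this already hangs at the barrier on a given host, the problem would sit in the driver/NCCL/topology layer rather than in mistral-finetune:

# nccl_smoke_test.py - illustrative minimal check, launched the same way as training:
#   CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc-per-node 2 nccl_smoke_test.py
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

x = torch.ones(1, device="cuda") * dist.get_rank()
dist.all_reduce(x)   # with 2 ranks this sums 0 + 1 = 1 on both GPUs
dist.barrier()       # the same kind of collective train.py blocks on
print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")
dist.destroy_process_group()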

We'll wait a few days before closing this issue.

Aniket-J commented 2 weeks ago

Did you figure out what's been causing this? We have a similar setup to yours and also tried with NCCL_P2P_DISABLE set to 1; however, we're using a g4.12xlarge rather than a g5.