princeton-nlp / LESS

[ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning

At step 1, a single GPU works while multiple GPUs get stuck. #22

Open · timturing opened this issue 5 months ago

timturing commented 5 months ago

When I follow the same process as step 1, everything is fine if I set nproc_per_node to 1 in base_training_args.sh (and export CUDA_VISIBLE_DEVICES to my custom device). However, when I set it to a value larger than 1 (and set CUDA_VISIBLE_DEVICES accordingly), the run always gets stuck at this point:

[train set] examples: 13533; # avg tokens: 370.9773254394531
[train set] examples: 13533; # avg completion tokens: 105.39820861816406
/mnt/workspace/anaconda3/envs/LESS/lib/python3.9/site-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an `accelerate.DataLoaderConfiguration` instead: 
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
  warnings.warn(
[INFO|trainer.py:568] 2024-06-28 22:31:18,153 >> Using auto half precision backend
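
When a torchrun launch stalls like this, a standard first step is to rerun with PyTorch's distributed-debugging switches enabled; these are plain PyTorch/NCCL environment variables, not anything LESS-specific, and they usually show which rank and which collective is blocking. A minimal sketch:

```bash
# Generic PyTorch/NCCL debugging knobs (not specific to LESS):
export NCCL_DEBUG=INFO                 # per-rank NCCL init/transport logging
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # extra torch.distributed collective checks
# ...then rerun the same step 1 launch command as before
```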

Also, to avoid another issue, I append base_training_args="$base_training_args --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune" before composing the training_args (sketched after the package list below). The experiment was run on 4 H100 GPUs. The Python version is 3.9.0, and the full pip list is:

accelerate               0.28.0
aiohttp                  3.9.5
aiosignal                1.3.1
async-timeout            4.0.3
attrs                    23.2.0
bitsandbytes             0.40.0
certifi                  2024.6.2
charset-normalizer       3.3.2
click                    8.1.7
datasets                 2.20.0
dill                     0.3.8
docker-pycreds           0.4.0
fast_jl                  0.1.3
filelock                 3.15.4
frozenlist               1.4.1
fsspec                   2024.5.0
gitdb                    4.0.11
GitPython                3.1.43
huggingface-hub          0.23.4
idna                     3.7
Jinja2                   3.1.4
less                     0.1         /mnt/workspace/LESS
MarkupSafe               2.1.5
mpmath                   1.3.0
multidict                6.0.5
multiprocess             0.70.16
networkx                 3.2.1
numpy                    1.26.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.18.1
nvidia-nvjitlink-cu12    12.5.40
nvidia-nvtx-cu12         12.1.105
packaging                24.1
pandas                   2.2.2
peft                     0.7.1
pip                      24.0
platformdirs             4.2.2
protobuf                 5.27.2
psutil                   6.0.0
pyarrow                  16.1.0
pyarrow-hotfix           0.6
python-dateutil          2.9.0.post0
pytz                     2024.1
PyYAML                   6.0.1
regex                    2024.5.15
requests                 2.32.3
safetensors              0.4.3
scipy                    1.13.1
sentry-sdk               2.7.1
setproctitle             1.3.3
setuptools               69.5.1
six                      1.16.0
smmap                    5.0.1
sympy                    1.12.1
tokenizers               0.15.2
torch                    2.1.2
tqdm                     4.66.4
traker                   0.1.3
transformers             4.36.2
triton                   2.1.0
typing_extensions        4.12.2
tzdata                   2024.1
urllib3                  2.2.2
wandb                    0.17.3
wheel                    0.43.0
xxhash                   3.4.1
yarl                     1.9.4
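
For reference, a sketch of where that FSDP workaround sits, following the description above; the appended flags come from the report itself, and the surrounding script contents may differ between versions of the repo:

```bash
# Workaround described above: enable FSDP before composing training_args.
# 'llama_finetune' must resolve to an FSDP config that the HF Trainer accepts.
base_training_args="$base_training_args --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune"
training_args="$base_training_args ..."  # remaining flags as in the stock script
```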

What should I do to make it run on multiple GPUs? By the way, it works correctly on a 2 A100 server, though that environment may not be exactly the same.
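
One generic way to see exactly where the stuck ranks are blocked is to attach py-spy (a third-party tool, not among the packages listed above) to each training worker:

```bash
pip install py-spy
# Replace <PID> with the PID of each worker process (visible in nvidia-smi).
# An NCCL hang typically shows every rank waiting inside torch.distributed calls.
py-spy dump --pid <PID>
```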

Zrc007 commented 2 months ago

You could change nproc_per_node in less/scripts/train/base_training_args.sh.
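
For anyone landing here: nproc_per_node is the torchrun flag that controls how many worker processes (one per GPU) get launched, and it must not exceed the number of visible devices. A minimal sketch of that relationship, with a placeholder entrypoint since the exact launch line varies by repo version:

```bash
# One process per visible GPU; nproc_per_node <= number of devices
# listed in CUDA_VISIBLE_DEVICES. 'train_entrypoint.py' is a placeholder.
export CUDA_VISIBLE_DEVICES=0,1,2,3
torchrun --nproc_per_node 4 --nnodes 1 train_entrypoint.py
```

Note that the original report already sets this correctly; the hang appears only once more than one process is launched.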

QinWHang commented 2 weeks ago

How did you solve this problem? I ran this step on two A6000 cards and it got stuck at the same place:

[INFO|trainer.py:568] 2024-11-08 10:51:53,438 >> Using auto half precision backend
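
Both reports stall right after "Using auto half precision backend", i.e. around the point where the first multi-GPU collectives run. A frequently reported workaround for NCCL hangs on some multi-GPU topologies is to disable peer-to-peer transfers; this is a diagnostic and possible workaround, not a guaranteed fix:

```bash
# Fall back from P2P (NVLink/PCIe peer access) to shared-memory transport.
# If the run then proceeds, the hang is in NCCL's P2P path on this machine.
export NCCL_P2P_DISABLE=1
```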