meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama with composable FSDP and PEFT methods, covering single- and multi-node GPU setups. Supports default and custom datasets for applications such as summarization and Q&A. Also supports a number of inference solutions, such as HF TGI and vLLM, for local or cloud deployment, plus demo apps showcasing Meta Llama for WhatsApp and Messenger.

Fine-tuning using multiple GPUs on one node (PEFT+FSDP) #138

Closed: slee-lab closed this issue 1 year ago

slee-lab commented 1 year ago

System Info

Collecting environment information...
PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: SUSE Linux Enterprise Server 15 SP4 (x86_64)
...
Python version: 3.9.17 (main, Jul 5 2023, 20:41:20) [GCC 11.2.0] (64-bit runtime)
...
Is CUDA available: True
CUDA runtime version: 11.7.64
...
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.25.2
[pip3] pytorch-triton==2.1.0+e6216047b8
[pip3] torch==2.0.1
[conda] numpy 1.25.2 pypi_0 pypi
[conda] pytorch-triton 2.1.0+e6216047b8 pypi_0 pypi
[conda] torch 2.0.1 pypi_0 pypi

Information

🐛 Describe the bug

@HamidShojanazeri Could you please clarify which PT nightlies you used? I suspect my problem is that I could not follow your instruction on installing the PT nightlies, because I am stuck with CUDA runtime version 11.7. Per pytorch.org, I tried pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118, but I suppose that would not work with CUDA 11.7. Could you please advise on fine-tuning with multiple GPUs using PEFT+FSDP on CUDA 11.7 systems?

Regarding the error potentially stemming from OOM:

  1. Code: The script I am using is up to date (cloned recently, around 8/20), so it reflects the PRs with the OOM-related updates.
  2. System: I am using 4 x A100 (40GB), and I also tried with more GPUs, to run the Llama-2-13B model with PEFT+FSDP, so I should have enough memory.

Error logs

The error happens while loading the model, before fine-tuning starts.

torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
llama_finetuning.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 
  host      : 
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 2038283)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 2038283

Also from Slurm: slurmstepd: error: Detected 1 oom_kill event in XXXX. Some of the step tasks have been OOM Killed.
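(For context: an exit code of -9 / SIGKILL together with slurmstepd's oom_kill message typically points at the job's host-memory limit being hit while the ranks load the model, rather than GPU memory running out. One generic mitigation, sketched below under the assumption of a standard sbatch setup, is to request more CPU memory for the job; the values are placeholders.)

# hypothetical sbatch header: raise the host-RAM limit so model loading
# across ranks is not OOM-killed by the job's memory cgroup
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --mem=200G        # or --mem=0 to request all memory on the node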

Expected behavior

Fine-tuning to complete without error

slee-lab commented 1 year ago

@HamidShojanazeri, just a friendly reminder:

  1. Could you please clarify the PT nightly versions (e.g., from pip freeze) that are needed for multi-GPU fine-tuning with FSDP+PEFT?
  2. Could not installing the correct nightlies result in an OOM kill error, or does it come from something else?

HamidShojanazeri commented 1 year ago

@slee-lab sorry for the late reply. Yes, the pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118 nightlies only work with CUDA 11.8 or newer. Please give the nightlies a try and add the --pure_bf16 flag as well.
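For reference, the end-to-end steps described above might look roughly like the sketch below, assuming the README-style torchrun launch of llama_finetuning.py; the model path and GPU count are placeholders, and flag names can differ between repo versions.

# install a cu118 nightly build (per the comment above, needs CUDA 11.8 or newer)
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118

# multi-GPU PEFT+FSDP fine-tuning with the --pure_bf16 flag added
torchrun --nnodes 1 --nproc_per_node 4 llama_finetuning.py \
  --enable_fsdp --use_peft --peft_method lora \
  --pure_bf16 \
  --model_name /path/to/Llama-2-13b-hf \
  --output_dir /path/to/save/peft/model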

slee-lab commented 1 year ago

@HamidShojanazeri I am stuck with a machine on CUDA 11.7. Do you have any advice for running multi-GPU fine-tuning and inference with CUDA < 11.8?

HamidShojanazeri commented 1 year ago

Does single-GPU fine-tuning work for you? You can use PEFT + quantization to run the 13B.
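A single-GPU run along those lines might look like the following sketch, based on the repo's single-GPU example; the model path and output directory are placeholders.

# single-GPU LoRA fine-tuning with quantization so the 13B model fits in memory
python llama_finetuning.py --use_peft --peft_method lora --quantization \
  --model_name /path/to/Llama-2-13b-hf \
  --output_dir /path/to/save/peft/model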

slee-lab commented 1 year ago

  1. Yes, they do work, but I wanted to try without PEFT and without quantization.
  2. Also, inference tends to take a surprisingly long time even for the base model, so I was hoping for some multi-GPU support.

HamidShojanazeri commented 1 year ago

  1. If you don't need PEFT, then FSDP-only does not require the nightlies, but it will need a larger compute budget (a sketch of such a launch is shown below).
  2. Can you please try commenting out the quantization for inference?
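For reference, an FSDP-only multi-GPU launch (point 1) might look roughly like this, assuming the README-style invocation; the GPU count, model path, and checkpoint folders are placeholders.

# full fine-tuning with FSDP only (no PEFT), saving sharded checkpoints
torchrun --nnodes 1 --nproc_per_node 4 llama_finetuning.py \
  --enable_fsdp \
  --model_name /path/to/Llama-2-13b-hf \
  --dist_checkpoint_root_folder model_checkpoints \
  --dist_checkpoint_folder fine-tuned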

slee-lab commented 1 year ago

Thank you for your input!

  1. I will try fine-tuning with FSDP only on multiple GPUs then.
  2. Wow, I did not know about this...
     2-1. To clarify, do you mean that if I turn off quantization it will speed up the inference?
     2-2. Also, if I run inference from a PEFT model fine-tuned with quantization, should inference also be run with quantization = True?
HamidShojanazeri commented 1 year ago

re 2-1: yes, unfortunately quantization currently runs slower for inference; you can find more about it here.

re 2-2: you don't have to run inference with quantization even if you fine-tuned with quantization.
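For example, running inference against the fine-tuned PEFT adapter without quantization could look roughly like this sketch; the script path and flag names are assumptions based on the repo's inference example at the time, and all paths are placeholders.

# load the base model plus the PEFT checkpoint, simply omitting the --quantization flag
python inference/inference.py \
  --model_name /path/to/Llama-2-13b-hf \
  --peft_model /path/to/save/peft/model \
  --prompt_file prompt.txt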

slee-lab commented 1 year ago

Thanks so much! That resolved all my questions :)

ekim322 commented 1 year ago

I am getting the same error as well.

--> Running with torch dist debug set to detail
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 11312 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 11313) of binary: /home/ekim/anaconda3/envs/torch/bin/python
Traceback (most recent call last):
  File "/home/ekim/anaconda3/envs/torch/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ekim/anaconda3/envs/torch/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ekim/anaconda3/envs/torch/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ekim/anaconda3/envs/torch/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ekim/anaconda3/envs/torch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ekim/anaconda3/envs/torch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
llama_finetuning.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-05_17:31:01
  host      : Kserver
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 11313)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 11313
======================================================

I have a dual RTX 4090 setup (1 FE, 1 Suprim Liquid). Here is the output of my nvidia-smi and the package versions.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Graphics...  On   | 00000000:01:00.0 Off |                  Off |
|  0%   44C    P8    22W / 450W |      3MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA Graphics...  On   | 00000000:03:00.0 Off |                  Off |
|  0%   36C    P8    36W / 480W |      3MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
torch - Version: 2.0.1+cu118
accelerate - Version: 0.21.0
peft - Version: 0.6.0.dev0
trl - Version: 0.6.0
transformers - Version: 4.33.0
bitsandbytes - Version: 0.39.1

I want to fine-tune Llama 2 with QLoRA using both FSDP and PEFT. PEFT training seems to work fine on a single GPU, but I get the CUDA errors with dual GPUs. I also tried running inference via Hugging Face across both GPUs and got the same error.

Could this happen because the two RTX 4090s I have are from different manufacturers? Or could it be a CUDA/cuDNN issue? I'm not really sure how to debug this.

singhalshikha518 commented 1 year ago

@HamidShojanazeri I am also getting the same error. Could you please help with this? Here is the link to the issue I opened for the same problem: https://github.com/facebookresearch/llama-recipes/issues/182

HamidShojanazeri commented 1 year ago

@ekim322 as I don't have this setup to try/repro on my end, it might help to take some debugging steps. Can you please first make sure you are on the PyTorch nightlies for using PEFT+FSDP, then try it on one GPU (just LoRA + quantization) with the settings as they are in the repo, to make sure the basics are running?

slee-lab commented 1 year ago

@HamidShojanazeri Hi, I had an additional question about inference. I recently noticed the PEFT config option inference_mode, which defaults to False even in the inference example code (https://github.com/facebookresearch/llama-recipes/blob/main/src/llama_recipes/configs/peft.py). I am not sure if this is intended or a bug; could you clarify or share your experience on what this configuration actually does? (https://github.com/search?q=repo%3Ahuggingface%2Fpeft%20inference_mode&type=code)

thanhsang298 commented 10 months ago

I am trying to fine-tune the 7B model with PEFT (LoRA).

GPU: A30 24G
torch - Version: 2.0.1+cu117

python -m llama_recipes.finetuning --use_peft --peft_method lora --dataset alpaca_dataset --use_fp16 --model_name checkpoint_hf --output_dir peft_chpt

but it always kills my process. Any suggestions? Thanks.