@HamidShojanazeri, just a friendly reminder...
@slee-lab Sorry for the late reply. Yes:
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
The nightlies only work with CUDA 11.8 or newer. Please give the nightlies a try and add the --pure_bf16 flag as well.
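For example, a two-GPU PEFT+FSDP launch with that flag would look roughly like this (a sketch; the script name matches this repo, but the paths are placeholders you need to fill in):
torchrun --nnodes 1 --nproc_per_node 2 llama_finetuning.py --enable_fsdp --use_peft --peft_method lora --pure_bf16 --model_name <path_to_model> --output_dir <path_to_save>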
@HamidShojanazeri I am stuck with a machine on CUDA 11.7. Do you have any advice for running multi-GPU fine-tuning and inference with CUDA < 11.8?
Does single-GPU FT work for you? You can use PEFT + quantization to run the 13B.
1. If you don't need PEFT, then FSDP alone does not need the nightlies, but it will need a larger compute size.
2. Can you please try commenting out the quantization for inference?
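To make the PEFT + quantization route concrete, this is roughly what it looks like outside the recipe scripts (a sketch assuming the Hugging Face transformers and peft APIs; the model name and LoRA values are only illustrative):

import torch
from transformers import LlamaForCausalLM
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_int8_training

# Load the base model in 8-bit so the 13B fits on a single 24 GB GPU.
model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    load_in_8bit=True,
    device_map="auto",
)
model = prepare_model_for_int8_training(model)

# Attach small trainable LoRA adapters; the quantized base stays frozen.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()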
Thank you for your input! By commenting out the quantization, you mean removing
quantization = True
right?
Re: 2.1, yes, unfortunately quantization currently runs slower for inference; you can find more about it here.
Re: 2.2, you don't have to run inference with quantization even if you fine-tuned with quantization.
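Concretely, for 2.2 you can load an adapter trained with quantization onto an unquantized base model for inference, roughly like this (a sketch; the checkpoint path is a placeholder):

import torch
from transformers import LlamaForCausalLM
from peft import PeftModel

# Load the base model without 8-bit quantization...
base = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# ...then apply the LoRA adapter weights saved during the quantized fine-tune.
model = PeftModel.from_pretrained(base, "<path_to_peft_checkpoint>")
model.eval()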
Thanks so much! That resolved all my questions :)
I am getting the same error as well.
--> Running with torch dist debug set to detail
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 11312 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 11313) of binary: /home/ekim/anaconda3/envs/torch/bin/python
Traceback (most recent call last):
File "/home/ekim/anaconda3/envs/torch/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/ekim/anaconda3/envs/torch/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/ekim/anaconda3/envs/torch/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/ekim/anaconda3/envs/torch/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/ekim/anaconda3/envs/torch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ekim/anaconda3/envs/torch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
llama_finetuning.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-09-05_17:31:01
host : Kserver
rank : 1 (local_rank: 1)
exitcode : -9 (pid: 11313)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 11313
======================================================
I have a dual RTX 4090 setup (1 FE, 1 Suprim Liquid). Below is the output of my nvidia-smi and the package versions.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA Graphics... On | 00000000:01:00.0 Off | Off |
| 0% 44C P8 22W / 450W | 3MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA Graphics... On | 00000000:03:00.0 Off | Off |
| 0% 36C P8 36W / 480W | 3MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
torch - Version: 2.0.1+cu118
accelerate - Version: 0.21.0
peft - Version: 0.6.0.dev0
trl - Version: 0.6.0
transformers - Version: 4.33.0
bitsandbytes - Version: 0.39.1
I want to fine-tune Llama 2 with QLoRA using both FSDP and PEFT. PEFT training seems to work fine on a single GPU, but I get the CUDA errors with dual GPUs. I also tried inference through Hugging Face with dual GPUs and got the same error.
Could this happen because my two RTX 4090s are from different manufacturers? Or could it be a CUDA/cuDNN issue? I'm not really sure how to debug this; to narrow it down I plan to run a minimal all-reduce smoke test outside of llama-recipes (sketched below).
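A minimal communication check, independent of the training code (launched with torchrun --nproc_per_node 2 smoke_test.py; the file name is arbitrary):

import torch
import torch.distributed as dist

# Initialize the NCCL process group; torchrun sets the rank/world-size env vars.
dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# All-reduce a single value; with 2 GPUs each rank should print 2.0.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
print(f"rank {rank}: {t.item()}")
dist.destroy_process_group()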
@HamidShojanazeri - I am also getting the same error... Could you please help with this? Here is the link to the issue I opened for the same problem: https://github.com/facebookresearch/llama-recipes/issues/182
@ekim322 As I don't have this setup to try/repro on my end, it might be helpful to take some debugging steps. Can you please first make sure you are on the PyTorch nightlies for PEFT+FSDP, then try it on one GPU (just LoRA + quantization) with the settings as they are in the repo, to make sure the basics are running?
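For reference, that single-GPU basics check is the repo's standard example (a sketch; substitute your own paths):
python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name <path_to_model> --output_dir <path_to_peft_output>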
@HamidShojanazeri Hi, I had an additional question about inference.
I recently found a PEFT config option, inference_mode, which defaults to False even in the inference example config (https://github.com/facebookresearch/llama-recipes/blob/main/src/llama_recipes/configs/peft.py).
I am not sure if this is intended or a bug; could you clarify / share your experience on what this configuration actually does?
(https://github.com/search?q=repo%3Ahuggingface%2Fpeft%20inference_mode&type=code)
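From searching the peft source, my (possibly incomplete) reading is that inference_mode=True marks the adapter weights as frozen, and that PeftModel.from_pretrained sets it to True automatically when loading an adapter for inference. A sketch of the two settings (values are illustrative):

from peft import LoraConfig, TaskType

# Training-time config: adapter weights remain trainable.
train_config = LoraConfig(task_type=TaskType.CAUSAL_LM, inference_mode=False, r=8, lora_alpha=32)

# Inference-time config: peft treats the adapter weights as frozen.
infer_config = LoraConfig(task_type=TaskType.CAUSAL_LM, inference_mode=True, r=8, lora_alpha=32)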
I am trying to fine-tune the 7B model with PEFT (LoRA).
GPU: A30 24G
torch - Version: 2.0.1+cu117
python -m llama_recipes.finetuning --use_peft --peft_method lora --dataset alpaca_dataset --use_fp16 --model_name checkpoint_hf --output_dir peft_chpt
but it always kills my process. Any suggestions? Thanks.
System Info
Collecting environment information...
PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: SUSE Linux Enterprise Server 15 SP4 (x86_64)
...
Python version: 3.9.17 (main, Jul 5 2023, 20:41:20) [GCC 11.2.0] (64-bit runtime)
...
Is CUDA available: True
CUDA runtime version: 11.7.64
...
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.25.2
[pip3] pytorch-triton==2.1.0+e6216047b8
[pip3] torch==2.0.1
[conda] numpy 1.25.2 pypi_0 pypi
[conda] pytorch-triton 2.1.0+e6216047b8 pypi_0 pypi
[conda] torch 2.0.1 pypi_0 pypi
🐛 Describe the bug
@HamidShojanazeri Could you please clarify which PT nightlies you mean? I suppose this might be because I could not follow your instructions for installing the PT nightlies, as I am stuck on CUDA runtime version 11.7. Per pytorch.org, I tried
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
but I suppose it would not work with CUDA 11.7. Could you please advise on fine-tuning with multiple GPUs using PEFT+FSDP on CUDA 11.7 systems? Regarding the error potentially stemming from OOM:
Error logs
The error happens while the model is being loaded, before fine-tuning starts.
Also from Slurm: slurmstepd: error: Detected 1 oom_kill event in XXXX. Some of the step tasks have been OOM Killed.
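Since the OOM kill happens during model loading, one mitigation worth trying is streaming the checkpoint in instead of materializing a full copy in host RAM first (a sketch assuming the Hugging Face loader; the path is a placeholder):

from transformers import LlamaForCausalLM

# low_cpu_mem_usage loads weights shard-by-shard, which avoids the large
# transient host-memory spike that can trigger the Slurm OOM killer.
model = LlamaForCausalLM.from_pretrained(
    "<path_to_model>",
    low_cpu_mem_usage=True,
)

Requesting more host memory for the Slurm job (e.g. via --mem) may also help.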
Expected behavior
Fine-tuning to complete without error