meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama3 with composable FSDP & PEFT methods, covering single- and multi-node GPU setups. Supports default and custom datasets for applications such as summarization and Q&A. Also supports a number of inference solutions, such as HF TGI and vLLM, for local or cloud deployment, plus demo apps to showcase Meta Llama3 for WhatsApp & Messenger.

Not able to fine-tune on single node, multiple GPUs #108

Closed · aaekay closed this 1 week ago

aaekay commented 9 months ago

System Info

```
PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.1 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.27.0
Libc version: glibc-2.31

Python version: 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 18:08:17) [GCC 12.2.0] (64-bit runtime)
Python platform: Linux-5.13.0-28-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.0.194
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB

Nvidia driver version: 470.103.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7282 16-Core Processor
Stepping: 0
Frequency boost: enabled
CPU MHz: 3195.310
CPU max MHz: 2800.0000
CPU min MHz: 1500.0000
BogoMIPS: 5600.46
Virtualization: AMD-V
L1d cache: 1 MiB
L1i cache: 1 MiB
L2 cache: 16 MiB
L3 cache: 128 MiB
NUMA node0 CPU(s): 0-15,32-47
NUMA node1 CPU(s): 16-31,48-63
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full AMD retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es

Versions of relevant libraries:
[pip3] numpy==1.25.1
[pip3] torch==2.0.1
[pip3] torch-tb-profiler==0.4.1
[pip3] torchvision==0.15.2
[conda] numpy 1.25.1 pypi_0 pypi
[conda] torch 2.0.1 pypi_0 pypi
[conda] torch-tb-profiler 0.4.1 pypi_0 pypi
[conda] torchvision 0.15.2 pypi_0 pypi
```

Information

🐛 Describe the bug

Tried running torchrun on a single node with multiple GPUs, but the process exits.

Error logs

```
--> Running with torch dist debug set to detail
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 611719 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 611720 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 2 (pid: 611721) of binary: /home/amit_g/scratch/env/llm/bin/python
Traceback (most recent call last):
  File "/home/amit_g/scratch/env/llm/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/amit_g/scratch/env/llm/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/amit_g/scratch/env/llm/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/amit_g/scratch/env/llm/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/amit_g/scratch/env/llm/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/amit_g/scratch/env/llm/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
./llama_finetuning.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2023-08-09_02:46:25
  host       : user
  rank       : 2 (local_rank: 2)
  exitcode   : -9 (pid: 611721)
  error_file : <N/A>
  traceback  : Signal 9 (SIGKILL) received by PID 611721
=======================================================
```

### Expected behavior

I expected it to train.
HamidShojanazeri commented 9 months ago

@aaekay can you please share the command and settings (how many GPUs, what type) you are running? Also note that if you are using FSDP + PEFT, you need to install the PyTorch nightlies.
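For reference, a typical way to install the PyTorch nightlies into an existing environment is sketched below. The cu118 index URL is an assumption and should be matched to your local CUDA toolkit; check pytorch.org for the exact command for your setup.

```bash
# Remove the stable torch build first so it does not shadow the nightly wheel.
pip uninstall -y torch torchvision

# Install the nightly build; the cu118 index is an assumption -- pick the nightly
# index that matches the CUDA toolkit on your machine.
pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu118
```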

aaekay commented 9 months ago

@HamidShojanazeri after switching to the nightly PyTorch, I am still seeing a similar error:

""" File "/home/amit_g/scratch/env/llm/bin/torchrun", line 33, in sys.exit(load_entry_point('torch==2.1.0.dev20230808', 'console_scripts', 'torchrun')()) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/amit_g/scratch/env/llm/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper return f(*args, **kwargs) ^^^^^^^^^^^^^^^^^^ File "/home/amit_g/scratch/env/llm/lib/python3.11/site-packages/torch/distributed/run.py", line 806, in main run(args) File "/home/amit_g/scratch/env/llm/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run elastic_launch( File "/home/amit_g/scratch/env/llm/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in call return launch_agent(self._config, self._entrypoint, list(args)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/amit_g/scratch/env/llm/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: """

I am using the command below:

```bash
torchrun --nnodes 1 \
  --nproc_per_node 3 \
  --rdzv_endpoint=localhost:1800 \
  ./llama_finetuning.py \
  --enable_fsdp \
  --use_peft \
  --peft_method lora \
  --dataset inhouse_dataset \
  --batch_size_training 2 \
  --num_epochs 10 \
  --model_name ../llama/models_hf/70B \
  --pure_bf16 \
  --output_dir ./tmp/70B
```

HamidShojanazeri commented 9 months ago

@aaekay I believe you would need bigger compute resources to run the 70B model. Given enough compute, please make use of this PR to bypass the CPU OOM that you would potentially run into.
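For context on why the run dies: exit code -9 means the process was SIGKILLed, which on a single node usually points to the host OOM killer. With the default loading path, every rank loads the full 70B checkpoint (roughly 140 GB in bf16) into CPU RAM before FSDP shards it, so three ranks can need on the order of 400 GB of host memory; and even after sharding, ~140 GB of bf16 weights cannot fit into the 3 × 40 GB of GPU memory used in the command above. A hedged sketch of a launch command using a low-CPU-memory loading path is shown below; the `--low_cpu_fsdp` flag name and the 8-GPU setting are assumptions about what the referenced PR adds, not something confirmed in this thread.

```bash
# Hypothetical sketch: --low_cpu_fsdp is assumed to be the option added by the referenced PR
# (only rank 0 loads the full checkpoint; the other ranks initialize on the meta device), and
# --nproc_per_node 8 assumes a full 8 x A100-40GB node, which is still tight for the 70B model.
torchrun --nnodes 1 \
  --nproc_per_node 8 \
  ./llama_finetuning.py \
  --enable_fsdp \
  --low_cpu_fsdp \
  --use_peft \
  --peft_method lora \
  --model_name ../llama/models_hf/70B \
  --pure_bf16 \
  --output_dir ./tmp/70B
```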

HamidShojanazeri commented 1 week ago

This seems to be stale, so I am closing it. Please feel free to re-open if you are still seeing the issue.