meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama with composable FSDP & PEFT methods, covering single- and multi-node GPU setups. Supports default & custom datasets for applications such as summarization and Q&A, and a number of candidate inference solutions such as HF TGI and vLLM for local or cloud deployment. Includes demo apps showcasing Meta Llama for WhatsApp & Messenger.

Training running out of memory on 1st backward pass of 2nd epoch #320

Closed mariokostelac closed 2 months ago

mariokostelac commented 10 months ago

System Info

System info

Collecting environment information...
PyTorch version: 2.1.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:07:37) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-4.14.327-246.539.amzn2.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A10G
GPU 1: NVIDIA A10G
GPU 2: NVIDIA A10G
GPU 3: NVIDIA A10G

Nvidia driver version: 470.57.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             96
On-line CPU(s) list:                0-95
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7R32
CPU family:                         23
Model:                              49
Thread(s) per core:                 2
Core(s) per socket:                 48
Socket(s):                          1
Stepping:                           0
BogoMIPS:                           5599.90
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save rdpid
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          1.5 MiB (48 instances)
L1i cache:                          1.5 MiB (48 instances)
L2 cache:                           24 MiB (48 instances)
L3 cache:                           192 MiB (12 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-95
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Vulnerable, RAS-Poisoning: Vulnerable
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.2
[pip3] torch==2.1.0+cu118
[pip3] triton==2.1.0
[conda] Could not collect

Information

🐛 Describe the bug

I'm running PEFT fine-tuning on a 13B model (all settings are visible in the logs below) and it OOMs on the first backward pass of the second epoch.

What confuses me most is that the first epoch, including validation and saving the PEFT checkpoint, completes fine, yet the very first backward pass of the second epoch runs out of memory.

The script I use is a modified finetuning.py; the only difference is that it loads its config from YAML (similar to axolotl). The final config dataclasses are printed in the stdout logs (attached below).
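
Roughly, the YAML override step looks like the following sketch (the helper name and file path are placeholders, not the actual script; the config dataclasses are the ones llama-recipes exports):

```python
# Illustrative sketch only: load a YAML file and copy its values onto the
# llama-recipes config dataclasses before calling train(). "finetune.yaml" and
# apply_overrides() are hypothetical; only the dataclasses come from the library.
from dataclasses import fields

import yaml
from llama_recipes.configs import fsdp_config as FSDP_CONFIG
from llama_recipes.configs import train_config as TRAIN_CONFIG


def apply_overrides(cfg, overrides: dict):
    # Only set attributes that already exist on the dataclass, so a typo in the
    # YAML fails loudly instead of being silently ignored.
    valid = {f.name for f in fields(cfg)}
    for key, value in overrides.items():
        if key not in valid:
            raise KeyError(f"Unknown config key for {type(cfg).__name__}: {key}")
        setattr(cfg, key, value)
    return cfg


with open("finetune.yaml") as f:  # hypothetical config file
    raw = yaml.safe_load(f) or {}

train_config = apply_overrides(TRAIN_CONFIG(), raw.get("train_config", {}))
fsdp_config = apply_overrides(FSDP_CONFIG(), raw.get("fsdp_config", {}))
```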

Stdout logs (all settings visible there)

```
Loading train_config
Loading fsdp_config
Loading intercom_config
Loading dataset_config
Loading lora_config
Loading llama_adapter_config
Loading prefix_config
Loading inference_config
--> Running training with 4 GPUs
train_config:
--> model_name=NousResearch/Llama-2-13b-hf
--> enable_fsdp=True
--> low_cpu_fsdp=False
--> run_validation=True
--> batch_size_training=1
--> batching_strategy=padding
--> context_length=4096
--> gradient_accumulation_steps=4
--> gradient_clipping=False
--> gradient_clipping_threshold=1.0
--> num_epochs=3
--> num_workers_dataloader=1
--> lr=0.0001
--> weight_decay=0.0
--> gamma=0.85
--> seed=42
--> use_fp16=False
--> mixed_precision=True
--> val_batch_size=1
--> peft_method=lora
--> use_peft=True
--> output_dir=/opt/ml/output/data/peft_model
--> freeze_layers=False
--> num_freeze_layers=1
--> quantization=False
--> one_gpu=False
--> save_model=True
--> dist_checkpoint_root_folder=PATH/to/save/FSDP/model
--> dist_checkpoint_folder=fine-tuned
--> save_optimizer=False
--> use_fast_kernels=True
fsdp_config:
--> mixed_precision=True
--> use_fp16=False
--> sharding_strategy=ShardingStrategy.FULL_SHARD
--> checkpoint_type=StateDictType.SHARDED_STATE_DICT
--> fsdp_activation_checkpointing=True
--> fsdp_cpu_offload=False
--> pure_bf16=True
--> optimizer=AdamW
dataset_config:
--> dataset=custom_dataset
--> file=dataset_loader.py
--> train_split=train
--> test_split=validation
--> train_file=s3://sensitive-dev-experiments/data/llama2_finetuning/qa/011-first200/llama/train.jsonl
--> validation_file=s3://sensitive-dev-experiments/data/llama2_finetuning/qa/011-first200/llama/validation.jsonl
--> inference_file=s3://sensitive-dev-experiments/data/llama2_finetuning/qa/011-first200/llama/inference.jsonl
--> max_context_size=4096
--> pack_examples=True
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
Loading model on node 0
--> Model NousResearch/Llama-2-13b-hf
--> NousResearch/Llama-2-13b-hf has 13015.86432 Million params
LoraConfig:
--> peft_type=LORA
--> auto_mapping=None
--> base_model_name_or_path=None
--> revision=None
--> task_type=CAUSAL_LM
--> inference_mode=False
--> r=8
--> target_modules={'v_proj', 'q_proj'}
--> lora_alpha=16
--> lora_dropout=0.05
--> fan_in_fan_out=False
--> bias=none
--> modules_to_save=None
--> init_lora_weights=True
--> layers_to_transform=None
--> layers_pattern=None
--> rank_pattern={}
--> alpha_pattern={}
trainable params: 6,553,600 || all params: 13,022,417,920 || trainable%: 0.05032552357220002
bFloat16 enabled for mixed precision - using bfSixteen policy
--> applying fsdp activation checkpointing...
--> Training Set Length = 146
Filtered 2 examples that were too long
--> Validation Set Length = 137
Max CUDA memory allocated was 19 GB
Max CUDA memory reserved was 20 GB
Peak active CUDA memory was 19 GB
Cuda Malloc retires : 1
CPU Total Peak Memory consumed during the train (max): 5 GB
eval_ppl=tensor(1.9105, device='cuda:0') eval_epoch_loss=tensor(0.6474, device='cuda:0')
we are about to save the PEFT modules
PEFT modules are saved in /opt/ml/output/data/peft_model directory
best eval loss on epoch 1 is 0.6473564505577087
Epoch 1: train_perplexity=1.2518, train_epoch_loss=0.2246, epoch time 584.0374154910005s
```

Full stderr output (truncated; the traceback is the same as the one in the "Error logs" section below)

```
Loading checkpoint shards: 0%| | 0/3 [00:00
[... traceback identical to the "Error logs" section below ...]
Training Epoch: 2/3, step 0/36 completed (loss: 0.20571832358837128): 0%| | 0/9 [00:24
```

Error logs

Traceback (most recent call last):
  File "/home/sagemaker-user/finetuning-llama2/llama_finetuning.py", line 271, in <module>
    fire.Fire(main)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/sagemaker-user/finetuning-llama2/llama_finetuning.py", line 254, in main
    results = train(
  File "/home/sagemaker-user/finetuning-llama2/vendor/llama-recipes/src/llama_recipes/utils/train_utils.py", line 104, in train
    loss.backward()
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 10.08 GiB. GPU 0 has a total capacty of 22.20 GiB of which 687.12 MiB is free. Process 40666 has 21.53 GiB memory in use. Of the allocated memory 9.71 GiB is allocated by PyTorch, and 10.61 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at ../c10/cuda/CUDACachingAllocator.cpp:1438 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3b5b087617 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x3935c (0x7f3b5b11b35c in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x3978e (0x7f3b5b11b78e in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x39b4e (0x7f3b5b11bb4e in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x17ba161 (0x7f3b41e33161 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>) + 0x14 (0x7f3b41e2b374 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: at::detail::empty_cuda(c10::ArrayRef<long>, c10::ScalarType, c10::optional<c10::Device>, c10::optional<c10::MemoryFormat>) + 0x111 (0x7f3aee8273e1 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: at::detail::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x31 (0x7f3aee8276b1 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #8: at::native::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x20 (0x7f3aee954170 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #9: <unknown function> + 0x31a0a29 (0x7f3af0764a29 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #10: <unknown function> + 0x31a0b0b (0x7f3af0764b0b in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #11: at::_ops::empty_memory_format::redispatch(c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0xe7 (0x7f3b42d58c27 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x2a5749f (0x7f3b430d049f in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::_ops::empty_memory_format::call(c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x1a3 (0x7f3b42d9cd93 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x127b20b (0x7f3aee83f20b in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #15: <unknown function> + 0x312c72d (0x7f3af06f072d in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #16: at::native::_efficient_attention_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, long, long, at::Tensor const&, double, at::Tensor const&, at::Tensor const&, long, bool, c10::optional<double>, c10::optional<long>) + 0x1f33 (0x7f3af0703683 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #17: <unknown function> + 0x31b9aad (0x7f3af077daad in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #18: <unknown function> + 0x31b9ba7 (0x7f3af077dba7 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #19: at::_ops::_efficient_attention_backward::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, long, long, at::Tensor const&, double, at::Tensor const&, at::Tensor const&, long, bool, c10::optional<double>, c10::optional<long>) + 0x26c (0x7f3b42c58b3c in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #20: at::native::_scaled_dot_product_efficient_attention_backward_cuda(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, std::array<bool, 4ul>, bool, c10::optional<double>) + 0x255 (0x7f3af059ef75 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #21: <unknown function> + 0x31b3c4c (0x7f3af0777c4c in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #22: <unknown function> + 0x3390cd3 (0x7f3af0954cd3 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #23: <unknown function> + 0x4ca79bb (0x7f3b453209bb in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #24: <unknown function> + 0x4ca61fd (0x7f3b4531f1fd in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #25: <unknown function> + 0x2618e82 (0x7f3b42c91e82 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #26: at::_ops::_scaled_dot_product_efficient_attention_backward::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, std::array<bool, 4ul>, bool, c10::optional<double>) + 0x400 (0x7f3b42c58310 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #27: torch::autograd::generated::ScaledDotProductEfficientAttentionBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x47a (0x7f3b446c07aa in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #28: <unknown function> + 0x4cc127b (0x7f3b4533a27b in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #29: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0xe8d (0x7f3b4533353d in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #30: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x698 (0x7f3b45334898 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #31: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x96 (0x7f3b4532b5c6 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #32: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x5c (0x7f3b58387d6c in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #33: <unknown function> + 0xdc253 (0x7f3b5a8b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #34: <unknown function> + 0x94ac3 (0x7f3b8667eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #35: clone + 0x44 (0x7f3b8670fbf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Expected behavior

Peak memory usage stays the same across epochs and training finishes successfully.
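
One way to check this expectation per epoch, rather than only via the run-level summary printed at the end of training, is to reset the CUDA peak-memory counters around each epoch. A rough sketch; `train_one_epoch` is a hypothetical stand-in for the inner loop in `llama_recipes/utils/train_utils.py`, not an actual library function:

```python
# Sketch only: report CUDA peak memory per epoch so growth between epoch 1 and
# epoch 2 becomes visible.
import torch


def run_epoch_with_memory_report(train_one_epoch, epoch: int):
    torch.cuda.reset_peak_memory_stats()
    train_one_epoch(epoch)  # hypothetical: runs one training epoch
    peak_alloc_gib = torch.cuda.max_memory_allocated() / 2**30
    peak_resvd_gib = torch.cuda.max_memory_reserved() / 2**30
    print(f"epoch {epoch}: peak allocated {peak_alloc_gib:.1f} GiB, "
          f"peak reserved {peak_resvd_gib:.1f} GiB")
```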

mariokostelac commented 10 months ago

Ok, I've found that the same script and (what is supposed to be) the same environment (the GPUs are the same) doesn't fail on SageMaker training (batch interface), but fails on SageMaker Studio (interactive interface).

I'll start printing the output of

python -m "torch.utils.collect_env"
nvidia-smi

to see whether there are some notable differences.

What'd be the best way to find usual culprits in environment differences?

bilaalmirza commented 10 months ago

To identify potential differences in the environment between SageMaker training and SageMaker Studio, you can print the output of the following commands:

python -m "torch.utils.collect_env"
nvidia-smi

Comparing the output from these commands in both environments may reveal any variations that could be causing the script to fail in SageMaker Studio. Look for differences in Python packages, CUDA versions, or GPU information. This approach helps pinpoint environmental factors contributing to the issue.
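
A small helper along these lines could capture both reports to a file in each environment so they can be diffed offline (an illustrative sketch; the output path is arbitrary):

```python
# Sketch: dump the torch environment report plus nvidia-smi output to a file,
# run it in both SageMaker training and SageMaker Studio, then diff the files.
import subprocess

from torch.utils.collect_env import get_pretty_env_info

with open("env_report.txt", "w") as f:
    f.write(get_pretty_env_info())
    f.write("\n\n--- nvidia-smi ---\n")
    f.write(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
```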

mariokostelac commented 10 months ago

I've found that SageMaker Studio runs older drivers and CUDA 11.8. It also ends up with less GPU memory available (~200 MB less) for the same GPU. AWS responded that this is caused by some internal complexity (they haven't disclosed what exactly).

SageMaker training jobs get newer drivers, CUDA 12, and the extra ~200 MB of GPU memory, so that run succeeds. I think the parameters I chose put me right at the edge of the available GPU RAM, and losing ~200 MB tipped it over.

If it's expected that the 2nd epoch needs a bit more GPU RAM, I think we can close the issue. I've spent a bit of time looking into it and found that PEFT cloning consumes a bit of RAM. It's unclear why it isn't returned to the pool before the 2nd epoch starts, but that might be expected.
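
One way to check whether that memory is still genuinely allocated at the start of epoch 2, or merely cached by the allocator, would be a quick diagnostic between epochs along these lines (a sketch, not part of the recipe):

```python
# Sketch: after epoch-1 validation and PEFT checkpoint saving, log what is still
# allocated vs. merely cached, then release the cache to see how much comes back.
import torch

print(torch.cuda.memory_summary(abbreviated=True))
allocated_gib = torch.cuda.memory_allocated() / 2**30
torch.cuda.empty_cache()  # returns cached-but-unused blocks to the driver
reserved_gib = torch.cuda.memory_reserved() / 2**30
print(f"allocated: {allocated_gib:.1f} GiB; reserved after empty_cache: {reserved_gib:.1f} GiB")
```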

HamidShojanazeri commented 9 months ago

@mariokostelac it is not expected that the second epoch consumes more memory. However, PyTorch's memory allocation may play a role by fragmenting the memory; I wonder if using this flag would help you further.
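
The flag referenced here is presumably the allocator setting named in the OOM message (PYTORCH_CUDA_ALLOC_CONF with max_split_size_mb); a minimal way to try it might look like the sketch below, where the 128 MB value is only an illustrative example, not a tuned recommendation:

```python
# Sketch: cap the allocator's block-split size to reduce fragmentation-related
# OOMs. The variable must be set before any CUDA memory is allocated, ideally at
# the very top of the training script.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after the env var is set so the allocator picks it up
```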

init27 commented 2 months ago

Please feel free to re-open if you still have an issue!