pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Mistral works on (2.0.1+cu117) but "CUDA out of memory" on (2.1.2+cu121) #116928

Open bpevangelista opened 5 months ago

bpevangelista commented 5 months ago

🐛 Describe the bug

I pip-upgraded torch from 2.0.1 to 2.1.2 and, without any code changes, I now run out of CUDA memory loading Mistral-7B on an NVIDIA GeForce RTX 3060 Ti.

From the traceback I see it failing to allocate 20 MiB, with 205 MiB reserved by PyTorch but unused. Does 2.1.2 reserve more memory? Also, the message says "this process has 17179869184.00 GiB memory in use"; is this GiB or bytes?

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_default_device("cuda:0")

tokenizer = AutoTokenizer.from_pretrained(
    'mistralai/Mistral-7B-v0.1',
    trust_remote_code=True,
)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    'mistralai/Mistral-7B-v0.1',
    trust_remote_code=True,
    torch_dtype=torch.float16,
)

generated_tokens = tokenizer("Tell me a joke", return_tensors="pt")

model_params = {
    'do_sample': True,
    'max_new_tokens': 64,
    'input_ids': generated_tokens.input_ids,
    'attention_mask': generated_tokens.attention_mask,
}

output = model.generate(**model_params)
print(tokenizer.batch_decode(output)[0])

Traceback (most recent call last):
  File "/home/bpevangelista/projects/kfastml/test.py", line 12, in <module>
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/bpevangelista/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/home/bpevangelista/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3706, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/bpevangelista/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4091, in _load_pretrained_model
    state_dict = load_state_dict(shard_file)
  File "/home/bpevangelista/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 510, in load_state_dict
    return safe_load_file(checkpoint_file)
  File "/home/bpevangelista/.local/lib/python3.10/site-packages/safetensors/torch.py", line 310, in load_file
    result[k] = f.get_tensor(k)
  File "/home/bpevangelista/.local/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 8.00 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 22.36 GiB is allocated by PyTorch, and 205.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
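A minimal sketch of how the allocator and driver state could be captured at the point of failure, for comparing the two installs (standard torch.cuda memory APIs; the try/except wrapper is illustrative and not part of the original repro):

import torch
from transformers import AutoModelForCausalLM

torch.set_default_device("cuda:0")

try:
    model = AutoModelForCausalLM.from_pretrained(
        'mistralai/Mistral-7B-v0.1',
        trust_remote_code=True,
        torch_dtype=torch.float16,
    )
except torch.cuda.OutOfMemoryError:
    # Allocator-side view at the point of failure.
    print(torch.cuda.memory_summary(device=0))
    # Driver-side view: (free, total) bytes from cudaMemGetInfo.
    print(torch.cuda.mem_get_info(0))
    raise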

Versions

Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.27.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060 Ti
Nvidia driver version: 546.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-10700KF CPU @ 3.80GHz
CPU family: 6
Model: 165
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 5
BogoMIPS: 7583.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves flush_l1d arch_capabilities
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 256 KiB (8 instances)
L1i cache: 256 KiB (8 instances)
L2 cache: 2 MiB (8 instances)
L3 cache: 16 MiB (1 instance)
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Unknown: Dependent on hypervisor status
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] torch==2.1.2
[pip3] torchinfo==1.8.0
[pip3] triton==2.1.0
[conda] Could not collect

cc @ezyang @gchanan @zou3519 @kadeng @ptrblck

malfet commented 5 months ago

@bpevangelista how much memory does your 3060 have? (I have an old 2080 I can try this on.) Also, do you mind trying torch-2.1.2+cu118 to see whether it works or exhibits the same OOM?

Also, can you try setting the PYTORCH_NO_CUDA_MEMORY_CACHING environment variable to disable the caching allocator? (It can negatively affect perf, but I'm just curious.)

bpevangelista commented 5 months ago

@malfet The 3060 has 8GB. I imagine there's some virtual-memory/paging happening, as the model should need ~16GB?

I tried what you asked, results:

pip3 install --upgrade torch==2.1.2+cu118 -f https://download.pytorch.org/whl/torch_stable.html
python3 pytorch_212_cuda_oom.py  # "CUDA out of memory"

export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:64"
python3 pytorch_212_cuda_oom.py  # "CUDA out of memory"

- Tried PYTORCH_NO_CUDA_MEMORY_CACHING, which asked me to also use CUDA_LAUNCH_BLOCKING.

export PYTORCH_NO_CUDA_MEMORY_CACHING=1
export CUDA_LAUNCH_BLOCKING=1
python3 pytorch_212_cuda_oom.py

The error changed to a shorter version but is still an OOM: "RuntimeError: CUDA error: out of memory"


- Went back to 2.0.1 but with cu118 instead of cu117, so we can rule out the CUDA version.

pip3 install --upgrade torch==2.0.1+cu118 -f https://download.pytorch.org/whl/torch_stable.html
python3 pytorch_212_cuda_oom.py  # works

VineethBhanukoti commented 5 months ago

@bpevangelista facing the same error for the same specs, any resolution found?

Is it the shortage of 8GB of memory that is causing this issue?

bpevangelista commented 5 months ago

@VineethBhanukoti I reverted back to 2.0.1 in the meantime.

malfet commented 5 months ago

Is the transformers version the same in your torch-2.0 and torch-2.1 setups?

ptrblck commented 5 months ago

I cannot reproduce the issue and see the expected OOMs in all releases when limiting the memory to 8GB. Running the script without using torch.cuda.set_per_process_memory_fraction shows a memory requirement of ~24GB:

<s> Tell me a joke!

## “Minecraft is going to be on the new Xbox”

Posted on Saturday Oct 24, 2009

I like it. I can't get anything done. I keep wandering back to our computer to play it. It's not the
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  14356 MiB |  23806 MiB |  32225 MiB |  17868 MiB |
|       from large pool |  14356 MiB |  23806 MiB |  29960 MiB |  15604 MiB |
|       from small pool |      0 MiB |     18 MiB |   2265 MiB |   2264 MiB |
|---------------------------------------------------------------------------|
| Active memory         |  14356 MiB |  23806 MiB |  32225 MiB |  17868 MiB |
|       from large pool |  14356 MiB |  23806 MiB |  29960 MiB |  15604 MiB |
|       from small pool |      0 MiB |     18 MiB |   2265 MiB |   2264 MiB |
|---------------------------------------------------------------------------|
| Requested memory      |  14356 MiB |  23806 MiB |  32214 MiB |  17858 MiB |
|       from large pool |  14356 MiB |  23806 MiB |  29960 MiB |  15604 MiB |
|       from small pool |      0 MiB |     18 MiB |   2254 MiB |   2254 MiB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |  24044 MiB |  24044 MiB |  24044 MiB |      0 B   |
|       from large pool |  24022 MiB |  24022 MiB |  24022 MiB |      0 B   |
|       from small pool |     22 MiB |     22 MiB |     22 MiB |      0 B   |
|---------------------------------------------------------------------------|
| Non-releasable memory | 159206 KiB | 246984 KiB |   3900 MiB |   3744 MiB |
|       from large pool | 155648 KiB | 245760 KiB |   1528 MiB |   1376 MiB |
|       from small pool |   3558 KiB |  14661 KiB |   2372 MiB |   2368 MiB |
|---------------------------------------------------------------------------|
| Allocations           |     391    |     590    |  105903    |  105512    |
|       from large pool |     291    |     449    |     645    |     354    |
|       from small pool |     100    |     249    |  105258    |  105158    |
|---------------------------------------------------------------------------|
| Active allocs         |     391    |     590    |  105903    |  105512    |
|       from large pool |     291    |     449    |     645    |     354    |
|       from small pool |     100    |     249    |  105258    |  105158    |
|---------------------------------------------------------------------------|
| GPU reserved segments |     373    |     373    |     373    |       0    |
|       from large pool |     362    |     362    |     362    |       0    |
|       from small pool |      11    |      11    |      11    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      38    |      69    |   41977    |   41939    |
|       from large pool |      35    |      56    |     159    |     124    |
|       from small pool |       3    |      34    |   41818    |   41815    |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|

using torch.cuda.set_per_process_memory_fraction shows:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacty of 63.29 GiB of which 54.76 GiB is free. Process 2641540 has 8.52 GiB memory in use. Of the allocated memory 7.84 GiB is allocated by PyTorch, and 97.71 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

in 2.1.2+cu121 and

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 63.29 GiB total capacity; 7.84 GiB already allocated; 54.76 GiB free; 7.95 GiB allowed; 7.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

in 2.0.1+cu118.
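For reference, a minimal sketch of how a larger GPU can be capped to emulate the 8 GiB card via torch.cuda.set_per_process_memory_fraction (the tensor size is illustrative; the fraction only makes sense on a device with more than 8 GiB):

import torch

device = torch.device("cuda:0")
total_bytes = torch.cuda.get_device_properties(device).total_memory

# Cap the caching allocator at ~8 GiB of this device's memory
# (assumes the host GPU has more than 8 GiB, otherwise the fraction exceeds 1).
fraction = (8 * 1024**3) / total_bytes
torch.cuda.set_per_process_memory_fraction(fraction, device)

# Any allocation that pushes this process past ~8 GiB should now raise
# torch.cuda.OutOfMemoryError, mimicking the RTX 3060 Ti.
x = torch.empty(5 * 1024**3, dtype=torch.float16, device=device)  # ~10 GiB, expected to OOM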

bpevangelista commented 5 months ago

@malfet The transformers version is the same; the only libraries that changed are pytorch and triton.

@ptrblck If I force GPU memory to 8GB, the sample I provided does not work on either version. The 7B-parameter model in 16-bit needs close to 15GB, plus KV cache and scratch memory.

Correct me if I'm wrong, but the way pytorch+cuda works is that you reserve memory and then mmap it to the GPU. Thus, an 8GB GPU could see as much memory as my RAM (32GB). Below is my GPU memory on 2.0.1 showing 14.5GB reserved on an 8GB GPU.

|---------------------------------------------------------------------------|
| GPU reserved memory   |  14510 MiB |  14510 MiB |  14510 MiB |      0 B   |
|       from large pool |  14508 MiB |  14508 MiB |  14508 MiB |      0 B   |
|       from small pool |      2 MiB |      2 MiB |      2 MiB |      0 B   |
|---------------------------------------------------------------------------|

@malfet I saw some 3-month-old changes to the CUDA allocator related to pinning memory instead of mapping it; it appears to be disabled by default, but I'm wondering if that was part of the issue.

ptrblck commented 5 months ago

Correct me if I'm wrong, but the way pytorch+cuda works is that you reserve memory and then mmap it to the GPU. Thus, an 8GB GPU could see as much memory as my RAM (32GB). Below is my GPU memory on 2.0.1 showing 14.5GB reserved on an 8GB GPU.

No, PyTorch will use a caching mechanism to reuse memory, but will not offload GPU memory to the host by default via e.g. managed memory. Thus, I still don't understand how you can allocate more than 8GB on your device without changing the memory allocations. Are you also able to create a 14GB tensor on this device?
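For what it's worth, a minimal sketch of such a check, assuming fp16 elements (2 bytes each, so 7 billion elements is ~14 GB):

import torch

# ~14 GB in fp16: 7 * 10**9 elements * 2 bytes each.
x = torch.empty(7 * 10**9, dtype=torch.float16, device="cuda:0")

print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.1f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.1f} GiB")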

bpevangelista commented 5 months ago

@ptrblck Sorry, I forgot to reply. Yes, I can allocate a 14GB tensor on GPU "cuda:0", and I can allocate 14x1GB tensors as well.

In both cases, Torch shows ~14GB of reserved GPU memory while nvidia-smi shows close to 8GB/8GB in use. Torch only appears to fail once allocations get close to 3x my VRAM size. Thanks
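One way to make that discrepancy visible from a single process is to compare the allocator's counters against torch.cuda.mem_get_info, which wraps cudaMemGetInfo (roughly the number nvidia-smi reports). A sketch, purely illustrative:

import torch

torch.cuda.init()  # make sure the CUDA context exists before querying

reserved = torch.cuda.memory_reserved(0)   # bytes reserved by the caching allocator
free, total = torch.cuda.mem_get_info(0)   # bytes reported by the CUDA driver

print(f"allocator reserved: {reserved / 1024**3:.2f} GiB")
print(f"driver: {(total - free) / 1024**3:.2f} GiB used of {total / 1024**3:.2f} GiB")
# In this thread's WSL2 setup the two numbers diverge, which suggests
# allocations are spilling beyond the card's 8 GiB of VRAM.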

jseb3 commented 4 months ago

Same issue with WSL & RTX 3060 Ti: PyTorch 2.1.2 doesn't work (same error) and 2.0.1 works.