tloen / llama-int8

Quantized inference code for LLaMA models
GNU General Public License v3.0

RTX4090 CUDA out of memory. #7

Closed WuNein closed 1 year ago

WuNein commented 1 year ago

I am using the latest NVIDIA PyTorch Docker image, which supports CUDA 12. I compiled the CUDA 11.8 version of the bitsandbytes library, since the code requires bitxxx_cuda118.so. The 7B model runs fine, but the 13B model fails with CUDA out of memory; it comes up roughly 1-2 GB short.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 23.65 GiB total capacity; 22.68 GiB already allocated; 41.31 MiB free; 23.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

There is no system (RAM) OOM error; the machine has 64 GB of memory installed.

I doubt whether the RTX 4090 can actually run the 13B model. Please share more detailed information about your device.
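(Side note on the allocator hint in the error above: the max_split_size_mb setting can be applied through the PYTORCH_CUDA_ALLOC_CONF environment variable before the first CUDA allocation. A minimal sketch follows; the 128 MiB split size is only illustrative, not a tuned recommendation.)

```python
# Sketch: applying the allocator hint from the OOM message.
# The 128 MiB split size is an illustrative value, not a recommendation.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # must be set before any CUDA allocation

import torch

print(torch.version.cuda)          # CUDA version this PyTorch build targets
print(torch.cuda.mem_get_info(0))  # (free_bytes, total_bytes) on GPU 0
```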

ElRoberto538 commented 1 year ago

Just a random guess, as I haven't tried it yet, but is ECC enabled on your card? Try disabling it with nvidia-smi -e 0; you can re-enable it later with nvidia-smi -e 1.
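(If you want to confirm the ECC state programmatically before toggling it, nvidia-smi's ECC report can be dumped from Python; a rough sketch, where ecc_report is just a hypothetical helper name.)

```python
import subprocess

def ecc_report() -> str:
    """Return nvidia-smi's ECC section for all GPUs (hypothetical helper)."""
    result = subprocess.run(
        ["nvidia-smi", "-q", "-d", "ECC"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

print(ecc_report())
```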

WuNein commented 1 year ago

> Just a random guess, as I haven't tried it yet, but is ECC enabled on your card? Try disabling it with nvidia-smi -e 0; you can re-enable it later with nvidia-smi -e 1.

The 4090 does not have ECC... The first checkpoint shard alone takes about 13 GB, and what remains is not enough for the rest.
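(For reference, PyTorch can report exactly how much of the 24 GB is consumed once the weights are on the GPU; a quick sketch using standard torch.cuda calls.)

```python
import torch

# Run this after the model/checkpoint has been moved to the GPU.
allocated = torch.cuda.memory_allocated(0) / 2**30  # tensors currently allocated
reserved = torch.cuda.memory_reserved(0) / 2**30    # memory held by the caching allocator
free, total = (x / 2**30 for x in torch.cuda.mem_get_info(0))
print(f"allocated {allocated:.2f} GiB, reserved {reserved:.2f} GiB, "
      f"free {free:.2f} / {total:.2f} GiB")
```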

WuNein commented 1 year ago

Changing max_seq_len to 256 makes it possible to run the 13B model on the 4090.
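(For anyone else hitting this: lowering max_seq_len shrinks the KV cache the model pre-allocates per layer, which is what frees up the extra memory. A sketch of where the value goes, based on the upstream LLaMA example script; import paths and argument names may differ slightly in this repo.)

```python
# Sketch based on the upstream LLaMA example; names/paths may differ in this repo.
import json
from pathlib import Path

from llama import ModelArgs  # assumed import

ckpt_dir = Path("weights/13B")  # illustrative checkpoint directory
with open(ckpt_dir / "params.json") as f:
    params = json.load(f)

model_args = ModelArgs(
    max_seq_len=256,   # reduced from the usual 512 to fit the 13B model in 24 GB
    max_batch_size=1,  # illustrative
    **params,
)
```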