turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Is Tesla T4 supported? #277

Closed: ivsanro1 closed this issue 1 year ago

ivsanro1 commented 1 year ago

Hello. I have run exllama fine on an NVIDIA L4, but now I'm trying to run the same thing on a Tesla T4 and I get this error:

File ~/.local/lib/python3.10/site-packages/exllama/model.py:898, in ExLlama.__init__(self, config)
    895 device_buffers = {}
    896 self.buffers.append(device_buffers)
--> 898 temp_state = torch.zeros((config.max_input_len, config.intermediate_size), dtype = torch.float16, device = dev)
    899 temp_mlp = torch.zeros((config.fused_mlp_thd * 2, config.intermediate_size), dtype = torch.float16, device = dev)
    900 temp_zeros_float = torch.zeros((1, 65536), dtype = torch.float32, device = dev)

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

AFAIU, the T4 has proper FP16 support. I say this because of this note in the README:

I am developing on an RTX 4090 and an RTX 3090-Ti. 30-series and later NVIDIA GPUs should be well supported, but anything Pascal or older with poor FP16 support isn't going to perform well.

Do you know if exllama should work on a T4?
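
In case it's useful, here is a quick sanity check (a generic PyTorch sketch, nothing exllama-specific) comparing the card's compute capability with the architectures this torch build ships kernels for; the T4 should show (7, 5) / sm_75:

import torch

print(torch.cuda.get_device_name(0))        # should report the Tesla T4
print(torch.cuda.get_device_capability(0))  # expected (7, 5) for a T4
print(torch.cuda.get_arch_list())           # kernel archs baked into this torch build, e.g. ['sm_75', ...]
print(torch.version.cuda)                   # CUDA runtime this wheel was built against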

My versions:

!python3 -m pip show torch

Name: torch
Version: 2.0.1+cu118
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, jinja2, networkx, sympy, triton, typing-extensions
Required-by: accelerate, auto-gptq, exllama, peft, torchaudio, torchvision, triton
!nvidia-smi

Thu Sep  7 13:54:33 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   67C    P0    31W /  70W |   7419MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
ivsanro1 commented 1 year ago

For some reason, after the first error it does not fail anymore. With this workaround it works:

for dev in self.config.device_map.get_layers_devs():

    device_buffers = {}
    self.buffers.append(device_buffers)

    # Workaround: make a throwaway fp16 allocation on the device first and swallow
    # the CUDA error it raises; the real allocations below then go through.
    try:
        torch.zeros((5, 5), dtype = torch.float16, device = dev)
    except RuntimeError:
        pass

    temp_state = torch.zeros((config.max_input_len, config.intermediate_size), dtype = torch.float16, device = dev)
    temp_mlp = torch.zeros((config.fused_mlp_thd * 2, config.intermediate_size), dtype = torch.float16, device = dev)
    temp_zeros_float = torch.zeros((1, 65536), dtype = torch.float32, device = dev)
    temp_dq = torch.zeros((1, max_dq_buffer_size), dtype = torch.float16, device = dev)

However, it still fails with the same error elsewhere in the code.

The compute capability of the T4 is 7.5, so I imagine it should work.
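
FWIW, that error usually means the CUDA extension has no kernels compiled for the card's architecture (sm_75 for the T4), rather than an FP16 limitation. If the exllama extension is JIT-compiled locally with torch.utils.cpp_extension (as in the original repo), something like the following might help before the next run; this is only a guess on my side, and a prebuilt wheel won't pick it up:

# Assumption: exllama's CUDA extension is JIT-compiled locally via torch.utils.cpp_extension;
# a prebuilt wheel ignores all of this.
import os, shutil, pathlib

# Force sm_75 (T4) into the arch list *before* the extension gets (re)built.
os.environ["TORCH_CUDA_ARCH_LIST"] = "7.5"

# Drop any previously compiled extension so the next run rebuilds it with sm_75 kernels,
# then rerun the script that hit the error from this same environment.
shutil.rmtree(pathlib.Path.home() / ".cache" / "torch_extensions", ignore_errors = True)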

ivsanro1 commented 1 year ago

Closing because it's not exllama-related.