Status: Closed (TheBloke closed this issue 6 months ago)
@TheBloke Thanks for reporting the issue.
Currently, AWQ only supports FP16 because AWQ's GEMM kernel only supports FP16. For now, you can use `half` instead of `float16`. We will allow `float16` once #1115 is merged. Besides, we are planning to replace the AWQ CUDA kernels with a more optimized and general implementation. Sorry for the inconvenience.
Ahh, so `half` is the same as `float16`? For some reason I thought that was int8 (half of float16), my bad!
OK, I will add a note to my README that users should add `--dtype half` to ensure all models work.
Many thanks for the fast reply.
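On the naming question: "half" is short for half precision, the 16-bit IEEE 754 floating-point format, so it is a float type and not an 8-bit integer. In PyTorch, `torch.half` is an alias for `torch.float16`. A small standard-library sketch of the same idea (the helper name `to_half` is made up for illustration):

```python
import struct

# The struct format character "e" is IEEE 754 binary16, i.e. half precision.
HALF_SIZE_BYTES = struct.calcsize("<e")  # 2 bytes = 16 bits


def to_half(x: float) -> float:
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Fractional values survive, which an int8 type could not represent.
exact = to_half(1.5)
# 0.1 is not exactly representable in 16 bits, so it is rounded.
rounded = to_half(0.1)

print(HALF_SIZE_BYTES, exact, rounded)
```

The rounded value differs slightly from 0.1, which is the precision cost of storing a float in 16 bits.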
@TheBloke Yeah, if I understand correctly, `half` is equal to `float16`, at least in the context of deep learning.
I've tried your model https://huggingface.co/TheBloke/Spicyboros-13B-2.2-AWQ with `half`, and it looks like the model is generating valid output sentences. Hope this can be a temporary fix before we improve the kernels.
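For anyone using the Python API rather than the server, the same workaround can be expressed through the `dtype` argument. This is a minimal sketch, assuming vLLM is installed and a supported GPU is available. The helper names `load_awq_model` and `generate_once` are made up for illustration, and the vLLM imports are deferred so the snippet can be imported on a machine without a GPU:

```python
# Engine arguments for the workaround discussed above.
AWQ_ENGINE_KWARGS = {"quantization": "awq", "dtype": "half"}


def load_awq_model(model_name: str):
    """Load an AWQ checkpoint while forcing FP16 instead of the config's dtype."""
    from vllm import LLM  # deferred import: requires vLLM and a GPU

    return LLM(model=model_name, **AWQ_ENGINE_KWARGS)


def generate_once(model_name: str, prompt: str) -> str:
    """Generate a short completion to check that the output looks sensible."""
    from vllm import SamplingParams

    llm = load_awq_model(model_name)
    outputs = llm.generate([prompt], SamplingParams(max_tokens=64))
    return outputs[0].outputs[0].text


# Example call (not executed here because it downloads and loads a 13B model):
# print(generate_once("TheBloke/Spicyboros-13B-2.2-AWQ", "Hello, my name is"))
```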
> Besides, we are planning to replace AWQ CUDA kernels with more optimized and general implementation. Sorry for the inconvenience.
@WoosukKwon If you need to create a new format for the INT4 packed weights to optimize throughput, let me know and we can work it into AutoAWQ.
INT8 can run faster because it can utilize the INT8 tensor cores, but unfortunately INT4 cannot do the same yet, so you need to dequantize the weights and run the operations in FP16.
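To illustrate what "dequantize, then compute in FP16" means, here is a simplified pure-Python sketch. It is not the AWQ kernel: the nibble order, the single per-group `scale` and `zero_point`, and all function names are assumptions chosen for readability, and real kernels may use a different packing layout and do this on the GPU.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def pack_int4(values):
    """Pack eight unsigned 4-bit integers into one 32-bit word, lowest nibble first."""
    assert len(values) == 8 and all(0 <= v <= 15 for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)
    return word


def unpack_int4(word):
    """Recover the eight 4-bit integers from a packed 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def dequantize(word, scale, zero_point):
    """Turn packed INT4 weights back into FP16 values: (q - zero_point) * scale."""
    return [to_fp16((q - zero_point) * scale) for q in unpack_int4(word)]


quantized = [0, 3, 7, 8, 9, 12, 15, 1]
packed = pack_int4(quantized)
weights_fp16 = dequantize(packed, scale=0.5, zero_point=8)

# The multiply-accumulate then runs on FP16 values, not on the INT4 codes.
activations = [1.0, -2.0, 0.5, 0.25, 4.0, 1.0, -1.0, 2.0]
dot = to_fp16(sum(w * a for w, a in zip(weights_fp16, activations)))
print(weights_fp16, dot)
```

The storage stays at 4 bits per weight, but the arithmetic itself happens in FP16, which is why the kernel is tied to that dtype.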
Best of luck with the improved implementation.
> we are planning to replace AWQ CUDA kernels with more optimized and general implementation.
Is support for GPUs with compute capability lower than 7.5 on the roadmap?
When I use the above method for inference with CodeLlama, I encounter CUDA kernel errors. Could you help me understand why?
WARNING: WatchFiles detected changes in 'fastapi_vllm_codellama.py'. Reloading...
INFO 10-31 16:58:55 llm_engine.py:72] Initializing an LLM engine with config: model='./CodeLlama-13B-AWQ', tokenizer='./CodeLlama-13B-AWQ', tokenizer_mode=auto, revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 10-31 16:58:55 tokenizer.py:30] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
Process SpawnProcess-46:
Traceback (most recent call last):
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/_subprocess.py", line 76, in subprocess_started
    target(sockets=sockets)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/server.py", line 61, in run
    return asyncio.run(self.serve(sockets=sockets))
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/server.py", line 68, in serve
    config.load()
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/config.py", line 473, in load
    self.loaded_app = import_from_string(self.app)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/importer.py", line 21, in import_from_string
    module = importlib.import_module(module_str)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/mnt/gpu/code/fastapi_vllm_codellama.py", line 22, in <module>
    llm = LLM(model="./CodeLlama-13B-AWQ", quantization="awq")
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 89, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 229, in from_engine_args
    engine = cls(*engine_configs,
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 111, in __init__
    self._init_cache()
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 191, in _init_cache
    num_blocks = self._run_workers(
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 692, in _run_workers
    output = executor(*args, **kwargs)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/worker/worker.py", line 109, in profile_num_available_blocks
    self.model(
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 297, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 257, in forward
    hidden_states = layer(
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 216, in forward
    hidden_states = self.mlp(hidden_states)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 81, in forward
    gate_up, _ = self.gate_up_proj(x)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/tensor_parallel/layers.py", line 238, in forward
    output_parallel = self.apply_weights(input_parallel, bias)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/layers/quantized_linear/awq.py", line 55, in apply_weights
    out = quantization_ops.awq_gemm(reshaped_x, self.qweight, self.scales,
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
I am just trying to load a simple GPTQ 7B model, and I have a GPU with 12 GB of VRAM. I don't understand why it is doing this!
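The error message names `gpu_memory_utilization`, which is an argument of vLLM's `LLM` constructor giving the fraction of GPU memory the engine may use (0.9 by default, as far as I know). The sketch below only shows how that argument could be passed; it is not a diagnosis of this particular failure, and other factors may be involved. The model path and helper names are hypothetical, and `quantization="gptq"` assumes a vLLM version with GPTQ support.

```python
def build_engine_kwargs(model, quantization, gpu_memory_utilization=0.9):
    """Collect LLM constructor arguments, validating the memory fraction."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in the range (0, 1]")
    return {
        "model": model,
        "quantization": quantization,
        "dtype": "half",
        "gpu_memory_utilization": gpu_memory_utilization,
    }


def create_engine(kwargs):
    """Build the engine; the import is deferred because it needs vLLM and a GPU."""
    from vllm import LLM

    return LLM(**kwargs)


# Hypothetical local path; let the engine use a larger share of the 12 GB card.
kwargs = build_engine_kwargs("./my-7b-gptq-model", "gptq", gpu_memory_utilization=0.95)
# llm = create_engine(kwargs)  # not executed here
```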
Closing because `--dtype` now supports specifying `float16`.
> ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine. I just try to load a simple gptq 7B model and I have a GPU with 12G of VRAM. I don't understand why it is doing this!
I seem to have the same problem trying to run this model https://huggingface.co/hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 with `dtype=float16` and a Ray cluster.
Hi guys
I had a report earlier today from a user telling me that he tried one of my new AWQ models, and got an error indicating that only float16 is supported with AWQ.
I tested it myself with the server and found the same, e.g. trying to run https://huggingface.co/TheBloke/Spicyboros-13B-2.2-AWQ gives this output:
Firstly: is it expected that AWQ will fail to load as bfloat16? Could that be supported?
Right now the only solution for the user is to download the model and manually edit `config.json` to set `torch_dtype=float16`, which is a bit of a pain.
So, secondly: could we get a `--dtype float16` option so at least it can be easily avoided with an option? The valid options for `--dtype` are: `'auto', 'half', 'bfloat16', 'float'` - there's no way to specify float16 (I guess it assumes that's the default).
I could update config.json in all my AWQ repos, to change any `bfloat16` to `float16` instead, but first it'd be good to know how easy it would be to support bfloat16.
Thanks
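The manual `config.json` edit described above could be scripted along these lines. This is a rough sketch, assuming the file is a plain JSON object with a top-level `torch_dtype` key as in Hugging Face model configs; the function name `force_float16` is made up, and the demo writes to a temporary directory rather than a real model folder.

```python
import json
import tempfile
from pathlib import Path


def force_float16(config_path) -> bool:
    """Rewrite torch_dtype from bfloat16 to float16. Returns True if the file changed."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    if config.get("torch_dtype") != "bfloat16":
        return False  # nothing to do, leave the file untouched
    config["torch_dtype"] = "float16"
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return True


# Demo on a throwaway config so no real model files are modified.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "config.json"
    demo.write_text(json.dumps({"torch_dtype": "bfloat16"}), encoding="utf-8")
    first_run = force_float16(demo)   # changes the file
    second_run = force_float16(demo)  # already float16, so no change
    result = json.loads(demo.read_text(encoding="utf-8"))["torch_dtype"]

print(first_run, second_run, result)
```

Passing `--dtype half` at launch avoids touching the file at all, which is why it was suggested as the simpler workaround in the replies.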