vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

AWQ: bfloat16 not supported? And `--dtype` arg doesn't allow specifying float16 #1114

Closed. TheBloke closed this issue 6 months ago.

TheBloke commented 1 year ago

Hi guys

I had a report earlier today from a user telling me that he tried one of my new AWQ models, and got an error indicating that only float16 is supported with AWQ.

I tested it myself with the server and found the same; e.g. trying to run https://huggingface.co/TheBloke/Spicyboros-13B-2.2-AWQ gives this output:

INFO 09-20 19:09:33 llm_engine.py:72] Initializing an LLM engine with config: model='TheBloke/Spicyboros-13B-2.2-AWQ', tokenizer='TheBloke/Spicyboros-13B-2.2-AWQ', tokenizer_mode=auto, revision=None, trust_remote_code=False, dtype=torch.bfloat16, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/vllm/vllm/vllm/entrypoints/api_server.py", line 83, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/vllm/vllm/vllm/engine/async_llm_engine.py", line 486, in from_engine_args
    engine = cls(engine_args.worker_use_ray,
  File "/home/vllm/vllm/vllm/engine/async_llm_engine.py", line 270, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/vllm/vllm/vllm/engine/async_llm_engine.py", line 306, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/vllm/vllm/vllm/engine/llm_engine.py", line 105, in __init__
    self._init_workers(distributed_init_method)
  File "/home/vllm/vllm/vllm/engine/llm_engine.py", line 137, in _init_workers
    self._run_workers(
  File "/home/vllm/vllm/vllm/engine/llm_engine.py", line 685, in _run_workers
    output = executor(*args, **kwargs)
  File "/home/vllm/vllm/vllm/worker/worker.py", line 67, in init_model
    self.model = get_model(self.model_config)
  File "/home/vllm/vllm/vllm/model_executor/model_loader.py", line 81, in get_model
    raise ValueError(
ValueError: torch.bfloat16 is not supported for quantization method awq. Supported dtypes: [torch.float16]

Firstly: is it expected that AWQ will fail to load as bfloat16? Could that be supported?

Right now the only solution for the user is to download the model and manually edit config.json to set torch_dtype=float16, which is a bit of a pain.
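For reference, the manual workaround looks roughly like this (a sketch assuming the standard Hugging Face config.json layout, where the dtype is stored under the torch_dtype key):

import json

# Flip the advertised dtype so loaders that honor config.json (e.g. vLLM's
# dtype="auto") pick float16 instead of bfloat16.
with open("config.json") as f:
    cfg = json.load(f)

cfg["torch_dtype"] = "float16"

with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2)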

So, secondly: could we get a --dtype float16 option so this can at least be avoided easily? The valid options for --dtype are currently 'auto', 'half', 'bfloat16', and 'float'; there's no way to specify float16 explicitly (I guess because it's assumed to be the default).

I could update config.json in all my AWQ repos to change any bfloat16 to float16 instead, but first it'd be good to know how easy it would be to support bfloat16.

Thanks

WoosukKwon commented 1 year ago

@TheBloke Thanks for reporting the issue.

Currently, AWQ only supports FP16 because AWQ's GEMM kernel only supports it. You can use half instead of float16 for now; we will allow float16 once #1115 is merged. Besides, we are planning to replace the AWQ CUDA kernels with a more optimized and general implementation. Sorry for the inconvenience.
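For reference, a minimal sketch of that workaround with the offline LLM API (the exact keyword arguments may differ between vLLM versions):

from vllm import LLM, SamplingParams

# Request "half" explicitly instead of relying on the bfloat16 torch_dtype
# baked into the model's config.json.
llm = LLM(
    model="TheBloke/Spicyboros-13B-2.2-AWQ",
    quantization="awq",
    dtype="half",
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)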

TheBloke commented 1 year ago

Ahh, so half is the same as float16? For some reason I thought that was int8 (half of float16), my bad!

OK I will add a note to my README that users should add --dtype half to ensure all models work.
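i.e. something along these lines for the server (mirroring the api_server command from the traceback above; exact flags may vary by vLLM version):

python -m vllm.entrypoints.api_server \
    --model TheBloke/Spicyboros-13B-2.2-AWQ \
    --quantization awq \
    --dtype half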

Many thanks for the fast reply.

WoosukKwon commented 1 year ago

@TheBloke Yeah, if I understand correctly, half is equal to float16 at least in the context of deep learning.

[Screenshot attached: Screen Shot 2023-09-20 at 2.20.01 PM]
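You can also check this directly in PyTorch, since torch.half is just an alias for torch.float16:

import torch

# "half" and "float16" name the same 16-bit IEEE floating-point dtype.
assert torch.half is torch.float16
print(torch.finfo(torch.half))  # ... max=65504, dtype=float16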

I've tried your model https://huggingface.co/TheBloke/Spicyboros-13B-2.2-AWQ with half, and it looks like the model is generating valid output sentences. I hope this can serve as a temporary fix until we improve the kernels.

casper-hansen commented 1 year ago

Besides, we are planning to replace the AWQ CUDA kernels with a more optimized and general implementation. Sorry for the inconvenience.

@WoosukKwon If you need a new format for the INT4 packed weights to optimize throughput, let me know and we can work it into AutoAWQ.

INT8 can run faster because it can utilize the INT8 tensor cores, but unfortunately INT4 cannot do the same yet, so you need to dequantize the weights and run the operations in FP16.
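To illustrate with a toy sketch (the packing layout here is a made-up simplification, not AutoAWQ's or vLLM's actual kernel, which interleaves the nibbles): a W4A16 linear first unpacks the 4-bit weights and dequantizes them with per-group scales and zero points, then runs an ordinary FP16 GEMM.

import torch

def dequant_int4_to_fp16(qweight, scales, zeros, group_size=128):
    # qweight: [in_features // 8, out_features] int32, eight 4-bit values per int32
    # scales, zeros: [in_features // group_size, out_features] float16
    shifts = torch.arange(0, 32, 4, device=qweight.device)
    w = (qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF   # unpack nibbles
    w = w.reshape(-1, qweight.shape[-1]).to(torch.float16)      # [in_features, out_features]
    group = torch.arange(w.shape[0], device=w.device) // group_size
    return (w - zeros[group]) * scales[group]                   # per-group dequantization

# y = x @ dequant_int4_to_fp16(qweight, scales, zeros)  # then a plain FP16 GEMM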

Best of luck with the improved implementation.

sprezz-arthur commented 1 year ago

we are planning to replace the AWQ CUDA kernels with a more optimized and general implementation.

Is support for GPUs with compute capability lower than 7.5 on the roadmap?

leojames commented 11 months ago

When I use the above method for inference with CodeLlama, I encounter CUDA kernel errors. Could you help me understand why?

WARNING: WatchFiles detected changes in 'fastapi_vllm_codellama.py'. Reloading...
INFO 10-31 16:58:55 llm_engine.py:72] Initializing an LLM engine with config: model='./CodeLlama-13B-AWQ', tokenizer='./CodeLlama-13B-AWQ', tokenizer_mode=auto, revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 10-31 16:58:55 tokenizer.py:30] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
Process SpawnProcess-46:
Traceback (most recent call last):
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/_subprocess.py", line 76, in subprocess_started
    target(sockets=sockets)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/server.py", line 61, in run
    return asyncio.run(self.serve(sockets=sockets))
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/server.py", line 68, in serve
    config.load()
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/config.py", line 473, in load
    self.loaded_app = import_from_string(self.app)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/importer.py", line 21, in import_from_string
    module = importlib.import_module(module_str)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/mnt/gpu/code/fastapi_vllm_codellama.py", line 22, in <module>
    llm = LLM(model="./CodeLlama-13B-AWQ", quantization="awq")
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 89, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 229, in from_engine_args
    engine = cls(*engine_configs,
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 111, in __init__
    self._init_cache()
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 191, in _init_cache
    num_blocks = self._run_workers(
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 692, in _run_workers
    output = executor(*args, **kwargs)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/worker/worker.py", line 109, in profile_num_available_blocks
    self.model(
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 297, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 257, in forward
    hidden_states = layer(
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 216, in forward
    hidden_states = self.mlp(hidden_states)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 81, in forward
    gate_up, _ = self.gate_up_proj(x)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/tensor_parallel/layers.py", line 238, in forward
    output_parallel = self.apply_weights(input_parallel, bias)
  File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/layers/quantized_linear/awq.py", line 55, in apply_weights
    out = quantization_ops.awq_gemm(reshaped_x, self.qweight, self.scales,
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

ParisNeo commented 9 months ago

ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.

I just tried to load a simple GPTQ 7B model, and I have a GPU with 12 GB of VRAM. I don't understand why it is doing this!
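If I understand the message correctly, the suggested fix would look roughly like this (I haven't verified it actually fits on a 12 GB card; the model name is just a placeholder):

from vllm import LLM

llm = LLM(
    model="TheBloke/some-7B-GPTQ",   # placeholder; use the actual model path
    quantization="gptq",
    dtype="half",
    gpu_memory_utilization=0.95,     # default is 0.9; lets vLLM use more of the VRAM
    max_model_len=4096,              # smaller context -> fewer KV-cache blocks required
)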

hmellor commented 6 months ago

Closing, because --dtype now supports specifying float16.

jradikk commented 4 weeks ago

ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine. I just tried to load a simple GPTQ 7B model, and I have a GPU with 12 GB of VRAM. I don't understand why it is doing this!

I seem to have the same problem trying to run this model, https://huggingface.co/hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4, with dtype=float16 and a Ray cluster.
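For context, this is roughly how I understand such a model is supposed to be sharded (the values are illustrative, not a verified working config for this setup):

from vllm import LLM

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    quantization="awq",
    dtype="float16",
    tensor_parallel_size=4,          # illustrative; must match the GPUs available across the Ray cluster
    gpu_memory_utilization=0.95,
    max_model_len=8192,              # illustrative; reduce if cache blocks still don't fit
)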