vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Bug: INTERNAL ASSERT FAILED #1905

Closed: SinanAkkoyun closed this issue 9 months ago

SinanAkkoyun commented 9 months ago

Model: TheBloke/Mistral-7B-OpenOrca-AWQ (and any other of TheBloke's Mistral AWQ models)
CUDA: 12.2
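
For reference, awq_test.py boils down to the following minimal sketch. Only the LLM(...) call is taken verbatim from the traceback below; the import, prompt, and generate() call are assumptions about what the rest of the script looks like.

from vllm import LLM, SamplingParams

# This constructor call is the one that crashes (during KV-cache profiling),
# so generate() below is never reached.
llm = LLM(model="TheBloke/Mistral-7B-OpenOrca-AWQ", quantization="AWQ", trust_remote_code=True, dtype="half")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)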

WARNING 12-03 17:13:44 config.py:398] Casting torch.bfloat16 to torch.float16.
WARNING 12-03 17:13:44 config.py:140] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 12-03 17:13:44 llm_engine.py:72] Initializing an LLM engine with config: model='TheBloke/Mistral-7B-OpenOrca-AWQ', tokenizer='TheBloke/Mistral-7B-OpenOrca-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/home/ai/ml/llm/inference/vllm/awq_test.py", line 37, in <module>
    llm = LLM(model="TheBloke/Mistral-7B-OpenOrca-AWQ", quantization="AWQ", trust_remote_code=True, dtype="half")
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 93, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 231, in from_engine_args
    engine = cls(*engine_configs,
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 113, in __init__
    self._init_cache()
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 193, in _init_cache
    num_blocks = self._run_workers(
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 700, in _run_workers
    output = executor(*args, **kwargs)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 111, in profile_num_available_blocks
    self.model(
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/mistral.py", line 290, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/mistral.py", line 256, in forward
    hidden_states, residual = layer(
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/mistral.py", line 214, in forward
    hidden_states = self.mlp(hidden_states)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/mistral.py", line 78, in forward
    gate_up, _ = self.gate_up_proj(x)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 203, in forward
    output_parallel = self.linear_method.apply_weights(
  File "/home/ai/.mconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/awq.py", line 154, in apply_weights
    out = quantization_ops.awq_gemm(reshaped_x, qweight, scales, qzeros,
RuntimeError: handle_0 INTERNAL ASSERT FAILED at "../c10/cuda/driver_api.cpp":15, please report a bug to PyTorch.
HelloCard commented 9 months ago

Same error here. I'm running 2x 2080 Ti 22G, tried Python 3.10 and 3.8, and also compiled with Python 3.8 and CUDA Toolkit 12.1, all under WSL2. Every combination hits the same problem.

HelloCard commented 9 months ago

(llm) root@DESKTOP-1CSPSTT:~/vllm-main# python -m vllm.entrypoints.api_server --model /mnt/e/Code/text-generation-webui/models/orca-2-13B-AWQ --trust-remote-code --quantization awq
WARNING 12-04 08:31:50 config.py:140] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 12-04 08:31:50 llm_engine.py:73] Initializing an LLM engine with config: model='/mnt/e/Code/text-generation-webui/models/orca-2-13B-AWQ', tokenizer='/mnt/e/Code/text-generation-webui/models/orca-2-13B-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 12-04 08:36:01 llm_engine.py:218] # GPU blocks: 898, # CPU blocks: 327
Traceback (most recent call last):
  File "/root/miniconda3/envs/llm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/llm/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/vllm-main/vllm/entrypoints/api_server.py", line 80, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/root/vllm-main/vllm/engine/async_llm_engine.py", line 486, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/root/vllm-main/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/root/vllm-main/vllm/engine/async_llm_engine.py", line 305, in _init_engine
    return engine_class(*args, **kwargs)
  File "/root/vllm-main/vllm/engine/llm_engine.py", line 112, in __init__
    self._init_cache()
  File "/root/vllm-main/vllm/engine/llm_engine.py", line 230, in _init_cache
    self._run_workers("init_cache_engine", cache_config=self.cache_config)
  File "/root/vllm-main/vllm/engine/llm_engine.py", line 746, in _run_workers
    self._run_workers_in_batch(workers, method, *args, **kwargs))
  File "/root/vllm-main/vllm/engine/llm_engine.py", line 720, in _run_workers_in_batch
    output = executor(*args, **kwargs)
  File "/root/vllm-main/vllm/worker/worker.py", line 112, in init_cache_engine
    self.cache_engine = CacheEngine(self.cache_config, self.model_config,
  File "/root/vllm-main/vllm/worker/cache_engine.py", line 44, in __init__
    self.gpu_cache = self.allocate_gpu_cache()
  File "/root/vllm-main/vllm/worker/cache_engine.py", line 80, in allocate_gpu_cache
    value_blocks = torch.empty(
RuntimeError: handle_0 INTERNAL ASSERT FAILED at "../c10/cuda/driver_api.cpp":15, please report a bug to PyTorch.

SinanAkkoyun commented 9 months ago

@HelloCard

llm = LLM(model="TheBloke/Mistral-7B-OpenOrca-AWQ", quantization="AWQ", trust_remote_code=True, dtype="half", max_model_len=16384)

Setting max_model_len did it for me! Tell me if it works for you too and I'll close the issue.
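
If you're launching through the API server like in your log above, the same cap should be settable from the command line. A sketch of what I mean; I'm assuming the flag is spelled --max-model-len as in vLLM's engine args, and 2048 is just an example value, so pick whatever fits in your VRAM:

python -m vllm.entrypoints.api_server \
    --model /mnt/e/Code/text-generation-webui/models/orca-2-13B-AWQ \
    --trust-remote-code \
    --quantization awq \
    --max-model-len 2048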

HelloCard commented 9 months ago

@SinanAkkoyun Thank you, God bless you! It solved my problem.

guankaisi commented 8 months ago

I've run into the same issue, but setting max_model_len doesn't help in my case.