kvcache-ai / ktransformers

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
Apache License 2.0

Error Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. #96

Open drrros opened 1 month ago

drrros commented 1 month ago

I'm trying to run a DeepSeek-V2.5 model. Command used: python -m ktransformers.local_chat --model_path ./DeepSeek-V2.5/ --gguf_path ../

Chat: hi
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/drros/deepseek2.5/ktransformers/ktransformers/local_chat.py", line 159, in <module>
    fire.Fire(local_chat)
  File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drros/deepseek2.5/ktransformers/ktransformers/local_chat.py", line 153, in local_chat
    generated = prefill_and_generate(
                ^^^^^^^^^^^^^^^^^^^^^
  File "/home/drros/deepseek2.5/ktransformers/ktransformers/util/utils.py", line 150, in prefill_and_generate
    logits = model(
             ^^^^^^
  File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drros/deepseek2.5/ktransformers/ktransformers/models/modeling_deepseek.py", line 1731, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drros/deepseek2.5/ktransformers/ktransformers/operators/models.py", line 719, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drros/deepseek2.5/ktransformers/ktransformers/models/modeling_deepseek.py", line 1238, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drros/deepseek2.5/ktransformers/ktransformers/operators/attention.py", line 170, in forward
    return self.forward_chunck(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/drros/deepseek2.5/ktransformers/ktransformers/operators/attention.py", line 71, in forward_chunck
    q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/drros/deepseek2.5/ktransformers/ktransformers/models/modeling_deepseek.py", line 113, in forward
    hidden_states = hidden_states.to(torch.float32)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: invalid device function
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I've tried both ktransformers.local_chat and the web interface mode, with and without the --optimize_config_path option. The model loads, but the first interaction fails with this traceback. Other backends (koboldcpp and llama.cpp) run fine. Server specs: 200 GB RAM, dual P40.
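The error text above suggests `CUDA_LAUNCH_BLOCKING=1` for debugging. A minimal sketch of setting it from Python follows; the helper name `enable_blocking_launches` is hypothetical, and it assumes the variable is set before `torch` is imported or any CUDA work starts, so that kernel launches are synchronous and the traceback points at the failing call.

```python
import os


def enable_blocking_launches(env=os.environ):
    """Ask the CUDA runtime for synchronous kernel launches.

    Hypothetical helper for illustration. It only sets the environment
    variable named in the error message; it must run before CUDA is
    initialized (safest: before importing torch).
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env["CUDA_LAUNCH_BLOCKING"]


enable_blocking_launches()
# Import torch and start the model only after this point.
```

This does not fix the error. It only makes the reported stack trace more likely to match the kernel that actually failed.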

drrros commented 1 month ago

These are the full logs in API mode:

COMPLETION INPUT:----
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a world-class AI system, capable of complex reasoning and reflection. Reason through the query inside <thinking> tags, and then provide your final response inside <output> tags. If you detect that you made a mistake in your reasoning at any point, correct yourself inside <reflection> tags.<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{assistant_response}<|eot_id|><|start_header_id|>user<|end_header_id|>

{next_user_prompt}<|eot_id|>
***
DrRos: text
DrRos: test
Assistant:
----
INFO:     192.168.0.77:58360 - "POST /v1/completions HTTP/1.1" 200 OK
2024-10-02 19:29:03,525 DEBUG /home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/transformers.py[240]: input_ids: torch.Size([1, 202])
2024-10-02 19:29:03,529 DEBUG /home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/transformers.py[262]: cache position: 0 to 202
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 257, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 253, in wrap
    await func()
  File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 230, in listen_for_disconnect
    message = await receive()
              ^^^^^^^^^^^^^^^
  File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 534, in receive
    await self.message_event.wait()
  File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/asyncio/locks.py", line 213, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f0f919ca950

During handling of the above exception, another exception occurred:

  + Exception Group Traceback (most recent call last):
  |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 406, in run_asgi
  |     result = await app(  # type: ignore[func-returns-value]
  |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
  |     return await self.app(scope, receive, send)
  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
  |     await super().__call__(scope, receive, send)
  |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/applications.py", line 113, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/errors.py", line 187, in __call__
  |     raise exc
  |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/errors.py", line 165, in __call__
  |     await self.app(scope, receive, _send)
  |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
  |     await self.app(scope, receive, send)
  |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
  |     raise exc
  |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 715, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 735, in app
  |     await route.handle(scope, receive, send)
  |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 288, in handle
  |     await self.app(scope, receive, send)
  |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 76, in app
  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
  |     raise exc
  |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 74, in app
  |     await response(scope, receive, send)
  |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 250, in __call__
  |     async with anyio.create_task_group() as task_group:
  |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 736, in __aexit__
  |     raise BaseExceptionGroup(
  | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 253, in wrap
    |     await func()
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 242, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 80, in check_client_link
    |     async for event in async_events:
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 93, in to_stream_reply
    |     async for event in async_events:
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 87, in add_done
    |     async for event in async_events:
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/api/openai/legacy/completions.py", line 23, in inner
    |     async for token in interface.inference(create.prompt,id):
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/transformers.py", line 323, in inference
    |     for t in self.prefill(input_ids,self.check_is_new(thread_id)):
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 36, in generator_context
    |     response = gen.send(None)
    |                ^^^^^^^^^^^^^^
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/transformers.py", line 272, in prefill
    |     logits = self.model(
    |              ^^^^^^^^^^^
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    |     return forward_call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/models/modeling_deepseek.py", line 1731, in forward
    |     outputs = self.model(
    |               ^^^^^^^^^^^
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    |     return forward_call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/models.py", line 719, in forward
    |     layer_outputs = decoder_layer(
    |                     ^^^^^^^^^^^^^^
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    |     return forward_call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/models/modeling_deepseek.py", line 1238, in forward
    |     hidden_states, self_attn_weights, present_key_value = self.self_attn(
    |                                                           ^^^^^^^^^^^^^^^
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    |     return forward_call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/attention.py", line 170, in forward
    |     return self.forward_chunck(
    |            ^^^^^^^^^^^^^^^^^^^^
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/attention.py", line 71, in forward_chunck
    |     q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))
    |                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    |     return forward_call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/drros/miniconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/models/modeling_deepseek.py", line 113, in forward
    |     hidden_states = hidden_states.to(torch.float32)
    |                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | RuntimeError: CUDA error: invalid device function
    | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    | For debugging consider passing CUDA_LAUNCH_BLOCKING=1
    | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
    | 
    +------------------------------------

versions:

(base) drros@tesla-ubuntu:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
(base) drros@tesla-ubuntu:~$ uname -a
Linux tesla-ubuntu 5.15.0-122-generic #132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Chain-Mao commented 1 month ago

I have the same problem. Have you solved it?

drrros commented 1 month ago

> I have the same problem. Have you solved it?

No. I tried different versions of torch, but that did not solve the issue. What GPUs are you using? Mine are P40s.

Chain-Mao commented 1 month ago

> > I have the same problem. Have you solved it?
>
> No. I tried different versions of torch, but that did not solve the issue. What GPUs are you using? Mine are P40s.

My GPU is a V100. I may have found the reason: the V100 uses the Volta architecture, which prevents me from using flash-attn normally, so most of this project is incompatible with it.

UnicornChan commented 1 month ago

> > I have the same problem. Have you solved it?
>
> No. I tried different versions of torch, but that did not solve the issue. What GPUs are you using? Mine are P40s.

Sorry, we use the Marlin operator to compute layers on the GPU, and it requires Compute Capability 8.0 or above. The P40's Compute Capability is 6.1, so it cannot run. You may be able to run the model by removing the Marlin operator. @Azure-Tang, could you show how to remove Marlin?
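
For anyone who wants to check their own hardware against the requirement above, here is a minimal sketch. The 8.0 threshold comes from the comment above; the names `MARLIN_MIN_CAPABILITY`, `supports_marlin`, and `report_local_gpus` are hypothetical and not part of ktransformers. It uses only `torch.cuda.get_device_capability` and related standard PyTorch calls.

```python
# Assumption: the minimum of 8.0 is taken from the maintainer's comment above.
MARLIN_MIN_CAPABILITY = (8, 0)


def supports_marlin(capability, minimum=MARLIN_MIN_CAPABILITY):
    """Return True if a (major, minor) compute capability meets the minimum."""
    return tuple(capability) >= tuple(minimum)


def report_local_gpus():
    """List (name, capability, ok) for each visible CUDA device.

    Returns an empty list if torch is not installed or no GPU is visible.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    results = []
    for index in range(torch.cuda.device_count()):
        capability = torch.cuda.get_device_capability(index)
        name = torch.cuda.get_device_name(index)
        results.append((name, capability, supports_marlin(capability)))
    return results


if __name__ == "__main__":
    for name, capability, ok in report_local_gpus():
        status = "meets" if ok else "is below"
        print(f"{name}: compute capability {capability[0]}.{capability[1]} "
              f"{status} the 8.0 minimum")
```

By this check, both GPUs mentioned in the thread fall short: the P40 reports 6.1 and the V100 reports 7.0.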