kvcache-ai / ktransformers

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
Apache License 2.0

Getting reasonable performance on dual RTX 3090 and 128gb #85

Open trilog-inc opened 2 months ago

trilog-inc commented 2 months ago

Hi,

First off thanks for all the work you guys have put into this.

I am trying to run DeepSeek-Coder-V2-Instruct-0724-GGUF Q4_K_M with reasonable performance but cannot figure it out. When I use the default configuration of the "DeepSeek-V2-Chat-multi-gpu.yaml" optimize file, I get about 0.7 t/s. I have tried to load some of the expert layers onto cuda:0 and cuda:1 but hit OOM errors when more than one layer is used. Example YAML match:

- match:
    name: "^model\\.layers\\.(0|[1])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts     # custom MoE kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cuda:0"
      generate_op:  "KExpertsTorch" # remember to use the correct backend; KExpertsCPU is only runnable on CPU.
      out_device: "cuda:0"
  recursive: False # don't recursively inject submodules of this module
- match:
    name: "^model\\.layers\\.(0|[2-9]|[12][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts     # custom MoE kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op:  "KExpertsCPU"
      out_device: "cuda:0"
  recursive: False # don't recursively inject submodules of this module
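
One thing worth checking in rules like the two above is which layers each `name` pattern actually matches. The sketch below is only an illustration: it assumes the expert modules are named `model.layers.<i>.mlp.experts` and that the model has 60 layers (both are assumptions, not stated in this thread), and it tests the two patterns with Python's `re` module. Under those assumptions layer 0 is matched by both rules; which rule takes effect for it depends on how the framework resolves multiple matches.

```python
import re

# The two `name` patterns from the YAML above, with YAML's "\\." unescaped to "\.".
GPU_RULE = r"^model\.layers\.(0|[1])\.mlp\.experts$"
CPU_RULE = r"^model\.layers\.(0|[2-9]|[12][0-9])\.mlp\.experts$"


def matched_layers(pattern, num_layers=60):
    """Return the layer indices whose expert module name matches `pattern`.

    `num_layers=60` is an assumed layer count used only for this illustration.
    """
    rx = re.compile(pattern)
    return [i for i in range(num_layers)
            if rx.match(f"model.layers.{i}.mlp.experts")]


gpu_layers = matched_layers(GPU_RULE)
cpu_layers = matched_layers(CPU_RULE)
# Layers claimed by both rules.
overlap = sorted(set(gpu_layers) & set(cpu_layers))

print("GPU rule layers:", gpu_layers)
print("CPU rule layers:", cpu_layers)
print("matched by both:", overlap)
```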

GPU Usage:


+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     71956      C   ...onda3/envs/ktransformers/bin/python       5156MiB |
|    1   N/A  N/A     71956      C   ...onda3/envs/ktransformers/bin/python       6864MiB |
+-----------------------------------------------------------------------------------------+

Has anyone been able to achieve reasonable results with this sort of setup?

System: 13th Gen Intel(R) Core(TM) i5-13600K, 128GB DDR4-3200 (4 x 32GB), 2x RTX 3090

Azure-Tang commented 2 months ago

Hi, thanks for your interest in ktransformers.

DeepSeek-V2's Q4_K_M requires 136GB of RAM. With only 128GB, data will frequently be swapped in and out of RAM, which slashes your generation speed. My advice is to increase your RAM or use the IQ4_XS format model (125GB).
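
The sizing argument above can be written as a quick check. This is a minimal sketch using only the figures quoted in this reply (136GB for Q4_K_M, 125GB for IQ4_XS, 128GB of system RAM). The function name and the `reserve_gb` parameter are hypothetical, added to show how little slack is left for the OS and other processes.

```python
def fits_in_ram(model_gb, ram_gb, reserve_gb=0.0):
    """Return True if the model, plus an optional reserve, fits in system RAM.

    `reserve_gb` is a hypothetical allowance for the OS and other processes.
    """
    return model_gb + reserve_gb <= ram_gb


# Model sizes as quoted in the reply above.
for name, size_gb in [("Q4_K_M", 136), ("IQ4_XS", 125)]:
    print(name, "fits in 128GB:", fits_in_ram(size_gb, ram_gb=128))
```

With no reserve, only IQ4_XS fits, and with just 3GB to spare; any realistic reserve makes that a tight fit too.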

trilog-inc commented 2 months ago

Hi Azure, thanks for the reply.

Unfortunately, I am using a consumer motherboard in this setup and the RAM is maxed out at 128GB.

However, I tried the IQ4_XS format with the no optimize config and the results are better.

prompt eval count: 26 token(s)
prompt eval duration: 1.7585856914520264s
prompt eval rate: 14.784607953071857 tokens/s
eval count: 921 token(s)
eval duration: 135.37466645240784s
eval rate: 6.803340862330292 tokens/s
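
For reference, the reported rates are just token counts divided by durations. The short sketch below recomputes them from the numbers above and compares the decode rate with the roughly 0.7 t/s seen earlier with Q4_K_M; the function name is illustrative only.

```python
def tokens_per_second(token_count, seconds):
    """Throughput as tokens divided by wall-clock seconds."""
    return token_count / seconds


# Counts and durations copied from the statistics above.
prefill_rate = tokens_per_second(26, 1.7585856914520264)
decode_rate = tokens_per_second(921, 135.37466645240784)

# Ratio against the ~0.7 t/s reported for Q4_K_M at the start of the thread.
speedup = decode_rate / 0.7

print(f"prefill: {prefill_rate:.2f} tokens/s")
print(f"decode:  {decode_rate:.2f} tokens/s")
print(f"decode speedup over 0.7 t/s: {speedup:.1f}x")
```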

When I try to load it with the default DeepSeek-V2-Chat-multi-gpu.yaml, I get the following CUDA error as it starts to load onto the second GPU:

...
loading blk.29.ffn_norm.weight to cuda:0
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/mnt/data/myfrienderic/ktransformers/ktransformers/local_chat.py", line 159, in <module>
    fire.Fire(local_chat)
  File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/myfrienderic/ktransformers/ktransformers/local_chat.py", line 106, in local_chat
    optimize_and_load_gguf(model, optimize_rule_path, gguf_path, config)
  File "/mnt/data/myfrienderic/ktransformers/ktransformers/optimize/optimize.py", line 129, in optimize_and_load_gguf
    load_weights(module, gguf_loader)
  File "/mnt/data/myfrienderic/ktransformers/ktransformers/util/utils.py", line 83, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/mnt/data/myfrienderic/ktransformers/ktransformers/util/utils.py", line 85, in load_weights
    module.load()
  File "/mnt/data/myfrienderic/ktransformers/ktransformers/operators/base_operator.py", line 60, in load
    utils.load_weights(child, self.gguf_loader, self.key+".")
  File "/mnt/data/myfrienderic/ktransformers/ktransformers/util/utils.py", line 83, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/mnt/data/myfrienderic/ktransformers/ktransformers/util/utils.py", line 83, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/mnt/data/myfrienderic/ktransformers/ktransformers/util/utils.py", line 83, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/mnt/data/myfrienderic/ktransformers/ktransformers/util/utils.py", line 85, in load_weights
    module.load()
  File "/mnt/data/myfrienderic/ktransformers/ktransformers/operators/base_operator.py", line 60, in load
    utils.load_weights(child, self.gguf_loader, self.key+".")
  File "/mnt/data/myfrienderic/ktransformers/ktransformers/util/utils.py", line 83, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/mnt/data/myfrienderic/ktransformers/ktransformers/util/utils.py", line 85, in load_weights
    module.load()
  File "/mnt/data/myfrienderic/ktransformers/ktransformers/operators/linear.py", line 422, in load
    self.generate_linear.load(w=w)
  File "/mnt/data/myfrienderic/ktransformers/ktransformers/operators/linear.py", line 207, in load
    w_ref, marlin_q_w, marlin_s, g_idx, sort_indices, _ = marlin_quantize(
                                                          ^^^^^^^^^^^^^^^^
  File "/mnt/data/myfrienderic/ktransformers/ktransformers/ktransformers_ext/operators/custom_marlin/quantize/utils/marlin_utils.py", line 93, in marlin_quantize
    w_ref, q_w, s, g_idx, rand_perm = quantize_weights(w, num_bits, group_size,
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/data/myfrienderic/ktransformers/ktransformers/ktransformers_ext/operators/custom_marlin/quantize/utils/quant_utils.py", line 61, in quantize_weights
    w = w.reshape((group_size, -1))
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
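
The message above suggests `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous so the reported stack trace points at the failing call. As a minimal sketch, the variable can be added to a copy of the environment before the process that uses CUDA is started; the helper name `debug_env` is hypothetical.

```python
import os


def debug_env(base=None):
    """Copy an environment mapping and turn on synchronous CUDA kernel launches."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


env = debug_env()
print("CUDA_LAUNCH_BLOCKING =", env["CUDA_LAUNCH_BLOCKING"])
```

The resulting mapping can be passed as `env=env` to `subprocess.run` together with whatever command normally launches the model.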

Loading the Q4_K_M model with the same config completes correctly, but suffers from the aforementioned poor performance.

Would it be possible to eventually leverage the extra 24GB of VRAM (+ 12GB unused on the first GPU) to load a larger model than the system RAM can handle? That is, is there a way to configure the optimize config to offload more of the model onto the GPU to compensate?

Azure-Tang commented 2 months ago

This is a bug; I just fixed it.

As for your question:

"Would it be possible to eventually leverage the extra 24GB of VRAM ( + 12GB unused on the first GPU ) to load a larger model than the system ram can handle? As in is there a way to configure the optimize config to offload more of the model on the GPU to compensate"

You could modify your YAML to offload some of the experts from the CPU to the GPU and make use of your extra VRAM. You can find a detailed tutorial here.
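
When experimenting with which expert layers to move, it can help to generate the `name` pattern for a rule instead of writing the regex by hand. The sketch below follows the pattern style used in the YAML earlier in this thread. The helper names are hypothetical, and the module naming `model.layers.<i>.mlp.experts` is taken from those patterns rather than from the framework's documentation.

```python
import re


def experts_rule_name(layers):
    """Build a regex matching the expert modules of exactly the given layers."""
    alternation = "|".join(str(i) for i in sorted(set(layers)))
    return rf"^model\.layers\.({alternation})\.mlp\.experts$"


def yaml_quoted(pattern):
    """Render a regex as a double-quoted YAML string (backslashes doubled)."""
    return '"' + pattern.replace("\\", "\\\\") + '"'


pattern = experts_rule_name([0, 1])
print("name:", yaml_quoted(pattern))

# Sanity check: layer 1 matches, layer 10 does not.
print(bool(re.match(pattern, "model.layers.1.mlp.experts")))
print(bool(re.match(pattern, "model.layers.10.mlp.experts")))
```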

trilog-inc commented 2 months ago

Thanks for the update!

I will test this throughout the weekend.

Do you have an intuition for which parameters I should try to load first? I tried the "ktransformers.operators.experts.KTransformersExperts" class but triggered an OOM with one layer. I'm not sure where to go next and would love your input.

Azure-Tang commented 2 months ago

Which backend are you using for ktransformers.operators.experts.KTransformersExperts?

trilog-inc commented 2 months ago

I'm using the following modification to the YAML:

- match:
    name: "^model\\.layers\\.(0|1)\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts     # custom MoE kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsMarlin"
      generate_device: "cuda:0"
      generate_op:  "KExpertsTorch" # remember to use the correct backend; KExpertsCPU is only runnable on CPU.
      out_device: "cuda:0"
  recursive: False # don't recursively inject submodules of this module

- match:
    name: "^model\\.layers\\.([2-9]|[12][0-9])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts     # custom MoE kernel with expert parallelism
    kwargs:
      prefill_device: "cuda:0"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"
      generate_op:  "KExpertsCPU"
      out_device: "cuda:0"
  recursive: False # don't recursively inject submodules of this module

If I use the Marlin backend, VRAM usage on the first GPU hits ~22GB during loading, then settles down to ~12GB after loading.

During Loading:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0 Off |                  N/A |
| 61%   55C    P2            176W /  370W |   22343MiB /  24576MiB |     42%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:06:00.0 Off |                  N/A |
|  0%   52C    P8             19W /  420W |       3MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     10202      C   ...onda3/envs/ktransformers/bin/python      22334MiB |
+-----------------------------------------------------------------------------------------+

After Loading:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0 Off |                  N/A |
|  0%   44C    P8             48W /  370W |   12877MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:06:00.0 Off |                  N/A |
|  0%   54C    P8             18W /  420W |    7059MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     10202      C   ...onda3/envs/ktransformers/bin/python      12868MiB |
|    1   N/A  N/A     10202      C   ...onda3/envs/ktransformers/bin/python       7050MiB |
+-----------------------------------------------------------------------------------------+

When I try to generate anything with the web UI, I get the following error in the command line:

/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/contextlib.py:105: FutureWarning: `torch.backends.cuda.sdp_kernel()` is deprecated. In the future, this context manager will be removed. Please see `torch.nn.attention.sdpa_kernel()` for the new context manager, with updated signature.
  self.gen = func(*args, **kwds)
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 257, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 253, in wrap
    await func()
  File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 230, in listen_for_disconnect
    message = await receive()
              ^^^^^^^^^^^^^^^
  File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 534, in receive
    await self.message_event.wait()
  File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/asyncio/locks.py", line 213, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f3b31cbcd50

During handling of the above exception, another exception occurred:

  + Exception Group Traceback (most recent call last):
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 406, in run_asgi
  |     result = await app(  # type: ignore[func-returns-value]
  |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
  |     return await self.app(scope, receive, send)
  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
  |     await super().__call__(scope, receive, send)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/applications.py", line 113, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/errors.py", line 187, in __call__
  |     raise exc
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/errors.py", line 165, in __call__
  |     await self.app(scope, receive, _send)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/cors.py", line 93, in __call__
  |     await self.simple_response(scope, receive, send, request_headers=headers)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/cors.py", line 144, in simple_response
  |     await self.app(scope, receive, send)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
  |     raise exc
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 715, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 735, in app
  |     await route.handle(scope, receive, send)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 288, in handle
  |     await self.app(scope, receive, send)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 76, in app
  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
  |     raise exc
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 74, in app
  |     await response(scope, receive, send)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 250, in __call__
  |     async with anyio.create_task_group() as task_group:
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
  |     raise BaseExceptionGroup(
  | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/util/cuda_graph_runner.py", line 41, in capture
    |     logits=model(inputs_embeds=inputs_embeds,
    |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    |     return forward_call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/models/modeling_deepseek.py", line 1731, in forward
    |     outputs = self.model(
    |               ^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    |     return forward_call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/models.py", line 719, in forward
    |     layer_outputs = decoder_layer(
    |                     ^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    |     return forward_call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/models/modeling_deepseek.py", line 1254, in forward
    |     hidden_states = self.mlp(hidden_states)
    |                     ^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    |     return forward_call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/experts.py", line 652, in forward
    |     y = self.moe_on_cpuinfer(hidden_states, topk_idx, topk_weight).view(*orig_shape).to(device=hidden_states.device)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    |     return func(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/experts.py", line 674, in moe_on_cpuinfer
    |     outs = self.experts(x, topk_ids, topk_weight)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    |     return forward_call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/experts.py", line 503, in forward
    |     return self.generate_experts.forward(input_tensor, expert_ids, weights)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/experts.py", line 424, in forward
    |     idx, top_x = torch.where(expert_mask[expert_idx])
    |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | RuntimeError: CUDA error: operation not permitted when stream is capturing
    | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    | For debugging consider passing CUDA_LAUNCH_BLOCKING=1
    | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
    | 
    | 
    | During handling of the above exception, another exception occurred:
    | 
    | Traceback (most recent call last):
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 253, in wrap
    |     await func()
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 242, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 80, in check_client_link
    |     async for event in async_events:
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 93, in to_stream_reply
    |     async for event in async_events:
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 87, in add_done
    |     async for event in async_events:
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 101, in filter_api_event
    |     async for event in async_events:
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/api/openai/assistants/runs.py", line 28, in inner
    |     async for event in ctx.work():
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/base.py", line 145, in work
    |     async for token in self.interface.inference(local_messages,self.thread.id):
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/transformers.py", line 330, in inference
    |     for t in self.generate():
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 36, in generator_context
    |     response = gen.send(None)
    |                ^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/transformers.py", line 290, in generate
    |     next_token = self.decode_one_tokens()
    |                  ^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/ktransformers.py", line 58, in decode_one_tokens
    |     self.cuda_graph_runner.capture(self.model, self.current_ids, self.active_cache_position.unsqueeze(0), self.active_cache_position, self.cache, main_device=torch_device, return_dict=False, use_cache=True)
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/util/cuda_graph_runner.py", line 40, in capture
    |     with torch.cuda.graph(self.graph, stream = capture_stream):
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/cuda/graphs.py", line 185, in __exit__
    |     self.cuda_graph.capture_end()
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/cuda/graphs.py", line 83, in capture_end
    |     super().capture_end()
    | RuntimeError: CUDA error: operation failed due to a previous error during capture
    | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    | For debugging consider passing CUDA_LAUNCH_BLOCKING=1
    | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
    | 
    +------------------------------------
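
The first RuntimeError in the traceback above is raised at `torch.where(expert_mask[expert_idx])` while `torch.cuda.graph` is capturing. A likely reason is that single-argument `torch.where` returns the indices of the true elements, so the size of its result depends on the data; on a CUDA device, producing it needs a synchronization with the host, which is not allowed during graph capture. The plain-Python stand-in below (1-D only, not the real kernel) illustrates just the data-dependent output size.

```python
def where_true(mask):
    """Plain-Python stand-in for single-argument torch.where on a 1-D mask.

    Returns the indices of the true entries, so the length of the result
    depends on the contents of `mask`, not only on its shape.
    """
    return [i for i, flag in enumerate(mask) if flag]


# Two masks of the same length give results of different lengths.
a = where_true([True, False, True, True])
b = where_true([False, False, True, False])

print(a, len(a))
print(b, len(b))
```

A captured CUDA graph replays a fixed sequence of kernels over fixed-size buffers, which is why an operation like this generally cannot run inside the capture.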

The same error occurs if I load it with the Torch expert:

During handling of the above exception, another exception occurred:

 + Exception Group Traceback (most recent call last):
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 406, in run_asgi
  |     result = await app(  # type: ignore[func-returns-value]
  |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
  |     return await self.app(scope, receive, send)
  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
  |     await super().__call__(scope, receive, send)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/applications.py", line 113, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/errors.py", line 187, in __call__
  |     raise exc
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/errors.py", line 165, in __call__
  |     await self.app(scope, receive, _send)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/cors.py", line 93, in __call__
  |     await self.simple_response(scope, receive, send, request_headers=headers)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/cors.py", line 144, in simple_response
  |     await self.app(scope, receive, send)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
  |     raise exc
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 715, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 735, in app
  |     await route.handle(scope, receive, send)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 288, in handle
  |     await self.app(scope, receive, send)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 76, in app
  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
  |     raise exc
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/routing.py", line 74, in app
  |     await response(scope, receive, send)
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 250, in __call__
  |     async with anyio.create_task_group() as task_group:
  |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
  |     raise BaseExceptionGroup(
  | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/util/cuda_graph_runner.py", line 41, in capture
    |     logits=model(inputs_embeds=inputs_embeds,
    |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    |     return forward_call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/models/modeling_deepseek.py", line 1731, in forward
    |     outputs = self.model(
    |               ^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    |     return forward_call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/models.py", line 719, in forward
    |     layer_outputs = decoder_layer(
    |                     ^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    |     return forward_call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/models/modeling_deepseek.py", line 1254, in forward
    |     hidden_states = self.mlp(hidden_states)
    |                     ^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    |     return forward_call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/experts.py", line 652, in forward
    |     y = self.moe_on_cpuinfer(hidden_states, topk_idx, topk_weight).view(*orig_shape).to(device=hidden_states.device)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    |     return func(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/experts.py", line 674, in moe_on_cpuinfer
    |     outs = self.experts(x, topk_ids, topk_weight)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    |     return forward_call(*args, **kwargs)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/experts.py", line 503, in forward
    |     return self.generate_experts.forward(input_tensor, expert_ids, weights)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/operators/experts.py", line 424, in forward
    |     idx, top_x = torch.where(expert_mask[expert_idx])
    |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | RuntimeError: CUDA error: operation not permitted when stream is capturing
    | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    | For debugging consider passing CUDA_LAUNCH_BLOCKING=1
    | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
    | 
    | 
    | During handling of the above exception, another exception occurred:
    | 
    | Traceback (most recent call last):
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 253, in wrap
    |     await func()
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/starlette/responses.py", line 242, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 80, in check_client_link
    |     async for event in async_events:
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 93, in to_stream_reply
    |     async for event in async_events:
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 87, in add_done
    |     async for event in async_events:
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/schemas/assistants/streaming.py", line 101, in filter_api_event
    |     async for event in async_events:
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/api/openai/assistants/runs.py", line 28, in inner
    |     async for event in ctx.work():
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/base.py", line 145, in work
    |     async for token in self.interface.inference(local_messages,self.thread.id):
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/transformers.py", line 330, in inference
    |     for t in self.generate():
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 36, in generator_context
    |     response = gen.send(None)
    |                ^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/transformers.py", line 290, in generate
    |     next_token = self.decode_one_tokens()
    |                  ^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/ktransformers.py", line 58, in decode_one_tokens
    |     self.cuda_graph_runner.capture(self.model, self.current_ids, self.active_cache_position.unsqueeze(0), self.active_cache_position, self.cache, main_device=torch_device, return_dict=False, use_cache=True)
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/ktransformers/util/cuda_graph_runner.py", line 40, in capture
    |     with torch.cuda.graph(self.graph, stream = capture_stream):
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/cuda/graphs.py", line 185, in __exit__
    |     self.cuda_graph.capture_end()
    |   File "/home/myfrienderic/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/cuda/graphs.py", line 83, in capture_end
    |     super().capture_end()
    | RuntimeError: CUDA error: operation failed due to a previous error during capture
    | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    | For debugging consider passing CUDA_LAUNCH_BLOCKING=1
    | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
    | 
    +------------------------------------

Any ideas on how to debug this?
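
One possible reading of the innermost frame, offered as a guess rather than a confirmed diagnosis: the failing call is the one-argument form of `torch.where`, which is documented as identical to `torch.nonzero(..., as_tuple=True)`. Its output length depends on the values in the mask, so on a GPU the size has to be read back to the host, and that kind of synchronization is not allowed while a CUDA graph is being captured. The sketch below runs on CPU and only illustrates the shape behaviour; the masks, `values`, and both helper functions are made up for illustration and are not taken from `experts.py`.

```python
import torch


def data_dependent_indices(mask: torch.Tensor):
    # One-argument torch.where == torch.nonzero(mask, as_tuple=True).
    # The length of each returned index tensor equals the number of True entries.
    return torch.where(mask)


def static_shape_select(mask: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    # Three-argument torch.where keeps the output shape equal to the input shape,
    # whatever the mask contains.
    return torch.where(mask, values, torch.zeros_like(values))


# Hypothetical routing masks: rows are experts, columns are tokens.
mask_a = torch.tensor([[True, False, True], [False, False, True]])
mask_b = torch.tensor([[False, False, False], [False, True, False]])
values = torch.arange(6.0).reshape(2, 3)

rows_a, cols_a = data_dependent_indices(mask_a)
rows_b, cols_b = data_dependent_indices(mask_b)

# Same input shape, different output lengths: the shape depends on the data.
print("one-arg where:", tuple(rows_a.shape), "vs", tuple(rows_b.shape))

# Same input shape, same output shape: nothing depends on the data.
out_a = static_shape_select(mask_a, values)
out_b = static_shape_select(mask_b, values)
print("three-arg where:", tuple(out_a.shape), "vs", tuple(out_b.shape))
```

If this reading is right, the failure would be tied to running the `KExpertsTorch` generate path under CUDA graph capture rather than to memory pressure. The trace alone does not confirm that.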

Said-Akbar commented 2 months ago

+1 to this. I am also impacted. I have an RTX 3090. When I use expert layers 0 and 1 with
prefill_op: "KExpertsMarlin" and generate_op: "KExpertsTorch", VRAM fills up to ~17.5GB and the model loads fine, but when I submit a prompt in the UI, I get the same error that @myfrienderic shared above.

@Azure-Tang, please let us know if there is a fix. Thanks!