janhq / cortex.cpp

Local AI API Platform
https://cortex.so
Apache License 2.0

bug: tensorrt-llm 500 err - maxTokensInPagedKvCache (128) must be large enough #981

Closed gabrielle-ong closed 1 month ago

gabrielle-ong commented 2 months ago

Describe the bug
`cortex run mistral:tensorrt-llm` loads the model, but the subsequent chat request returns a 500 error.

To Reproduce

  1. cortex run mistral:tensorrt-llm
  2. cortex run mistral:tensorrt-llm --chat

Expected behavior
The chat window should appear.

Screenshots
(screenshot attached: Screenshot 2024-08-06 173206)

Error log (cortex.log):

```
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +0, now: CPU 18096, GPU 8187 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 7892 (MiB)
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 128. Allocating 16777216 bytes.
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 16
20240806 09:33:16.023000 UTC 10952 ERROR Unhandled exception in /inferences/server/loadmodel, what(): [TensorRT-LLM][ERROR] Assertion failed: maxTokensInPagedKvCache (128) must be large enough to process at least 1 sequence to completion (i.e. must be larger than beam_width (1) * tokensPerBlock (128) * maxBlocksPerSeq (16)) (C:\Users\tejaswinp\workspace\tekit\cpp\tensorrt_llm\batch_manager\kvCacheManager.cpp:754) - HttpAppFrameworkImpl.cc:124
× Model loading failed
{"method":"POST","path":"/v1/models/mistral:tensorrt-llm/start","statusCode":500,"ip":"127.0.0.1","content_length":"52","user_agent":"CortexClient/JS 0.1.6","x_correlation_id":""}
```
vansangpfiev commented 2 months ago

```
[TensorRT-LLM][INFO] Engine version 0.9.0 found in the config file, assuming engine(s) built by new builder API.
```

Hi @gabrielle-ong, I think you are running an old version; on Windows we now support trt-engine version 0.10.0. Could you please init the cortex.tensorrt-llm engine, update your models, and try again?

0xSage commented 1 month ago

Stale, closing. @gabrielle-ong, please verify and reopen if this is still an issue on the latest version.