[X] 1. I have searched related issues but cannot get the expected help.
[ ] 2. The bug has not been fixed in the latest version.
[ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Checklist
Describe the bug
I need to expose the value
self.max_total_num_tokens = self.profile_max_num_token(total_gpu_memory)
from model_runner.py, specifically at this line: https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/model_executor/model_runner.py#L426-L427
This is necessary for the
runtime.add_request(prompt, sampling_params)
function, as I need to determine how many tokens SGLang can process per request under the current configuration. The effective token limit is the minimum of this value and the maximum context length I provide when configuring the runtime class.
This limit is computed in the following line:
self.max_req_input_len = min(self.model_config.context_len - 1, self.max_total_num_tokens - 1)
which can be found here: https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/tp_worker.py#L93
With this value exposed, I would know the maximum input length SGLang can handle and could avoid truncation on the SGLang side.
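To illustrate what I would like to do on the caller side, here is a minimal sketch. It only reproduces the `min(...)` expression quoted above; the helper names `effective_max_input_len` and `fits_without_truncation` are hypothetical and not part of SGLang, the numbers are made up, and I am assuming a prompt is accepted as long as its token count does not exceed the limit.

```python
def effective_max_input_len(context_len: int, max_total_num_tokens: int) -> int:
    """Hypothetical helper mirroring the min(...) expression quoted from tp_worker.py."""
    return min(context_len - 1, max_total_num_tokens - 1)


def fits_without_truncation(num_prompt_tokens: int, limit: int) -> bool:
    """Assumption: a prompt is not truncated if its length does not exceed the limit."""
    return num_prompt_tokens <= limit


# Illustrative values only; the real max_total_num_tokens comes from memory profiling.
limit = effective_max_input_len(context_len=8192, max_total_num_tokens=4096)
print(limit)
print(fits_without_truncation(5000, limit))
```

If `max_total_num_tokens` were readable from the runtime, I could run a check like this before calling `runtime.add_request(prompt, sampling_params)`.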
Reproduction
-
Environment
sglang: 0.3.3.post1