sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

Expose max_total_num_tokens for Token Limit Calculation in Request Handling #1900

Open hahmad2008 opened 1 day ago

hahmad2008 commented 1 day ago

Describe the bug

I need to expose the value self.max_total_num_tokens = self.profile_max_num_token(total_gpu_memory) from model_runner.py, specifically at this line: https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/model_executor/model_runner.py#L426-L427

This is necessary for the `runtime.add_request(prompt, sampling_params)` function, as I need to determine how many tokens SGLang can process per request under the current configuration. The effective token limit is the minimum of this value and the maximum context length I provide when configuring the runtime class.

This is used in the following line: self.max_req_input_len = min(self.model_config.context_len - 1, self.max_total_num_tokens - 1), which can be found here: https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/tp_worker.py#L93

This way I can know the maximum length SGLang can handle and prevent truncation on SGLang's side.
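To illustrate, if the value were exposed, a caller could mirror SGLang's internal calculation from `tp_worker.py` and truncate prompts client-side before calling `runtime.add_request(...)`. This is only a sketch: the helper name and the example numbers below are assumptions, not part of SGLang's API; only the `min(...)` formula is taken from the linked source line.

```python
def effective_max_input_len(context_len: int, max_total_num_tokens: int) -> int:
    """Mirror of SGLang's internal calculation in tp_worker.py:
    max_req_input_len = min(context_len - 1, max_total_num_tokens - 1)
    (helper name is hypothetical, not part of SGLang)."""
    return min(context_len - 1, max_total_num_tokens - 1)

# Example (made-up numbers): model context length is 8192, but profiling
# GPU memory only allows 4096 total tokens, so the memory pool is the
# binding constraint.
limit = effective_max_input_len(8192, 4096)  # 4095

# Truncate a tokenized prompt client-side instead of letting SGLang do it.
prompt_ids = list(range(5000))  # stand-in for real token IDs
truncated = prompt_ids[:limit]
```

With the limit known up front, the caller controls truncation policy (e.g. keeping the tail of a conversation rather than the head) instead of relying on SGLang's server-side behavior.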

Reproduction

-

Environment

sglang: 0.3.3.post1

hahmad2008 commented 10 hours ago

@yileld sorry for tagging. Any idea?