Open jeffrey-fong opened 2 weeks ago
@yichuan520030910320 may help take a look
For your first question ("May I know what is the reason for having `prompt_len - 2` instead of `prompt_len`?"): it is because we calculate conservatively. First, -1 is needed because the previous token's information is required to compute the logprob of the next token. Second, -2 guards against the case where two pieces of text sometimes merge into a single token after tokenization. Using -2 is always safe: in the worst case it only recomputes a few logprobs, and it ensures these corner cases are handled correctly.
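To illustrate the merging concern, here is a minimal sketch (not SGLang code; it uses the Hugging Face `gpt2` tokenizer purely as an example of a BPE tokenizer) showing that encoding two pieces of text separately and encoding their concatenation can give different token counts at the boundary:

```python
# Minimal illustration of boundary-token merging (assumes the Hugging Face
# "gpt2" tokenizer; any BPE-style tokenizer can exhibit the same effect).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

prompt = "The answer is"
choice = "n't obvious"

# Encode the two pieces separately, then encode the concatenation.
separate = tok.encode(prompt) + tok.encode(choice)
joined = tok.encode(prompt + choice)

# If " is" and "n't" merge into " isn" + "'t" at the boundary, the joined
# encoding is shorter than the sum of the separate encodings; slicing the
# logprobs at exactly prompt_len could then point at the wrong position,
# which is why a conservative offset is used.
print(len(separate), len(joined))  # e.g. 6 vs 5
```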
For your second question, I think you might be right. If the chat template is applied first and the result is then encoded, we need to avoid adding the BOS/EOS token a second time.
Finally, you're welcome to contribute and submit a PR. If there are any issues, we can continue the discussion here or in the PR you provide.
Thanks for the answers. For the first question, may I know if the -2 is meant to address tokenizers based on SentencePiece? I'm currently still getting the extra preceding token in `input_token_logprobs` here. The preceding token belongs to the end of the prompt, and its logprob value is being considered regardless of which `ChoicesSamplingMethod` I use. The problem goes away once I change it to `prompt_len - 1`. I am using a tokenizer based on Tiktoken.
Hi, any updates on this? I tried with the llama3.1-instruct-8b model as well, and the `input_token_logprobs` also has the extra preceding token which should not be there.
Checklist
Describe the bug
The `RuntimeEndpoint.select` method seems not to be slicing the correct part of the prompt, providing an incorrect subsequence of `input_token_logprobs` to the `choices_method` class. Specifically, in this line, having `prompt_len - 2` for `logprob_start_len` results in one preceding token being included in the `input_token_logprobs` provided to the `choices_method`. This is detrimental for all `ChoicesSamplingMethod` subclasses, specifically for `GreedyTokenSelection`, as it selects greedily based on the first logprob in the subsequence. May I know what the reason is for having `prompt_len - 2` instead of `prompt_len`? I'm curious whether this is indeed a bug or I just encountered an edge case.

Additionally, this line in both `TokenizerManager._handle_single_request` and `TokenizerManager._handle_batch_request` always adds a bos_token to the prompt. This may result in double BOS tokens, as some models, such as llama3.1 and MeetKai's Functionary models, already add a bos_token in the Jinja chat template.

llama3.1 functionary-small-v3.2
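For intuition (hypothetical numbers, not actual SGLang output), here is a toy sketch of how the extra preceding token defeats a greedy first-token comparison:

```python
# Hypothetical numbers, purely to illustrate the off-by-one effect.
# Suppose the logprob subsequence for each choice mistakenly starts one token
# early, so index 0 holds the logprob of the last prompt token (-0.1) for both.
choice_logprobs = {
    "yes": [-0.1, -2.3],  # [last prompt token, first token of "yes"]
    "no":  [-0.1, -1.1],  # [last prompt token, first token of "no"]
}

# A greedy rule that looks at the first logprob of each subsequence now
# compares the prompt token against itself (-0.1 vs -0.1) and cannot tell
# the choices apart; with the correct slice it would compare -2.3 vs -1.1.
pick = max(choice_logprobs, key=lambda c: choice_logprobs[c][0])
print(pick)  # decided by an arbitrary tie-break rather than the intended "no"
```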
I am working on my fork to fix both bugs currently. I can raise the PR if you guys think this should be fixed.
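For the double-BOS issue, one possible shape of the fix is sketched below. This is only an assumption about how it could look; the `maybe_add_bos` helper and its arguments are hypothetical and do not come from the actual `TokenizerManager` code:

```python
# Hedged sketch: only prepend the BOS token when the encoded prompt does not
# already start with it (e.g. because the chat template inserted one).
def maybe_add_bos(input_ids: list[int], tokenizer) -> list[int]:
    bos_id = getattr(tokenizer, "bos_token_id", None)
    if bos_id is None:
        return input_ids
    if input_ids and input_ids[0] == bos_id:
        return input_ids  # chat template already added BOS
    return [bos_id] + input_ids
```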
Reproduction
```python
import random

import rich
import sglang as sgl
from outlines.fsm.json_schema import build_regex_from_schema
from sglang.lang import choices
from sglang.lang.interpreter import ProgramState
from transformers import AutoTokenizer

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:8000"))
tokenizer = AutoTokenizer.from_pretrained("meetkai/functionary-small-v3.2")


@sgl.function
def generate_answer(s: ProgramState, messages: list, tools: list) -> None:
    """
    Generates an answer using a language model.
    """


def non_stream_response(state: ProgramState, model):
    tool_calls = []
    tool_index = 0
    while True:
        if f"recipient_{tool_index}" not in state:
            break
        recipient = state[f"recipient_{tool_index}"]
        content = state[f"content_{tool_index}"]
        if recipient != "all":
            id_characters = (
                "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
            )
            tool_calls.append(
                {
                    "id": "call_" + "".join(random.choices(id_characters, k=29)),
                    "function": {
                        "name": recipient,
                        "arguments": content,
                    },
                    "type": "function",
                }
            )
            content = None
        tool_index += 1


def main():
    messages = [
        {
            "role": "user",
            "content": "What is the respective weather for these 20 cities? Istanbul, Singapore, Beijing, Shanghai, London, Los Angeles, Hanoi, Moscow, Kuala Lumpur, Taipei, Berlin, Anchorage, Paris, Jakarta, Queenstown, Sydney, Auckland, Seoul, Tokyo, Toronto.",
        },
    ]
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get the current weather",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        }
                    },
                    "required": ["location"],
                },
            },
        }
    ]


if __name__ == "__main__":
    main()
```