openai / openai-openapi

OpenAPI specification for the OpenAI API
https://platform.openai.com/docs/api-reference/introduction

Apparent problem with Llama3 chat template #330

Open Lue-C opened 5 days ago

Lue-C commented 5 days ago

Hi,

I am sending requests to a llama-server using the OpenAI API. To compare the results, I also wrote the equivalent code in PyTorch without a server. I noticed that in the first case the text generation does not stop after giving an answer and keeps telling me about climate change. When running the corresponding PyTorch code, the generation stops appropriately and the quality of the answer is much better. This is the behaviour I would expect if there were an issue with the chat_template, but I am using the exact same message format I found in examples. This is the code in PyTorch:

import time
from threading import Thread

from transformers import TextIteratorStreamer

def respond() -> str:

    user_prompt = ""

    messages = [{"role": "system", "content": ''},
                {"role": "user", "content": user_prompt}]

    # Render the messages with the model's chat template and append the
    # assistant header so the model starts generating its answer.
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # Stream tokens as they are generated, hiding the prompt and special tokens.
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    generate_kwargs = dict(
        input_ids=model_inputs.input_ids,
        streamer=streamer,
        max_new_tokens=1024,
        # do_sample=True,
        temperature=0.01,
        eos_token_id=terminators,
    )

    # Run generation in a background thread so the streamer can be consumed here.
    thread = Thread(target=model.generate, kwargs=generate_kwargs)
    thread.start()

    generated_text = ""
    for chunk in streamer:
        generated_text += chunk
        print(chunk, end='', flush=True)

    end_time = time.time()

    return generated_text
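
For context, model, tokenizer and terminators above are defined elsewhere; the setup follows the usual Llama 3 pattern, roughly like this (the model id is only a placeholder):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder, any Llama 3 instruct checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Llama 3 ends an assistant turn with <|eot_id|> rather than only the plain EOS token,
# so both ids are passed to model.generate() as terminators.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]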

And this is the code using the OpenAI API:

import openai

def stream_document():
    # OpenAI client setup, pointed at the local llama-server
    client = openai.OpenAI(
        base_url="",  # API server URL
        api_key="sk-no-key-required"
    )
    user_prompt = ""
    messages = [{"role": "system", "content": ''},
                {"role": "user", "content": user_prompt}]

    response = client.chat.completions.create(
        # model="gpt-3.5-turbo",
        model="Llama3",
        messages=messages,
        stream=True,  # enable streaming
        temperature=0.01,
        max_completion_tokens=1024
    )
    # Process each chunk of data as it comes in
    for chunk in response:
        # Access the choices in the chunk
        for choice in chunk.choices:
            # Access the delta content within each choice
            if choice.delta and choice.delta.content:
                print(choice.delta.content, end='', flush=True)  # print content without newline
    print("\nStream finished.")

Is there some way to pass special tokens or to specify the chat template through the client?
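
For what it's worth, chat.completions.create does accept a stop parameter, so one workaround might be to pass the Llama 3 end-of-turn token explicitly; a sketch, assuming llama-server forwards stop sequences to the model:

response = client.chat.completions.create(
    model="Llama3",
    messages=messages,
    stream=True,
    temperature=0.01,
    max_completion_tokens=1024,
    # explicit stop sequence; <|eot_id|> is the Llama 3 end-of-turn marker
    stop=["<|eot_id|>"],
)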

Regards