LLama3 streaming repeats the previous request's first token.

mikutsky commented 5 months ago

Hi! When I stream responses, the first token of each earlier response is repeated at the start of every subsequent response. The prompt structure follows the Meta Llama 3 documentation. Could you explain why this is happening?

The output of a simple chat example looks like this:

The model name is meta/meta-llama-3-70b-instruct

You: Hi!
Assistant: Hi! How can I help you today?

You: Recommend me a Hemingway novel, please.
Assistant: Hi
I'd recommend "The Old Man and the Sea". It's a classic, concise, and powerful novel that showcases Hemingway's unique writing style.

You: I read it, please recommend something else.
Assistant: Hi
I
How about "A Farewell to Arms"? It's a romantic and tragic novel set during WWI, and it's considered one of Hemingway's best works.

You: It's great! Thank you! Bye!
Assistant: Hi
I
How
You're welcome! I'm glad you enjoyed the recommendation. Have a great day and happy reading! Bye!

Example code:

import os
from replicate.client import Client

replicate_api_key = os.getenv("REPLICATE_API_TOKEN", 'EMPTY')
replicate_model = os.getenv('REPLICATE_MODEL', 'meta/meta-llama-3-70b-instruct')
replicate_client = Client(api_token=replicate_api_key)

SYSTEM_PROMPT = 'You are a helpful assistant. Answer briefly!'
MESSAGES = []

def gen_llama3_prompt(sys_prompt=None, messages=None):
    sys_prompt = '' if sys_prompt is None else sys_prompt
    messages = [] if messages is None else messages
    _result = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{sys_prompt}<|eot_id|>"
    for m in messages:
        if m['role'] == 'user':
            _result += f'<|start_header_id|>user<|end_header_id|>\n\n{m["content"]}<|eot_id|>'
        elif m['role'] == 'assistant':
            _result += f'<|start_header_id|>assistant<|end_header_id|>\n\n{m["content"]}<|eot_id|>'
    _result += '<|start_header_id|>assistant<|end_header_id|>\n\n'
    return _result

def print_answer(query=''):
    message = {'role': 'user', 'content': query}
    answer = ''
    MESSAGES.append(message)
    for event in replicate_client.stream(
            replicate_model,  # use the configured model instead of a hard-coded name
            input={
                "top_p": 1e-5,
                "prompt": gen_llama3_prompt(SYSTEM_PROMPT, MESSAGES),
                "max_tokens": 512,
                "min_tokens": 0,
                "temperature": 1e-6
            }):
        token = str(event)
        answer += token
        print(token, end='')
    message = {'role': 'assistant', 'content': answer}
    MESSAGES.append(message)

if __name__ == '__main__':
    print(f'Model name is {replicate_model}')
    while True:
        q = input('\nYou: ')
        print('Assistant: ', end='')
        print_answer(q)
        if 'bye' in q.lower():
            break

Thanks for your help!

mattt commented 5 months ago

Hi @mikutsky. Thanks for reporting this. Can you share any predictions for these? (Go to your replicate.com Dashboard, look under Predictions). Seeing that would help us tell if the problem is in the model or the client library.

Gusakovskyi commented 5 months ago

Hi, have the same issue

mattt commented 5 months ago

@Gusakovskyi @mikutsky We've confirmed that there's an issue with stop sequences for meta/meta-llama-3-70b-instruct, and we're working on a fix.

mikutsky commented 5 months ago

It looks like a client library problem. I'm only providing the info for the second query, because the errors accumulate in the prompt on later queries.

Everything looks correct on the dashboard (screenshot attached).

Here is the prompt for the second query, and the prompt is still correct:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant. Answer briefly!<|eot_id|><|start_header_id|>user<|end_header_id|>

Hi!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi! How can I help you today?<|eot_id|><|start_header_id|>user<|end_header_id|>

I read it, please recommend something else.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

However, the console output contains an extra token, 'Hi':

You: >? Hi!
Assistant: Hi! How can I help you today?

You: >? I read it, please recommend something else.
Assistant: Hi
I'd be happy to! However, I need a bit more information. What type of content are you in the mood for? A book, article, podcast, or something else?
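
The claim that the prompt itself is correct can be checked offline, without calling the API. The following is a minimal sketch, assuming a compact restatement of gen_llama3_prompt from the example code; the name build_llama3_prompt and the variables history and expected are hypothetical and only used for illustration. It rebuilds the second-query prompt from the chat history and compares it with the prompt shown on the dashboard.

```python
def build_llama3_prompt(system_prompt, messages):
    """Render a system prompt and chat history in the Llama 3 instruct format.

    Hypothetical helper: a compact restatement of gen_llama3_prompt above.
    """
    def block(role, content):
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

    parts = ["<|begin_of_text|>", block("system", system_prompt)]
    parts += [
        block(m["role"], m["content"])
        for m in messages
        if m["role"] in ("user", "assistant")
    ]
    # Open the assistant turn so the model continues from here.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


# Chat history at the time of the second query, as reported in this comment.
history = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hi! How can I help you today?"},
    {"role": "user", "content": "I read it, please recommend something else."},
]

prompt = build_llama3_prompt("You are a helpful assistant. Answer briefly!", history)

# The prompt as shown on the dashboard for the second query.
expected = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "You are a helpful assistant. Answer briefly!<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "Hi!<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "Hi! How can I help you today?<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "I read it, please recommend something else.<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

print(prompt == expected)
```

If the comparison holds, the stray 'Hi' is not coming from the prompt builder, which is consistent with the problem being in how the stream is consumed.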

mattt commented 5 months ago

@mikutsky We just pushed a new build of the model, which should address the stop sequence problem. Please give your client code another try and let me know if that's working for you now.

If not, could you please try calling replicate.stream in isolation? I'd like to rule out the use of input and accessing mutable state in a loop, even though that should be running synchronously and not be a problem.
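
One possible way to run such an isolated check is sketched below. It assumes only that the streaming function is called as stream_fn(model, input={...}) and yields events that can be converted with str(), which matches how replicate_client.stream is used in the example code. The names first_events and fake_stream are hypothetical; fake_stream is an offline stand-in so the sketch runs without an API token.

```python
def first_events(stream_fn, model, prompts, max_events=3):
    """Stream each prompt independently and return the first few events of each.

    No input() calls and no shared message list are involved, so any token
    carried over from one response to the next must come from the streaming
    layer itself.
    """
    results = []
    for prompt in prompts:
        events = [
            str(event)
            for event in stream_fn(model, input={"prompt": prompt, "max_tokens": 64})
        ]
        results.append(events[:max_events])
    return results


def fake_stream(model, input):
    """Hypothetical offline stand-in for a streaming call; yields canned tokens."""
    canned = {
        "Hi!": ["Hi", "!", " How", " can", " I", " help", "?"],
        "Name a Hemingway novel.": ["The", " Old", " Man", " and", " the", " Sea"],
    }
    yield from canned[input["prompt"]]


demo = first_events(fake_stream, "fake/model", ["Hi!", "Name a Hemingway novel."], max_events=2)
print(demo)
```

To run it against the real service, pass replicate.stream (or replicate_client.stream) as stream_fn together with the model name. If the second list starts with the first token of the first response, the repetition reproduces without any surrounding chat loop.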

mattt commented 5 months ago

Actually, I'm able to reproduce this in isolation, so it does appear to be an issue with the client. Working on a fix now.

mattt commented 5 months ago

@mikutsky @Gusakovskyi Thanks again for reporting. This should be fixed by 0.25.2.

Please let me know if you continue to see this behavior.
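
To confirm that an environment has picked up the fixed release, the installed version can be compared against 0.25.2. This is a minimal sketch using only the standard library; the helper names are hypothetical, and it assumes the package is installed under the distribution name replicate and uses plain X.Y.Z version strings.

```python
import re
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = (0, 25, 2)  # release reported in this thread as containing the fix


def parse_version(text):
    """Parse the leading 'X.Y.Z' of a version string into a tuple of ints.

    Assumption: pre-release suffixes (e.g. 'rc1') are ignored, so a release
    candidate of the fixed version would be treated as fixed.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(group) for group in match.groups())


def has_streaming_fix(installed):
    """Return True if the given version string is at least FIXED_IN."""
    return parse_version(installed) >= FIXED_IN


def installed_replicate_version():
    """Return the installed version of the 'replicate' package, or None."""
    try:
        return version("replicate")
    except PackageNotFoundError:
        return None


current = installed_replicate_version()
if current is None:
    print("replicate is not installed")
else:
    print(f"replicate {current}: fix present = {has_streaming_fix(current)}")
```

If the check reports an older version, upgrading the package with pip should bring in the fix.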

mikutsky commented 5 months ago

Thanks a lot! It works!

replicate / replicate-python

LLama3 streaming repeats the previous request's first token. #287