replicate / cog-vicuna-13b

A template to run Vicuna-13B in Cog
https://replicate.com/replicate/llama-7b
Apache License 2.0
73 stars 19 forks

max_length parameter is not honored w/ Vicuna13-b #4

Closed rlancemartin closed 1 year ago

rlancemartin commented 1 year ago

Running the model:

import replicate

output = replicate.run(
    "replicate/vicuna-13b:e6d469c2b11008bb0e446c3e9629232f9674581224536851272c54871f84076e",
    input={"prompt": "Which NFL team won the Super Bowl when Justin Bieber was born? Think step by step.",
           "temperature": 0.75,
           "max_length": 500})

We specify max_length: 500, so a budget of 500 tokens.

But the output is 940 words, or roughly 940 * 2.5 tokens.
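
(For reference, one way to get that word count from the streamed output, assuming each chunk yielded by replicate.run is a plain string:)

text = "".join(output)             # concatenate the streamed chunks from the call above
print(len(text.split()), "words")  # rough whitespace-based word count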

dankolesnikov commented 1 year ago

@bfirsh @mattt @zeke please help!

mattt commented 1 year ago

Hi, @rlancemartin. This sounds more like an issue for the Vicuna-13B model than the Python client library itself, so I'm going to transfer it to that repo.

I just tried this myself using the web UI, and the output was within the expected range (using OpenAI's tokenizer, 308 tokens; I didn't try feeding it through the model's tokenizer, though).
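
(To reproduce that count in code rather than in the web tokenizer, something like this sketch works; tiktoken's cl100k_base encoding is only a stand-in and not the model's own tokenizer, and generated_text is a placeholder for the pasted output:)

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")        # proxy encoding, not the Vicuna/LLaMA tokenizer
generated_text = "..."                            # placeholder: paste the model's output here
print(len(enc.encode(generated_text)), "tokens")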

Can you share any more information to help us diagnose the problem?

/cc @replicate/models

daanelson commented 1 year ago

Hey @rlancemartin and @dankolesnikov. This is odd behavior; I can't reproduce it. When I run the model on Replicate, the output is always truncated once prompt_tokens + generated_tokens = max_tokens. Do you have a prediction UUID for a prediction on Replicate where this occurred that I can investigate?

rlancemartin commented 1 year ago

@daanelson thanks!

Now I understand that max_tokens = prompt_tokens + generated_tokens, from here.

I agree: this error is weird.

To reproduce, try this:

import replicate

output = replicate.run(
    "replicate/vicuna-13b:e6d469c2b11008bb0e446c3e9629232f9674581224536851272c54871f84076e",
    input={"prompt": "Which NFL team won the Super Bowl when Justin Bieber was born? Think step by step.",
           "temperature": 0.75,
           "max_length": 500})
for i in output:
    print(i)  # print each streamed chunk as it arrives

First run: I get an 86-word answer, which is < max_length, as expected.

Second run: I get a 1450-word answer, which exceeds max_length.

Third run: I get a 969-word answer, which exceeds max_length.

Strange!
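
Here's the same repro wrapped in a loop, so the run-to-run variation is easier to see (a sketch; same model version and inputs as above, just printing a word count per run):

import replicate

for run in range(3):
    output = replicate.run(
        "replicate/vicuna-13b:e6d469c2b11008bb0e446c3e9629232f9674581224536851272c54871f84076e",
        input={"prompt": "Which NFL team won the Super Bowl when Justin Bieber was born? Think step by step.",
               "temperature": 0.75,
               "max_length": 500})
    text = "".join(output)  # join the streamed chunks
    print(f"run {run + 1}: {len(text.split())} words")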

joehoover commented 1 year ago

@rlancemartin and @dankolesnikov, thanks for raising this!

I investigated this morning and traced the issue to our API. We're tracking it internally now and I'll let you know as soon as it's resolved.

joehoover commented 1 year ago

@rlancemartin and @dankolesnikov, it turns out the issue was specifically with the replicate Python client.

We have a fix in this branch and we'll have a release out soon!

If you don't want to wait for the release, you can just:

pip install git+https://github.com/replicate/replicate-python.git@fix-iterator-output

rlancemartin commented 1 year ago

Amazing! Will test it out :)

mattt commented 1 year ago

The fix @joehoover mentioned was merged in https://github.com/replicate/replicate-python/pull/106 and released in version 0.8.3.

Please take a look and let us know if you're still seeing this behavior.
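
If you want to confirm which client version you're on after upgrading, a quick check (a sketch using importlib.metadata):

from importlib.metadata import version

print(version("replicate"))  # should report 0.8.3 or later once the fix is installed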

"This sounds more like an issue for the Vicuna-13B model than the Python client library itself"

🙃

Really glad that we identified and (hopefully) addressed the problem. Thank you for opening this issue, @rlancemartin, and thanks to @joehoover, @daanelson, and @evilstreak for their quick response.

zeke commented 1 year ago

Late to the party. Nice work, y'all! Can we call this done?

rlancemartin commented 1 year ago

Yes, let's close it out. I'll share the full analysis soon, but from my initial inspection the issue looks fixed :)

rlancemartin commented 1 year ago

BTW, great results in terms of answer quality w/ this model!

I'm using LangChain auto-evaluator to benchmark it.

But something is odd w/ latency: the first call to the model is quite slow (> 100s), but follow-up calls are fast (< 10s).

Ticket here: https://github.com/replicate/cog-vicuna-13b/issues/7

I can reproduce this behavior, so it's not a one-off. Strange.
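
For what it's worth, this is roughly how I'm timing it (a sketch; same model version and inputs as above, wrapped with time.monotonic):

import time
import replicate

for attempt in range(3):
    start = time.monotonic()
    output = replicate.run(
        "replicate/vicuna-13b:e6d469c2b11008bb0e446c3e9629232f9674581224536851272c54871f84076e",
        input={"prompt": "Which NFL team won the Super Bowl when Justin Bieber was born? Think step by step.",
               "temperature": 0.75,
               "max_length": 500})
    text = "".join(output)  # consume the whole stream before stopping the clock
    print(f"call {attempt + 1}: {time.monotonic() - start:.1f}s, {len(text.split())} words")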

@joehoover or others any ideas?

Result here: [image]