@bfirsh @mattt @zeke please help!
Hi, @rlancemartin. This sounds more like an issue for the Vicuna-13B model than the Python client library itself, so I'm going to transfer to that repo.
I just tried this myself using the web UI, and the output was within the expected range (using OpenAI's tokenizer, 308 tokens; I didn't try feeding it through the model's tokenizer, though).
Can you share any more information to help us diagnose the problem?
/cc @replicate/models
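(For reference, a minimal sketch of counting tokens with OpenAI's tiktoken library; the cl100k_base encoding is an assumption, and Vicuna's own tokenizer would give somewhat different counts.)

import tiktoken

# Rough token count using OpenAI's tokenizer as a proxy.
# cl100k_base is an assumption; Vicuna uses the LLaMA tokenizer, so counts will differ a bit.
enc = tiktoken.get_encoding("cl100k_base")
output_text = "..."  # paste the model's answer here
print(len(enc.encode(output_text)), "tokens")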
hey @rlancemartin and @dankolesnikov. This is odd behavior; I can't reproduce it. When I run the model on Replicate, the output is always truncated when prompt_tokens + generated_tokens = max_tokens. Do you have a prediction UUID for a prediction on Replicate where this occurred that I can investigate?
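(For concreteness, a toy sketch of that rule with made-up numbers; the function name is illustrative, not part of the API.)

def remaining_budget(prompt_tokens, max_tokens):
    # Tokens the model can still generate before hitting the cap.
    return max(max_tokens - prompt_tokens, 0)

# e.g. a 25-token prompt with max_length=500 leaves roughly 475 tokens of generation room
print(remaining_budget(25, 500))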
@daanelson thanks!
now i understand max_tokens = prompt_tokens + generated_tokens from here.
i agree: this error is weird.
to reproduce, try this:
import replicate

output = replicate.run(
    "replicate/vicuna-13b:e6d469c2b11008bb0e446c3e9629232f9674581224536851272c54871f84076e",
    input={"prompt": "Which NFL team won the Super Bowl when Justin Bieber was born? Think step by step.",
           "temperature": 0.75,
           "max_length": 500},
)
for i in output:
    print(i)
first run - i get an 86 word answer, which is < max_length as expected.
second run - i get a 1450 word answer, which exceeds max_length.
third run - i get a 969 word answer, which exceeds max_length.
strange!
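(For reference, a rough way to measure the output length from the stream, assuming the streamed chunks concatenate into the full answer.)

import replicate

output = replicate.run(
    "replicate/vicuna-13b:e6d469c2b11008bb0e446c3e9629232f9674581224536851272c54871f84076e",
    input={"prompt": "Which NFL team won the Super Bowl when Justin Bieber was born? Think step by step.",
           "temperature": 0.75,
           "max_length": 500},
)
text = "".join(output)  # join the streamed string chunks into the full answer
print(len(text.split()), "words")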
@rlancemartin and @dankolesnikov, thanks for raising this!
I investigated this morning and traced the issue to our API. We're tracking it internally now and I'll let you know as soon as it's resolved.
@rlancemartin and @dankolesnikov, it turns out the issue was specifically with the Replicate Python client.
We have a fix in this branch and we'll have a release out soon!
If you don't want to wait for the release, you can just:
pip install git+https://github.com/replicate/replicate-python.git@fix-iterator-output
Amazing! Will test it out :)
The fix @joehoover mentioned was merged in https://github.com/replicate/replicate-python/pull/106 and released in version 0.8.3.
Please take a look and let us know if you're still seeing this behavior.
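(A quick way to check which client version is installed, using only the Python standard library.)

from importlib.metadata import version

# Should print 0.8.3 or later once the fixed release is installed.
print(version("replicate"))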
"This sounds more like an issue for the Vicuna-13B model than the Python client library itself"
🙃
Really glad that we identified and (hopefully) addressed the problem. Thank you for opening this issue, @rlancemartin and thanks to @joehoover, @daanelson, and @evilstreak for their quick response.
Late to the party. Nice work, y'all! Can we call this done?
Yes let's close it out. I'll share full analysis soon, but from my initial inspection the issue looks fixed :)
BTW, great results in terms of answer quality w/ this model!
I'm using LangChain auto-evaluator to benchmark it.
But something is odd w/ latency: the first call to the model is quite slow (> 100s), but follow-up calls are fast (< 10s).
Ticket here: https://github.com/replicate/cog-vicuna-13b/issues/7
I can reproduce this behavior, so it's not a one-off. Strange.
@joehoover or others any ideas?
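(A rough sketch for reproducing the latency measurement; the prompt and parameters are illustrative, not the benchmark settings.)

import time
import replicate

model = "replicate/vicuna-13b:e6d469c2b11008bb0e446c3e9629232f9674581224536851272c54871f84076e"

# Time a few consecutive calls; the first one may include cold-start/boot time.
for i in range(3):
    start = time.time()
    list(replicate.run(model, input={"prompt": "Say hello.", "max_length": 50}))  # drain the output iterator
    print(f"call {i + 1}: {time.time() - start:.1f}s")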
Result here:
Running the model: we specify max_length: 500, so 500 tokens. But the output is 940 words, or 940 * 2.5 tokens.