Closed: rlancemartin closed this issue 1 year ago
@bfirsh @mattt @zeke please help!
Hi, @rlancemartin. It sounds like this is more of an issue for the Vicuna-13B model than the client library, so I'm going to transfer it to that repository.
> `max_new_tokens` is not specified on the web UI, but I run it with the API using the same prompt.
The inputs for a model on Replicate are defined by the `predict()` function parameters in the Cog model source code. Passing values for parameters not defined by the model either has no effect or may cause a validation error.
The message you saw in the logs was generated by a function called internally by the `predict` function, and the `max_new_tokens` it references is a parameter of that internal method, not an input of the model itself. If you want the Replicate model to behave differently, you can either fork and publish your own version of the model or try to get those changes merged into the upstream repository and published.
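For illustration, a Cog predictor declares its inputs roughly like this; this is a sketch, not the actual Vicuna-13B source, and the parameter names and defaults here are assumptions:

```python
# Hypothetical sketch of a Cog predictor; not the actual vicuna-13b source.
from cog import BasePredictor, Input


class Predictor(BasePredictor):
    def predict(
        self,
        prompt: str = Input(description="Prompt to send to the model."),
        max_length: int = Input(
            description="Maximum number of tokens to generate (prompt + output).",
            default=500,  # assumed default, for illustration only
        ),
    ) -> str:
        # Only the parameters declared above are accepted as API inputs.
        # An internal generation helper may take max_new_tokens, but since it
        # isn't declared here, callers can't set it directly.
        ...
```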
/cc @replicate/models
hey @rlancemartin and @dankolesnikov. There's a mistake in our documentation for `max_length`: `max_length` is the maximum length of the prompt + the output for a given generation. You can see the documentation of the underlying HuggingFace method here. `max_new_tokens` is another param that the HF API can use for generation; we're not exposing that parameter at the moment, though we can in the future. Will update the docs, sorry about the confusion!
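Roughly, the distinction at the Hugging Face level looks like this (the model path below is a placeholder, not necessarily what we run in production):

```python
# Illustration of max_length vs. max_new_tokens in transformers generate();
# the model path is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/vicuna-13b-hf")
model = AutoModelForCausalLM.from_pretrained("path/to/vicuna-13b-hf")

inputs = tokenizer("Summarize the following document: ...", return_tensors="pt")

# max_length counts prompt tokens + generated tokens, so a long prompt can
# leave little or no room for output unless max_length is raised.
out_total = model.generate(**inputs, max_length=2000)

# max_new_tokens counts only the newly generated tokens, independent of
# how long the prompt is.
out_new = model.generate(**inputs, max_new_tokens=512)

print(tokenizer.decode(out_new[0], skip_special_tokens=True))
```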
re: `n_ctx` - Vicuna was trained with a context window of 2048 tokens, so that's the context we provide with this API. Generally speaking, providing inputs longer than a model's context window will result in poor-quality output: even if you have enough compute to run the model, since it wasn't trained on that much context, it likely won't perform well.
Llama.cpp does allow you to increase the context window, but it indicates here that you should expect poor results when doing so.
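For reference, with the llama-cpp-python bindings the context window is just a constructor argument (the model path below is a placeholder):

```python
# Sketch with llama-cpp-python: n_ctx sets the context window. Values above
# 2048 will load, but Vicuna wasn't trained with that much context, so expect
# degraded output. The model path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="./models/vicuna-13b.ggml.bin", n_ctx=2048)
result = llm("Q: What is the capital of France? A:", max_tokens=64)
print(result["choices"][0]["text"])
```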
@mattt thanks!
> `max_length` is the maximum length of the prompt + the output for a given generation
@daanelson ah, got it! makes sense. that is what i thought it might be :)
> re: `n_ctx` - Vicuna was trained with a context window of 2048 tokens, so that's the context we provide with this API.
makes sense!
perfect.
Locally, I run Vicuna-13B with the LlamaCpp bindings, specifying a context window (`n_ctx`) of `2048` tokens. This works with my `prompt`, which is 314 words (314 * 2.5 tokens).
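The local setup looks roughly like this (shown here with the LangChain `LlamaCpp` wrapper as an assumption; the model path is a placeholder):

```python
# Rough sketch of the local run, assuming the LangChain LlamaCpp wrapper
# around llama-cpp-python; the model path is a placeholder.
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./models/ggml-vicuna-13b.bin",  # placeholder path
    n_ctx=2048,      # context window
    max_tokens=256,  # tokens to generate
)

prompt = "..."  # the ~314-word prompt
print(llm(prompt))
```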
Using the web UI, I also try the same `prompt` and I see a message in the logs referencing `max_new_tokens`. `max_new_tokens` is not specified on the web UI, but I run it with the API using the same prompt. I still get no output.
However, I bump `max_length` to `2000` on the web UI, and it does work.
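Via the API, the call looks roughly like this (the model version hash is omitted here and would need to be filled in from the model page; treat this as a sketch):

```python
# Sketch of the equivalent call with the Replicate Python client; the version
# hash is a placeholder.
import replicate

prompt = "..."  # the same ~314-word prompt

output = replicate.run(
    "replicate/vicuna-13b:<version-hash>",
    input={"prompt": prompt, "max_length": 2000},
)
# Output may come back as a list/iterator of string chunks.
print("".join(output))
```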
Why does the "Maximum number of tokens to generate" need to be increased in order to process a larger prompt? And can `n_ctx=2048` simply be specified?