replicate / cog-vicuna-13b

A template to run Vicuna-13B in Cog
https://replicate.com/replicate/llama-7b
Apache License 2.0

You should consider increasing `max_new_tokens` error #3

Closed rlancemartin closed 1 year ago

rlancemartin commented 1 year ago

Locally, I run Vicuna-13B with the LlamaCpp bindings:

# LLAMA_CPP_PATH and callback_manager are defined earlier in my script.
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path=LLAMA_CPP_PATH + "models/vicuna_13B/ggml-vicuna-13b-4bit.bin",
    callback_manager=callback_manager,
    verbose=True,
    n_threads=6,
    n_ctx=2048,       # context window of 2048 tokens
    use_mlock=True)   # keep the model weights locked in RAM

I set the context window, n_ctx, to 2048 tokens.

This works with my prompt, which is 314 words (roughly 314 * 2.5 tokens).

Using the web UI, I also try the same prompt.

I see:

Running predict()...
Input length of input_ids is 583, but `max_length` is set to 500. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.

max_new_tokens is not specified on the web UI, so I run the same prompt through the API instead.

import replicate
output = replicate.run(
    "replicate/vicuna-13b:e6d469c2b11008bb0e446c3e9629232f9674581224536851272c54871f84076e",
    input={"prompt": prompt, 
           "temperature":0.75,
           "max_new_tokens":1000,
           "max_length":500})
for message in output:
    print(message)

I still get no output.

However, when I bump max_length to 2000 on the web UI, it does work.

Why does the "Maximum number of tokens to generate" setting need to be increased in order to process a larger prompt?

And can n_ctx=2048 simply be specified?

dankolesnikov commented 1 year ago

@bfirsh @mattt @zeke please help!

mattt commented 1 year ago

Hi, @rlancemartin. It sounds like this is more of an issue for the Vicuna-13B model than the client library, so I'm going to transfer it to that repository.

max_new_tokens is not specified on the web UI, so I run the same prompt through the API instead.

The inputs for a model on Replicate are defined by the predict() function parameters in the Cog model source code. Passing values for parameters not defined by the model either has no effect, or may cause a validation error.
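
For illustration, here's a minimal sketch of how a Cog predictor declares its inputs; the parameter names and defaults below are assumptions, not the actual vicuna-13b source:

from cog import BasePredictor, Input

class Predictor(BasePredictor):
    def predict(
        self,
        prompt: str = Input(description="Prompt to send to the model"),
        max_length: int = Input(description="Maximum number of tokens (prompt + generated output)", default=500),
        temperature: float = Input(description="Sampling temperature", default=0.75),
    ) -> str:
        # Only the parameters declared here become API inputs; anything else
        # passed to replicate.run() (e.g. max_new_tokens) either has no effect
        # or fails validation.
        ...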

The message you saw in the logs was generated by a function called internally by predict(), and the max_new_tokens it references is a parameter of that internal method, not an input to the model itself. If you want the Replicate model to behave differently, you can either fork and publish your own version of the model or try to get those changes merged into the upstream repository and published.

/cc @replicate/models

daanelson commented 1 year ago

Hey @rlancemartin and @dankolesnikov. There's a mistake in our documentation for max_length: max_length is the maximum length of the prompt + the output for a given generation. You can see the documentation of the underlying HuggingFace method here.

max_new_tokens is another parameter that the HF API can use for generation; we're not exposing it at the moment, though we may in the future. We'll update the docs - sorry about the confusion!
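
To make the distinction concrete, here's a rough sketch using the HuggingFace generate() API (the checkpoint name is illustrative, not necessarily what this Cog model loads):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-13b-v1.1")  # illustrative checkpoint
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-13b-v1.1")

long_prompt = "..."  # stand-in for the ~583-token prompt from the logs above
inputs = tokenizer(long_prompt, return_tensors="pt")

# max_length caps prompt + generated tokens, so a 583-token prompt leaves no
# room to generate when max_length=500 and triggers the warning in the logs.
model.generate(**inputs, max_length=500)

# max_new_tokens caps only the newly generated tokens, regardless of prompt length.
model.generate(**inputs, max_new_tokens=1000)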

re: n_ctx - Vicuna was trained with a context window of 2048 tokens, so that's the context we provide with this API. Generally speaking, providing inputs longer than a model's context window will result in poor-quality output: even if you have enough compute to run the model, it wasn't trained on that much context, so it likely won't perform well.

Llama.cpp does allow you to increase the context window, but it indicates that you should expect poor results when doing so here.
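
As a sketch (assuming the llama-cpp-python bindings and the same local model file as above), raising the context window looks like this, with the quality caveat just mentioned:

from llama_cpp import Llama

llm = Llama(
    model_path="models/vicuna_13B/ggml-vicuna-13b-4bit.bin",
    n_ctx=4096,  # larger than the 2048-token training context; expect degraded output
)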

rlancemartin commented 1 year ago

@mattt thanks!

max_length is the maximum length of the prompt + the output for a given generation

@daanelson ah, got it! Makes sense - that's what I thought it might be :)

re: n_ctx - Vicuna was trained with a context window of 2048 tokens, so that's the context we provide with this API.

makes sense!

perfect.