tutankhamen-1 opened 1 year ago
It can encode 2K tokens and output 2K tokens, for a total of 4K tokens. But it cannot take in 4K tokens alone. @tutankhamen-1 In contrast, a Llama-like model shares 2K tokens between its input and output combined.
That’s great, but the 2K total limit seems to be hardcoded in many places and I can’t get it to work. I’m trying to use it through the API.
The current behavior should be correct. It can only encode 2K tokens, which is what the hardcoded limits you see refer to, but it can output another 2K tokens on top of that. If you use Llama (Vicuna), it can also encode 2K tokens, but if you give it a full 2K-token prompt, there is no room left in the context window for it to output anything.
This is the error message I get:
This model's maximum context length is 2048 tokens. However, you requested 2302 tokens (1790 in the messages, 512 in the completion). Please reduce the length of the messages or completion.
Model: fastchat-t5-3b-v1.0
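The two accounting rules being debated here can be sketched as follows. This is a hypothetical helper for illustration, not FastChat's actual validation code; `fits_in_context` and its parameters are made up, with `context_len=2048` taken from the error message above:

```python
def fits_in_context(prompt_tokens: int, max_new_tokens: int,
                    context_len: int = 2048,
                    encoder_decoder: bool = True) -> bool:
    """Check whether a request fits the model's context budget.

    Encoder-decoder models (T5): the encoder input and the decoder
    output each get their own window of `context_len` tokens.
    Decoder-only models (Llama/Vicuna): prompt and completion share
    a single window of `context_len` tokens.
    """
    if encoder_decoder:
        return prompt_tokens <= context_len and max_new_tokens <= context_len
    return prompt_tokens + max_new_tokens <= context_len

# The request from the error above: 1790 prompt tokens + 512 completion tokens.
print(fits_in_context(1790, 512, encoder_decoder=True))   # T5 rule: True (fits)
print(fits_in_context(1790, 512, encoder_decoder=False))  # shared-window rule: False
```

Under this sketch, the 2302-token request would pass the T5 rule but fail the shared-window rule, which is why the error message reads as if the decoder-only check is being applied to T5.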
@tutankhamen-1 Thanks for letting us know! We will fix it. @merrymercy Let's change the error message for T5?
Isn't this limit somewhat arbitrary for T5, given its attention mechanism? My understanding is that memory grows quadratically as the context length goes up, but as long as you have the RAM to support it, a longer context length isn't limited by the model itself.
@tutankhamen-1 Could you help us fix the bug and contribute a pull request?
lmsys.org states that FastChat-T5 supports a context size of 4K. How do I get it to work? I get an error as soon as I go above 2K.