Closed: tomaarsen closed this pull request 9 months ago
@tomaarsen Cool, thanks! It's very late here, so I'll try it when I wake up.
I added your project here: https://github.com/h2oai/h2ogpt
Thanks for the wonderful project!
Yes, I'm aware it doesn't extend the context for input tokens.
One gotcha was that I didn't realize the window size had to be >= the input token count. It makes sense; the failure just wasn't clear.
If possible, it would be nice if the window size automatically adjusted to the input token count, instead of always having to keep it at the maximum just in case the model is used with that many tokens.
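For reference, a rough sketch of that workaround, sizing the window from the tokenized prompt. It assumes the `attention_sinks` drop-in `AutoModelForCausalLM` with the `attention_sink_size` / `attention_sink_window_size` keyword arguments from the project's README; the model name and document are placeholders:

```python
from transformers import AutoTokenizer
from attention_sinks import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model
long_document = "..."  # placeholder for the ~3700-token input

# Count the prompt tokens, then grow the window to cover them,
# keeping the default 252 as a floor.
tokenizer = AutoTokenizer.from_pretrained(model_name)
n_input_tokens = tokenizer(long_document, return_tensors="pt")["input_ids"].size(1)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attention_sink_size=4,
    attention_sink_window_size=max(252, n_input_tokens),
)
```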
> One gotcha was that I didn't realize the window size had to be >= the input token count. It makes sense; the failure just wasn't clear.
After this PR, the window size can be less than the input token count, though the excess tokens beyond the window size will be removed as the model generates. It's actually quite normal for tokens to be removed; this is how the memory usage is kept so low while the model stays fluent.
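Conceptually, the eviction keeps the attention-sink tokens plus the most recent window and drops everything in between. An illustrative sketch only, not the library's actual cache code:

```python
import torch

def evict_past(past: torch.Tensor, sink_size: int = 4, window_size: int = 256) -> torch.Tensor:
    # past: key or value states shaped [batch, heads, seq_len, head_dim]
    seq_len = past.size(2)
    if seq_len <= window_size:
        return past  # nothing to evict yet
    recent = window_size - sink_size
    # Keep the first `sink_size` tokens and the most recent `recent` tokens.
    return torch.cat([past[:, :, :sink_size], past[:, :, -recent:]], dim=2)
```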
And it's very exciting to see this project included in h2oGPT! I'll try to find some time later to play around with it 😄
Yes, thanks! It runs my test without any failure even though the window size is only 252.
Awesome! Thanks for confirming :)
Resolves #22
Hello!
### Pull Request overview
- Shrink the `attention_mask` if it's larger than the cache.

### Details
The `attention_mask` in `transformers` lives under the assumption that it can only grow. Makes sense: we only add a new token on every new model forward call. However, that isn't the case with `attention_sinks`. The first forward call will parse the entire input text, and each subsequent one will process just one token, with the history in the cache. However, if the cache is smaller than the input size, the `attention_mask` will still have the input size, e.g. `[1, 3700]`, while the history + key size is just 257.
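A hedged sketch of the idea behind the fix, with illustrative names rather than the exact patch: when the mask outgrows the cache, keep only the most recent positions so the mask matches what is actually cached.

```python
import torch

def shrink_attention_mask(attention_mask: torch.Tensor, cache_len: int) -> torch.Tensor:
    # attention_mask: [batch, mask_len], e.g. [1, 3700]; cache_len e.g. 257
    if attention_mask.size(1) > cache_len:
        # Drop the oldest positions so the mask lines up with the cache.
        attention_mask = attention_mask[:, -cache_len:]
    return attention_mask
```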
### Tests
I ran the script from #22 and got the following output:
cc: @pseudotensor I do want to point out that `attention_sinks` doesn't extend the context size of a model: if the window size is 256 tokens (i.e. 252 window size and 4 sink tokens), then the model won't be able to use the full 3700 tokens of the document when generating its output. You likely know this, but I'm just making sure.