A similar thing happens if I use model_id = "h2oai/h2ogpt-4096-llama2-7b-chat".
If I set attention_sink_window_size=4096, then it doesn't fail for Mistral. Do I have to set the window size larger than or equal to the input token size?
For MPT it still fails with some other error about 2048 vs. the input token size, so maybe MPT is not fully compatible.
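For context, the setting under discussion is the `attention_sink_window_size` kwarg on the drop-in loader from `attention_sinks`. A minimal, hedged sketch of using it — the model id and the `attention_sink_size` value are my assumptions, not taken from this thread:

```python
# Sketch only: load Mistral through attention_sinks with a 4096-token sliding window.
# "mistralai/Mistral-7B-v0.1" and attention_sink_size=4 are assumptions.
from transformers import AutoTokenizer
from attention_sinks import AutoModelForCausalLM

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    attention_sink_size=4,            # initial "sink" tokens kept in the cache
    attention_sink_window_size=4096,  # recent tokens kept in the sliding window
)
```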
If I set attention_sink_window_size=4096, then it doesn't fail for Mistral. Do I have to set the window size larger than or equal to the input token size?
That was my initial intuition, also because 4096 is likely the window size you want if the model should be able to use the last 4k tokens in memory. However, it shouldn't throw a CUDA indexing error either way - I'm looking into it now.
For MPT, you have to edit the configuration if you want to use anything over 2048 tokens: https://github.com/tomaarsen/attention_sinks/blob/607d8304c9383447fd8f79efed676a8c0651e0d5/benchmark/scripts/benchmark_mpt.sh#L6-L16
That said, I'm not sure if MPT can reasonably process sequences longer than 2048; I think the model implodes after 2048 tokens, but perhaps not with attention_sinks? Definitely worth a try.
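For reference, one way to lift that limit programmatically before loading the model. This is a hedged sketch, not the code from the linked script; it assumes the config field is named `max_seq_len`, as in the MPT configurations on the Hub, and depending on your transformers version you may also need `trust_remote_code=True`:

```python
# Sketch only: override MPT's 2048-token limit via the config before loading.
# "mosaicml/mpt-7b" and the max_seq_len field name are assumptions.
from transformers import AutoConfig
from attention_sinks import AutoModelForCausalLM

model_id = "mosaicml/mpt-7b"
config = AutoConfig.from_pretrained(model_id)
config.max_seq_len = 8192  # lift the 2048 default so longer inputs are accepted

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    device_map="auto",
    attention_sink_size=4,
    attention_sink_window_size=1020,
)
```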
I actually get a different error when running your code:
Traceback (most recent call last):
File "[sic]\attention_sinks\issue_22.py", line 412, in <module>
generated_tokens = model.generate(
File "[sic]\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "[sic]\transformers\src\transformers\generation\utils.py", line 1658, in generate
return self.greedy_search(
File "[sic]\transformers\src\transformers\generation\utils.py", line 2506, in greedy_search
outputs = self(
File "[sic]\\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "[sic]\transformers\src\transformers\models\mistral\modeling_mistral.py", line 1048, in forward
outputs = self.model(
File "[sic]\\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "[sic]\attention_sinks\attention_sinks\inject_mixin.py", line 131, in wrapped_forward
outputs = old_forward(*args, **kwargs)
File "[sic]\transformers\src\transformers\models\mistral\modeling_mistral.py", line 891, in forward
attention_mask = self._prepare_decoder_attention_mask(
File "[sic]\transformers\src\transformers\models\mistral\modeling_mistral.py", line 813, in _prepare_decoder_attention_mask
expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
RuntimeError: The size of tensor a (3817) must match the size of tensor b (3818) at non-singleton dimension 3
Will dig into this deeper.
Edit: This is probably because I'm using the wrong transformers version. My bad.
@pseudotensor Perhaps you can experiment with:
pip install git+https://github.com/tomaarsen/attention_sinks.git@hotfix/long_input_seq
I'll do some more tests of my own later.
Thanks! I should clarify I'm using transformers==4.34.1 -- I had upgraded just in case it would help with the failure, but it didn't change anything.
I'll check in the morning w.r.t. the PR.
Hi, just a quick comment. I see the same error with Mistral: when I use attention_sink_window_size=2300 it works, and with attention_sink_window_size=2200 it fails with the out-of-bounds error. Since Mistral has a 4096 sliding window, the error could be related to a different issue.
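To help pin down where that boundary is, a sweep over window sizes could be useful. A hedged sketch with a placeholder model id and prompt (note that a hard CUDA device-side assert can poison the process, so in practice running each size in a fresh process is safer):

```python
# Sketch only: rerun the same generation with different attention_sink_window_size
# values to find the smallest one that still works. Model id and prompt are placeholders.
import torch
from transformers import AutoTokenizer
from attention_sinks import AutoModelForCausalLM

model_id = "mistralai/Mistral-7B-v0.1"
prompt = "..."  # the long input that triggers the error
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer(prompt, return_tensors="pt")

for window_size in (2200, 2300, 4096):
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        torch_dtype=torch.float16,
        attention_sink_size=4,
        attention_sink_window_size=window_size,
    )
    try:
        model.generate(**inputs.to(model.device), max_new_tokens=256)
        print(f"window_size={window_size}: OK")
    except RuntimeError as exc:
        print(f"window_size={window_size}: failed with {exc}")
```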
@FrankEssenberger When using the branch from #23, the main branch, or the latest release?
Also, do you know roughly your input data length? That could also be related, e.g. if the input is 2250 tokens or so.
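(As an aside, a quick way to check the input length in tokens; the model id here is an assumption and `prompt` stands in for the actual input:)

```python
# Sketch only: count how many tokens the input actually is.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
prompt = "..."  # the actual input text
print(len(tokenizer(prompt).input_ids))
```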
I commented in the PR, but in short: the same inference code, when using that PR, hit no error. Thanks!
@FrankEssenberger When using the branch from #23, the main branch, or the latest release? Also, do you know roughly your input data length? That could also be related, e.g. if the input is 2250 tokens or so.
Sorry, I was on holiday - it worked with the latest version of the code. Thanks.
Tried this: https://github.com/tomaarsen/attention_sinks/issues/1#issuecomment-1745792500
The idea in the repro below is to use a longer context and still continue to generate outside the normal context size. Mistral actually does this to some extent already without attention sinks, but attention sinks just fails. The input is about 3817 tokens.
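The original repro is in the linked comment; as a rough, hedged sketch of that pattern (long input, then generation well past the normal context size), with a placeholder model id, prompt, and lengths:

```python
# Sketch only: feed a long prompt (~3817 tokens in the original report) and keep
# generating past the model's normal context size. All identifiers are placeholders.
import torch
from transformers import AutoTokenizer
from attention_sinks import AutoModelForCausalLM

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    attention_sink_size=4,
    attention_sink_window_size=1024,
)

long_prompt = "..."  # roughly 3817 tokens of input in the original report
inputs = tokenizer(long_prompt, return_tensors="pt").to(model.device)

# Generate enough new tokens to push the total well past the usual context size.
generated = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
print(tokenizer.decode(generated[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
```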
It fails with:
This is on a 4×A6000 (48GB each) system, and each GPU is only using about 17%. A single GPU does similar:
or sometimes like: