rossbm opened this issue 3 weeks ago [Open]
I've tried running on another VM where I've installed flash_attn, but I'm still getting the error. Maybe the issue is that the slicing tokens aren't being applied to the attention mask.
From https://github.com/unslothai/unsloth/blob/main/unsloth/models/llama.py
if sliding_window is not None and kv_seq_len > sliding_window:
    # From https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/modeling_mistral.py#L193
    slicing_tokens = 1 - sliding_window
    Knn = Kn[:, :, slicing_tokens:, :]#.contiguous()
    Vnn = Vn[:, :, slicing_tokens:, :]#.contiguous()
While in https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/modeling_phi3.py and https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/modeling_mistral.py we have:
if attention_mask is not None:
    attention_mask = attention_mask[:, slicing_tokens:]
    attention_mask = torch.cat([attention_mask, torch.ones_like(attention_mask[:, -1:])], dim=-1)
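To make the suspected mismatch concrete, here is a small standalone snippet (toy shapes and values, not unsloth's actual tensors) showing that trimming only the K/V cache leaves the attention mask at its original length, while the transformers-style mask slicing keeps the two in sync:

import torch

sliding_window = 2048
kv_seq_len = 2955                                 # the 2955-token batch from the original report
Kn = torch.randn(1, 1, kv_seq_len, 8)             # (batch, heads, seq, head_dim), toy sizes
attention_mask = torch.ones(1, kv_seq_len)

slicing_tokens = 1 - sliding_window               # -2047
Knn = Kn[:, :, slicing_tokens:, :]                # K/V cache trimmed to the last 2047 positions
print(Knn.shape[2], attention_mask.shape[1])      # 2047 vs 2955 -> lengths no longer line up

# What the transformers code does in addition (and what appears to be missing here):
attention_mask = attention_mask[:, slicing_tokens:]
attention_mask = torch.cat([attention_mask, torch.ones_like(attention_mask[:, -1:])], dim=-1)
print(attention_mask.shape[1])                    # 2048: the 2047 cached positions plus the new token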
@rossbm Many apologies for the delay - my bro and I just relocated to SF - and appreciate the investigation as well!
I shall check if I'm doing inference on SWAs correctly :) Thanks for the report!
I've been finetuning unsloth/Phi-3-mini-4k-instruct-bnb-4bit with a T4, which doesn't support flash attention, so I don't have it installed.
During evaluation, I've been running into a RuntimeError: The expanded size of the tensor (2047) ... error.
The batch that is being evaluated at this point has 2955 tokens. However, unsloth/Phi-3-mini-4k-instruct-bnb-4bit should support sequence lengths of 4096 tokens, and I make certain to set max_seq_length to 4096 when initializing the model.
Looking through the model config for unsloth/Phi-3-mini-4k-instruct-bnb-4bit, I see "sliding_window": 2048, which is the only place a length of 2048 (or 2047) could be coming from.
In https://github.com/unslothai/unsloth/blob/933d9fe2cb2459f949ee2250e90a5b610d277eab/unsloth/models/llama.py#L189, we have:
if sliding_window is not None and kv_seq_len > sliding_window:
However, in https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/modeling_phi3.py, there is a check that flash_attn is installed and supports a sliding window before the sliding-window path is used.
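Roughly, that gating looks like the following (paraphrased, not copied verbatim from modeling_phi3.py; the real code also handles flash_attn not being importable at all):

import inspect

try:
    from flash_attn import flash_attn_func
    _flash_supports_window_size = "window_size" in list(
        inspect.signature(flash_attn_func).parameters
    )
except ImportError:
    _flash_supports_window_size = False

# ... later, inside the attention forward:
use_sliding_windows = (
    _flash_supports_window_size
    and getattr(config, "sliding_window", None) is not None
    and kv_seq_len > config.sliding_window
)

So the sliding-window path is only taken when flash-attn both exists and accepts a window_size argument.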
Sure enough, when I set model.config.sliding_window = 10_000, I am able to successfully call model.generate() on the batch that was giving me the RuntimeError: The expanded size of the tensor (2047) ... error.
So I think that the solution is to update
if sliding_window is not None and kv_seq_len > sliding_window:
to check whether flash-attention is installed and supports a window size, similar to what Phi-3 is doing.
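A minimal sketch of what that change could look like in unsloth's llama.py (hypothetical and untested; HAS_FLASH_ATTENTION is assumed to be whatever flag unsloth already uses to detect flash-attn, and the exact name may differ):

import inspect

_supports_window_size = False
if HAS_FLASH_ATTENTION:  # assumed existing flag; substitute unsloth's real check
    from flash_attn import flash_attn_func
    _supports_window_size = "window_size" in inspect.signature(flash_attn_func).parameters

if _supports_window_size and sliding_window is not None and kv_seq_len > sliding_window:
    # From https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/modeling_mistral.py#L193
    slicing_tokens = 1 - sliding_window
    Knn = Kn[:, :, slicing_tokens:, :]#.contiguous()
    Vnn = Vn[:, :, slicing_tokens:, :]#.contiguous()
    # Per the comment above, the attention mask should probably be sliced here
    # as well so it stays in sync with the trimmed K/V cache.

Without flash-attn (e.g. on a T4), this branch would simply be skipped, which mirrors the behaviour I get by setting model.config.sliding_window = 10_000.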