Closed: janelu9 closed this issue 9 months ago
Same problem with Baichuan 1: RuntimeError: invalid dtype for bias - should match query's dtype
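For context, a minimal sketch (my own, not vLLM code) of the mismatch behind this error, assuming fp16 inference: the padded ALiBi bias tensor ends up allocated with the default float32 dtype, while the query tensor handed to xformers is float16, and xformers requires the two dtypes to match.

import torch

# Hypothetical shapes, for illustration only.
prompt_len, num_heads, head_size = 5, 4, 64
padded_len = (prompt_len + 7) // 8 * 8

# The bias is allocated without an explicit dtype, so it defaults to float32 ...
bias = torch.empty(1, num_heads, prompt_len, padded_len)
# ... while the query used by xformers is float16 during fp16 inference.
query = torch.randn(prompt_len, num_heads, head_size, dtype=torch.float16)

# This mismatch is what triggers
# "invalid dtype for bias - should match query's dtype".
print(bias.dtype, query.dtype)  # torch.float32 torch.float16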
I added a new module, PagedAttentionBaichuan, to make sure the ALiBi bias is the same as HuggingFace's.
vllm/vllm/model_executor/layers/attention.py:
class PagedAttentionBaichuan(PagedAttentionWithALiBi):
    """PagedAttention with Baichuan's ALiBi attention bias."""

    def __init__(self,
                 num_heads: int,
                 head_size: int,
                 scale: float,
                 slopes: List[float],
                 num_kv_heads: Optional[int] = None) -> None:
        # The parent class takes the slopes as its fourth argument and
        # registers them as a 1D buffer; we overwrite that buffer below
        # with a broadcast-friendly (num_heads, 1, 1) shape.
        super().__init__(num_heads, head_size, scale, slopes, num_kv_heads)
        slopes = torch.tensor(slopes, dtype=torch.float32)[:, None, None]
        self.register_buffer("alibi_slopes", slopes, persistent=False)

    def set_attn_bias(self, input_metadata: InputMetadata,
                      dtype: torch.dtype) -> None:
        if input_metadata.attn_bias:
            # Already set by a previous layer.
            return
        # Generate Baichuan's ALiBi mask (positions 0 .. prompt_len - 1)
        # for each prompt.
        for prompt_len in input_metadata.prompt_lens:
            # xformers requires the bias to be sliced from a tensor whose
            # last dimension is a multiple of 8.
            bias = torch.empty(
                1,  # batch_size
                self.num_heads,
                prompt_len,
                (prompt_len + 7) // 8 * 8,
                device=self.alibi_slopes.device,
                dtype=dtype,
            )[:, :, :, :prompt_len].copy_(torch.arange(prompt_len))
            bias.mul_(self.alibi_slopes)
            attn_bias = LowerTriangularMaskWithTensorBias(bias)
            input_metadata.attn_bias.append(attn_bias)
vllm/vllm/model_executor/models/baichuan.py#L150:
self.attn = PagedAttentionBaichuan(self.num_heads, self.head_dim,
scaling, alibi_slopes)
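For reference only (not part of the patch), here is the standard ALiBi slope computation from the ALiBi paper; as far as I can tell, Baichuan-13B's HuggingFace modeling code uses the same head-count logic, so the alibi_slopes passed above should come out identical. get_alibi_slopes is a hypothetical helper name used only in this sketch.

import math

def get_alibi_slopes(n_heads: int):
    # Slopes for a power-of-2 head count: a geometric sequence starting at
    # 2 ** (-8 / n_heads).
    def power_of_2_slopes(n):
        start = 2 ** (-(2 ** -(math.log2(n) - 3)))
        return [start * (start ** i) for i in range(n)]

    if math.log2(n_heads).is_integer():
        return power_of_2_slopes(n_heads)
    # For non-power-of-2 head counts (e.g. 40 heads in Baichuan-13B), take the
    # closest power of 2 and interleave slopes from the next power of 2.
    closest = 2 ** math.floor(math.log2(n_heads))
    return (power_of_2_slopes(closest) +
            get_alibi_slopes(2 * closest)[0::2][:n_heads - closest])

print(get_alibi_slopes(8))  # [0.5, 0.25, ..., 0.00390625]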
vllm/csrc/attention/attention_kernels.cu#L181:
qk += alibi_slope * token_idx;
I also changed the prompt position used in the ALiBi bias so that it runs from 0 to prompt_len - 1 rather than from -(prompt_len - 1) to 0.
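A small sketch (my own, not from the patch) of what that position change means for the prefill bias, assuming a single head with slope s: the HF/Baichuan convention builds each row from positions 0 .. prompt_len - 1, while vLLM's default ALiBi bias uses j - i, i.e. values increasing up to 0 along each causal row. The two differ only by a constant per query row.

import torch

prompt_len, slope = 4, 0.5
pos = torch.arange(prompt_len)

# Baichuan/HF-style bias: each row is slope * [0, 1, ..., prompt_len - 1].
baichuan_bias = slope * pos[None, :].repeat(prompt_len, 1)

# vLLM's default ALiBi bias: slope * (j - i), which increases up to 0 on the
# causal (j <= i) part of each row.
vllm_bias = slope * (pos[None, :] - pos[:, None])

# Each row of the difference is a constant (slope * i), which softmax cancels.
print(baichuan_bias - vllm_bias)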
It works when I change the code for creating the attention bias (https://github.com/vllm-project/vllm/blob/852ef5b4f5481ce526c804ea234d1de0df91f48d/vllm/model_executor/layers/attention.py#L199C1-L199C47). The basic idea is to pass the query dtype into the set_attn_bias call, i.e. self.set_attn_bias(input_metadata, query.dtype), and to change the set_attn_bias method to:
def set_attn_bias(self,
                  input_metadata: InputMetadata,
                  dtype: torch.dtype = torch.float32) -> None:
    if input_metadata.attn_bias:
        # Already set by a previous layer.
        return
    # Generates ALiBi mask for each prompt.
    for prompt_len in input_metadata.prompt_lens:
        bias = torch.arange(prompt_len)
        # Note(zhuohan): HF uses
        #     `bias = bias[None, :].repeat(prompt_len, 1)`
        # here. We find that both biases give the same results, but
        # the bias below more accurately follows the original ALiBi
        # paper.
        bias = bias[None, :] - bias[:, None]
        bias = bias.to(self.alibi_slopes.device)

        # When using custom attention bias, xformers requires the bias to
        # be sliced from a tensor whose length is a multiple of 8.
        padded_len = (prompt_len + 7) // 8 * 8
        bias = torch.empty(
            1,  # batch_size
            self.num_heads,
            prompt_len,
            padded_len,
            device=self.alibi_slopes.device,
            dtype=dtype,  # add this
        )[:, :, :, :prompt_len].copy_(bias)
        bias.mul_(self.alibi_slopes[:, None, None])
        attn_bias = LowerTriangularMaskWithTensorBias(bias)
        input_metadata.attn_bias.append(attn_bias)
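A quick way to sanity-check the fix end to end (a hypothetical smoke test of mine, assuming baichuan-inc/Baichuan-13B-Chat is available and fp16 inference, since that is the dtype that triggered the error):

from vllm import LLM, SamplingParams

# Run a short fp16 generation and make sure no
# "invalid dtype for bias" RuntimeError is raised, then compare the
# output against HuggingFace for the same prompt.
llm = LLM(model="baichuan-inc/Baichuan-13B-Chat",
          trust_remote_code=True,
          dtype="float16")
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(temperature=0.0, max_tokens=32))
print(outputs[0].outputs[0].text)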
Did you find that HuggingFace's ALiBi is different from vLLM's ALiBi for Baichuan? One increases from 0, while the other increases up to 0.
It works for me too.
I do find that vLLM's Baichuan results are misaligned with HuggingFace's. I will check it and try your code later!
Same here, and I'm using vLLM 0.1.6, but it works fine when I use vLLM 0.1.3.
I found a comment in the vLLM source code (vllm/model_executor/layers/attention.py, line 355) saying the two biases give the same results; it is the Note(zhuohan) comment quoted in the code above.
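A quick numerical check of that claim (my own sketch, not from the vLLM source): with a causal mask, the two biases differ only by a per-row constant, so the softmaxed attention weights come out identical.

import torch

prompt_len, slope = 6, 0.25
pos = torch.arange(prompt_len)
scores = torch.randn(prompt_len, prompt_len)          # dummy attention scores
causal = torch.tril(torch.ones(prompt_len, prompt_len, dtype=torch.bool))

bias_hf = slope * pos[None, :].expand(prompt_len, -1)  # HF/Baichuan: slope * j
bias_vllm = slope * (pos[None, :] - pos[:, None])      # vLLM: slope * (j - i)

def causal_softmax(s):
    return torch.softmax(s.masked_fill(~causal, float("-inf")), dim=-1)

# The per-row constant difference (slope * i) cancels inside softmax.
print(torch.allclose(causal_softmax(scores + bias_hf),
                     causal_softmax(scores + bias_vllm)))  # True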
vllm==0.1.5