conceptofmind closed this issue 2 years ago
Sorry, I'm not sure what the problem is. ALiBi should be re-implemented by HuggingFace soon (since it's a component of the BigScience BLOOM model), so you can check out their code too.
Hi @ofirpress,
Thank you for taking a look.
I ended up resolving the issue by using a custom Alibi class with the help of a peer.
Best,
Enrico
For anyone who is interested:
```python
# ALiBi positional bias (adapted from lucidrains' x-transformers)
from math import floor, log2

import torch
import torch.nn.functional as F
from einops import rearrange
from torch import nn


def exists(val):
    return val is not None


class AlibiPositionalBias(nn.Module):
    def __init__(self, heads, **kwargs):
        super().__init__()
        self.heads = heads
        slopes = torch.Tensor(self._get_slopes(heads))
        slopes = rearrange(slopes, 'h -> h 1 1')
        self.register_buffer('slopes', slopes, persistent=False)
        self.register_buffer('bias', None, persistent=False)

    def get_bias(self, i, j, device):
        # relative-distance matrix: 0 on the diagonal,
        # increasingly negative away from it
        i_arange = torch.arange(i, device=device)
        j_arange = torch.arange(j, device=device)
        bias = -torch.abs(
            rearrange(j_arange, 'j -> 1 1 j') - rearrange(i_arange, 'i -> 1 i 1')
        )
        return bias

    @staticmethod
    def _get_slopes(heads):
        def get_slopes_power_of_2(n):
            start = 2 ** (-2 ** -(log2(n) - 3))
            ratio = start
            return [start * ratio ** i for i in range(n)]

        if log2(heads).is_integer():
            return get_slopes_power_of_2(heads)

        # non-power-of-2 head counts interleave the two nearest schedules
        closest_power_of_2 = 2 ** floor(log2(heads))
        return (get_slopes_power_of_2(closest_power_of_2)
                + get_slopes_power_of_2(2 * closest_power_of_2)[0::2][:heads - closest_power_of_2])

    def forward(self, qk_sim):
        h, i, j, device = *qk_sim.shape[-3:], qk_sim.device

        # reuse the cached bias when it is large enough
        if exists(self.bias) and self.bias.shape[-1] >= j:
            return qk_sim + self.bias[..., :i, :j]

        bias = self.get_bias(i, j, device)
        bias = bias * self.slopes

        # zero-pad the head dimension if there are more heads than slopes
        num_heads_unalibied = h - bias.shape[0]
        bias = F.pad(bias, (0, 0, 0, 0, 0, num_heads_unalibied))
        self.register_buffer('bias', bias, persistent=False)
        return qk_sim + self.bias
```
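As a quick sanity check on the slope schedule above, here is a standalone sketch of the same computation in plain Python (not tied to the class; the function name `alibi_slopes` is just for illustration):

```python
import math

def alibi_slopes(heads):
    # slopes for a power-of-2 head count: geometric sequence starting at 2^(-8/n)
    def power_of_2_slopes(n):
        start = 2 ** (-2 ** -(math.log2(n) - 3))
        return [start * start ** i for i in range(n)]

    if math.log2(heads).is_integer():
        return power_of_2_slopes(heads)

    # otherwise interleave slopes from the two nearest power-of-2 schedules
    closest = 2 ** math.floor(math.log2(heads))
    return (power_of_2_slopes(closest)
            + power_of_2_slopes(2 * closest)[0::2][:heads - closest])

print(alibi_slopes(8))        # [0.5, 0.25, 0.125, ..., 0.00390625]
print(len(alibi_slopes(12)))  # 12 slopes for a non-power-of-2 head count
```

For 8 heads this reproduces the geometric sequence 1/2, 1/4, ..., 1/256 from the ALiBi paper.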
Hi @ofirpress,
I am working on implementing ALiBi in a parallel-attention Transformer. I removed the positional embeddings from the model, set up a relative ALiBi bias matrix, calculated the slopes, and then added the ALiBi attention bias to the causal mask. Unfortunately, I am not getting the correct number of trainable parameters. Could you take a quick look and see if anything is noticeably wrong in the code below?
Code
Function for slopes:
Calculate the alibi bias:
Build the Parallel Attention Block:
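The snippets referenced above didn't come through in this thread, but as a rough sketch of the described setup (assuming a power-of-2 head count, so the slopes simplify to 2^(-8(k+1)/heads)), merging the ALiBi bias into a causal mask might look like:

```python
import torch

heads, seq = 4, 6

# fixed (non-trainable) per-head slopes for a power-of-2 head count
slopes = torch.tensor([2 ** (-8 * (k + 1) / heads) for k in range(heads)])

# relative-distance ALiBi bias, shape (heads, seq, seq)
pos = torch.arange(seq)
alibi = -(pos[None, :] - pos[:, None]).abs()[None] * slopes.view(heads, 1, 1)

# standard causal mask: -inf above the diagonal
causal = torch.triu(torch.full((seq, seq), float('-inf')), diagonal=1)

# combined additive bias for the attention logits; note this introduces
# no trainable parameters, since the slopes are constants
attn_bias = alibi + causal
```

One thing worth noting for the parameter-count question: ALiBi itself adds zero trainable parameters (the slopes are fixed constants), so any discrepancy would have to come from removing the positional embeddings or from the attention block itself.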
Any help would be greatly appreciated!
Thank you,
Enrico