lucidrains / x-transformers

A concise but complete full-attention transformer with a set of promising experimental features from various papers

Should the mask option to AttentionLayers(boolean) increase memory? #261

Closed: blasscoc closed this issue 2 months ago

blasscoc commented 2 months ago

Should the mask option in AttentionLayers be using up so much memory?

Behavior:

When I pass the mask, memory consumption is 6326 MiB; without the mask, it is 1364 MiB.

import torch

from x_transformers.x_transformers import AttentionLayers

attn_config = { "dim": 128, "depth": 4, "heads": 6, "ff_mult": 8, "attn_flash": True }

encoder = AttentionLayers(**attn_config)
encoder.cuda()

x = torch.randn(10, 4000, 128)
mask = torch.ones(10, 4000).bool()

x = x.to('cuda')
mask = mask.to('cuda')

while True:
    with torch.no_grad():
        output = encoder(x, mask=mask)   # ~6326 MiB reported by nvidia-smi
        # output = encoder(x)            # ~1364 MiB without the mask
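For reference, the two numbers can also be read off programmatically with PyTorch's peak-memory counters instead of nvidia-smi. This is a sketch that assumes encoder, x and mask are defined as above; the absolute values will differ from nvidia-smi (the CUDA context is not counted), but the gap should show up the same way.

import torch

def peak_mib(run):
    # reset the allocator's peak-memory counter, run one forward pass,
    # and report the peak allocation in MiB
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        run()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

print('with mask   :', peak_mib(lambda: encoder(x, mask=mask)), 'MiB')
print('without mask:', peak_mib(lambda: encoder(x)), 'MiB')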
blasscoc commented 2 months ago

I followed the memory consumption through to the call to scaled_dot_product_attention in the "flash_attn" function and saw that the memory increases inside that call. I verified that the mask's dtype, when initially bool, stayed bool throughout, and that no significant copies were being made.
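The same effect can be observed in isolation by calling scaled_dot_product_attention directly with shapes mirroring the repro above. This is only a sketch: the exact shape and broadcasting of the mask when it reaches the call inside attend.py is an assumption here, and whether the gap reproduces depends on the PyTorch version and which SDPA backend gets selected.

import torch
import torch.nn.functional as F

# batch, heads, sequence length, head dim, mirroring the repro config
b, h, n, d = 10, 6, 4000, 64

# half precision so the flash kernel is at least eligible when no mask is passed
q = torch.randn(b, h, n, d, device='cuda', dtype=torch.float16)
k = torch.randn(b, h, n, d, device='cuda', dtype=torch.float16)
v = torch.randn(b, h, n, d, device='cuda', dtype=torch.float16)

# boolean key-padding mask, broadcastable to (b, h, n, n); True = attend
# (the shape used inside attend.py is assumed, not confirmed)
bool_mask = torch.ones(b, 1, 1, n, device='cuda', dtype=torch.bool)

def peak_mib(**kwargs):
    # peak allocator memory, in MiB, for a single attention call
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        F.scaled_dot_product_attention(q, k, v, **kwargs)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

print('no attn_mask  :', peak_mib(), 'MiB')
print('bool attn_mask:', peak_mib(attn_mask=bool_mask), 'MiB')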

The memory consumption scales roughly as the sequence length squared times the number of heads, so it can become quite large. Interestingly, casting the mask from bool to float on line 238 of attend.py resulted in only a modest increase in memory consumption, which was unexpected.
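For a rough sense of scale under that quadratic growth: if the boolean key-padding mask (or the attention scores it forces to be materialized) ends up as a full (batch, heads, seq, seq) tensor, the sizes land in the right ballpark for the jump reported above. The shapes and dtypes below are assumptions for illustration, not taken from attend.py.

# Back-of-the-envelope size of a fully materialized (batch, heads, n, n) tensor
# for the repro config (batch 10, 6 heads, sequence length 4000).
b, h, n = 10, 6, 4000
elements = b * h * n * n                              # 960,000,000 elements

print(f'{elements * 1 / 2**30:.2f} GiB as bool')      # ~0.89 GiB
print(f'{elements * 4 / 2**30:.2f} GiB as float32')   # ~3.58 GiB

The observed difference of roughly 5 GiB is of the same order, consistent with one or two such tensors being materialized somewhere along the masked path.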