Thanks for sharing the good work. I have a couple of questions about the constant TOKEN_SELF_ATTN_VALUE and how it is used.
TOKEN_SELF_ATTN_VALUE is first defined here in reformer_pytorch.py with a comment saying "carefully set for half precision to work". Later, it's used in LSHAttention and FullQKAttention to mask out attention to self except when no other targets are available.
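For context, here is a minimal sketch of the kind of self-masking I mean (the tensor names and shapes are my own illustration, not the library's exact code):

```python
import torch

TOKEN_SELF_ATTN_VALUE = -5e4  # the constant in question; stays finite in fp16 (max magnitude ~65504)

# toy attention logits: batch of 1, 4 queries, 4 keys
dots = torch.randn(1, 4, 4)

# fill the diagonal (a token attending to itself) with the large negative value,
# so after softmax the self position gets near-zero weight unless it is the only option
seq_len = dots.shape[-1]
self_mask = torch.eye(seq_len, dtype=torch.bool).unsqueeze(0)
dots = dots.masked_fill(self_mask, TOKEN_SELF_ATTN_VALUE)

attn = dots.softmax(dim=-1)
print(attn[0].diagonal())  # near-zero self-attention weights
```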
How is the value of -5e4 decided?
Why does the QK attention require that tokens not attend to themselves?
Thank you!