Thanks for sharing the good work. I have a couple of questions about the constant TOKEN_SELF_ATTN_VALUE and how it is used.
TOKEN_SELF_ATTN_VALUE is first defined here in reformer_pytorch.py with a comment saying "carefully set for half precision to work". Later, it's used in LSHAttention and FullQKAttention to mask out attention to self except when no other targets are available.
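For context, here is a minimal sketch of the kind of self-masking I mean (the tensor names and shapes are my own illustration, not the library's exact code):

```python
import torch

TOKEN_SELF_ATTN_VALUE = -5e4  # the constant in question; stays finite in fp16 (max magnitude ~65504)

# toy attention logits: batch of 1, 4 queries, 4 keys
dots = torch.randn(1, 4, 4)

# fill the diagonal (a token attending to itself) with the large negative value,
# so after softmax the self position gets near-zero weight unless it is the only option
seq_len = dots.shape[-1]
self_mask = torch.eye(seq_len, dtype=torch.bool).unsqueeze(0)
dots = dots.masked_fill(self_mask, TOKEN_SELF_ATTN_VALUE)

attn = dots.softmax(dim=-1)
print(attn[0].diagonal())  # near-zero self-attention weights
```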
How is the value of -5e4 decided?
Why does the QK attention require that tokens not attend to themselves?
Thank you!