lucidrains / h-transformer-1d

Implementation of H-Transformer-1D, Hierarchical Attention for Sequence Learning
MIT License

I have some questions about implementation details #1

Closed jinmang2 closed 3 years ago

jinmang2 commented 3 years ago

Thanks for making your implementation public. I have three questions about your h-transformer-1d implementation.

1. The number of levels M

https://github.com/lucidrains/h-transformer-1d/blob/63063d5bb036b56a7205aadc5c8198da02d698f6/h_transformer_1d/h_transformer_1d.py#L105-L114

In the paper, eq (32) gives a guide on how M is determined.

[image: eq. (32) from the paper]

In your code implementation, `num_levels` is computed as in the snippet linked above.

However, in Section 6.1, Constructing Hierarchical Attention, I found that the sequence length L must be a multiple of 2, and in my opinion the L in eq (31) is equal to 2^{M+1}. In the implementation, `n` is not the padded sequence length, so one level of M is missing.

Since `x` is the sequence after the padding applied here, https://github.com/lucidrains/h-transformer-1d/blob/63063d5bb036b56a7205aadc5c8198da02d698f6/h_transformer_1d/h_transformer_1d.py#L83-L91

I think the above implementation should be modified as below:

107    num_levels = int(log2(x.size(1) // bsz)) - 1 
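For illustration, here is a minimal sketch of what I mean, with both the padding and the level count derived from the padded length. The helper names `padded_len` and `num_levels` are my own, not the repo's, and `bsz` stands for the block size as above:

```python
from math import ceil, log2

def padded_len(n, bsz):
    # hypothetical helper: pad n up to bsz * 2**k, i.e. a power-of-two number of
    # blocks, mirroring the padding linked above (not the repo's exact code)
    return bsz * 2 ** ceil(log2(max(1, ceil(n / bsz))))

def num_levels(n, bsz):
    # the suggested fix: derive the level count from the padded length,
    # i.e. int(log2(x.size(1) // bsz)) - 1, where x.size(1) == padded_len(n, bsz)
    return int(log2(padded_len(n, bsz) // bsz)) - 1

# e.g. n = 100, bsz = 16 -> padded to 128 (8 blocks) -> num_levels = 2
```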

2. Super- and sub-diagonal blocks of the coarsened matrix \tilde{A} at level l

https://github.com/lucidrains/h-transformer-1d/blob/63063d5bb036b56a7205aadc5c8198da02d698f6/h_transformer_1d/h_transformer_1d.py#L190-L198

`Ys` contains the `Y` and `A` pairs computed by `calculate_Y_and_A`. For example:

# Ys
[
    (batch_size*n_heads, N_b(2), N_r), (batch_size*n_heads, N_b(2)),  # level 2, (Y(2), tilde_A(2))
    (batch_size*n_heads, N_b(1), N_r), (batch_size*n_heads, N_b(1)),  # level 1, (Y(1), tilde_A(1))
    (batch_size*n_heads, N_b(0), N_r), (batch_size*n_heads, N_b(0)),  # level 0, (Y(0), tilde_A(0))
]

In eq (29), Y is calculated as Y = AV = Y(0) + P(0)(Y(1) + P(1)Y(2)). However, at code line 190, Y is calculated using only the level-0 and level-1 blocks, no matter how many levels M there are: Y = AV = Y(0) + P(0)Y(1).
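For reference, this is a rough sketch of the all-levels recursion in eq (29) that I would expect. The `interpolate` helper only stands in for P(l); it is a placeholder, not the repo's actual prolongation:

```python
import torch

def interpolate(t):
    # placeholder for P(l): expand a coarse level to the next finer one by
    # repeating each block along the block dimension; the repo may do this differently
    return torch.repeat_interleave(t, 2, dim = 1)

def combine_levels(Ys):
    # Ys: list of (Y(l), tilde_A(l)) pairs ordered from the coarsest level M down to level 0,
    # as in the example above
    Y, A = Ys[0]
    for Y_l, A_l in Ys[1:]:
        Y = Y_l + interpolate(Y)  # accumulates Y(l) + P(l)(Y(l+1) + P(l+1)(...))
        A = A_l + interpolate(A)
    return Y, A  # equals Y(0) + P(0)(Y(1) + P(1)(... Y(M)))
```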

Does increasing the number of levels cause performance degradation in the implementation? I'm so curious!


3. Comparison with Luna: Linear Unified Nested Attention

H-Transformer-1D significantly exceeded the scores of BigBird and Luna on LRA. However, one thing I missed while reading the paper was a comparison of computation time with other sub-quadratic attention methods, including Luna. Is this algorithm much faster than the other sub-quadratic approaches? And how does it compare to Luna?
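In case it helps, the kind of comparison I have in mind is just a rough wall-clock measurement like the sketch below. The constructor arguments follow this repo's README and may need adjusting; the baseline to compare against (e.g. a Luna implementation) would be dropped in as another `nn.Module`:

```python
import time
import torch
from h_transformer_1d import HTransformer1D

model = HTransformer1D(
    num_tokens = 256, dim = 512, depth = 6,
    max_seq_len = 8192, heads = 8, block_size = 128, causal = False
)

x = torch.randint(0, 256, (1, 8192))  # dummy token sequence

with torch.no_grad():
    model(x)  # warm-up pass
    start = time.perf_counter()
    for _ in range(10):
        model(x)
    print(f'{(time.perf_counter() - start) / 10:.3f} s per forward pass')
```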


Thanks again for releasing the implementation! The idea of calculating the off-diagonal blocks with a flip was amazing and I learned a lot. Thank you!! 😄

lucidrains commented 3 years ago

@jinmang2 Hi, thank you for catching these bugs :) I've put in fixes in the latest commit

I think Luna is a strong contender for the encoder. The recent results DeepMind has released with Perceiver (https://github.com/lucidrains/perceiver-pytorch) would corroborate that approach

jinmang2 commented 3 years ago

Thanks for the reply! I will also read Perceiver 🥰

jinmang2 commented 3 years ago

Thank you for fixing this bug! 5660abd Closing this issue :D