So it looks like using the sliding-chunk matmul is not the way to go. I can't figure out what's happening to attn_scores and how it ends up shaped, so I can't see how to apply the position bias to it.
# POSITION_BIAS here: stack 2*one_sided_attn_window_size+1 worth of bias in the last dimension
position_bias2 = self._sliding_chunks_query_key_matmul(
position_bias.new_ones(size=position_bias.size()), position_bias, self.one_sided_attn_window_size
)
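As far as I can tell from the Longformer code, attn_scores coming out of _sliding_chunks_query_key_matmul is banded: shape (batch, seq_len, num_heads, 2*w + 1), where column j of query i's row holds the score against key i - w + j. Here is a slow dense sketch of that layout, for intuition only; it is not the actual chunked implementation, and the -inf fill for out-of-window positions is just illustrative:

import torch

def banded_scores_reference(query, key, w):
    # query, key: (batch, seq_len, num_heads, head_dim)
    # Returns (batch, seq_len, num_heads, 2*w + 1): row i holds query i's scores
    # against keys i-w .. i+w; out-of-sequence positions are filled with -inf
    # here purely for illustration.
    bsz, seq_len, num_heads, head_dim = query.shape
    full = torch.einsum("bqhd,bkhd->bhqk", query, key)  # dense (bsz, heads, q, k) scores
    banded = full.new_full((bsz, num_heads, seq_len, 2 * w + 1), float("-inf"))
    for i in range(seq_len):
        lo, hi = max(0, i - w), min(seq_len, i + w + 1)
        banded[:, :, i, lo - i + w : hi - i + w] = full[:, :, i, lo:hi]
    return banded.permute(0, 2, 1, 3)  # (bsz, seq_len, heads, 2*w + 1)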
Thanks, @ontocord! It would be great if we can get an LED based on T5. We gave it a try but the PR is still WIP. Check here: https://github.com/allenai/longformer/pull/149 IIRC, the key idea is in this function: https://github.com/allenai/longformer/blob/t5/longformer/longformer.py#L144-L157 If this is not helpful enough, please let me know and I can explain it in more detail later.
@ibeltagy, what do you think of something like this? I think it works!! The relative position tensor is over the window_overlap (128), and not the attention_window (512)
relative_position = torch.tensor([[i-window_overlap for i in range(2*window_overlap+1)]])
relative_position_bucket = self._relative_position_bucket(
    relative_position,  # shape (1, 2*window_overlap+1)
    bidirectional=True,
    num_buckets=self.relative_attention_num_buckets,
)
relative_position_bucket = relative_position_bucket.to(self.relative_attention_bias.weight.device)
values = self.relative_attention_bias(relative_position_bucket)  # shape (1, 2*window_overlap+1, num_heads)
position_bias = values.permute([0, 2, 1]).unsqueeze(0)  # shape (1, 1, num_heads, 2*window_overlap+1)
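With a single-row relative_position like this, the resulting bias has shape (1, 1, num_heads, 2*window_overlap+1), so it should broadcast over batch and sequence length when added to the banded attention scores. A minimal shape check (tiny placeholder sizes, just to illustrate the broadcast):

import torch

# Tiny placeholder sizes; the real values depend on the model and window.
batch_size, seq_len, num_heads, window_overlap = 2, 16, 8, 4
attn_scores = torch.zeros(batch_size, seq_len, num_heads, 2 * window_overlap + 1)
position_bias = torch.zeros(1, 1, num_heads, 2 * window_overlap + 1)
attn_scores = attn_scores + position_bias  # broadcasts over batch and seq_len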
And the test:
from transformers import AutoTokenizer, T5ForConditionalGeneration, pipelines
model = T5ForConditionalGeneration.from_pretrained('t5-small-long')
tokenizer = AutoTokenizer.from_pretrained("t5-small")
tokenizer.model_max_length=1000000000
#print (tokenizer)
p = pipelines.pipeline("text2text-generation", model=model, tokenizer=tokenizer, device=0)
print (p("""question: Where was Lincoln born? context:
Abraham Lincoln (/ˈlɪŋkən/; February 12, 1809 – April 15, 1865) was an American statesman and lawyer who served as the 16th president of the United States from 1861 until his assassination in 1865. Lincoln led the nation through the American Civil War, the country's greatest moral, constitutional, and political crisis. He succeeded in preserving the Union, abolishing slavery, bolstering the federal government, and modernizing the U.S. economy.
Lincoln was born into poverty in a log cabin and was raised on the frontier primarily in Indiana. He was self-educated and became a lawyer, Whig Party leader, Illinois state legislator, and U.S. Congressman from Illinois. In 1849, he returned to his law practice but became vexed by the opening of additional lands to slavery as a result of the Kansas–Nebraska Act. He reentered politics in 1854, becoming a leader in the new Republican Party, and he reached a national audience in the 1858 debates against Stephen Douglas. Lincoln ran for President in 1860, sweeping the North in victory. Pro-slavery elements in the South equated his success with the North's rejection of their right to practice slavery, and southern states began seceding from the union. To secure its independence, the new Confederate States fired on Fort Sumter, a U.S. fort in the South, and Lincoln called up forces to suppress the rebellion and restore the Union.
As the leader of moderate Republicans, Lincoln had to navigate a contentious array of factions with friends and opponents on both sides. War Democrats rallied a large faction of former opponents into his moderate camp, but they were countered by Radical Republicans, who demanded harsh treatment of the Southern Confederates. Anti-war Democrats (called "Copperheads") despised him, and irreconcilable pro-Confederate elements plotted his assassination. Lincoln managed the factions by exploiting their mutual enmity, by carefully distributing political patronage, and by appealing to the U.S. people. His Gettysburg Address became a historic clarion call for nationalism, republicanism, equal rights, liberty, democracy and freedom.
"""))
[{'generated_text': 'Indiana'}]
But when I ask "Who hated Lincoln?" with t5-small-long, I get:
[{'generated_text': 'anti-war Democrats (called "Copperheads") despised him, and irre'}]
But asking in t5-small, I get:
[{'generated_text': 'Anti-war Democrats'}]
I think there's something going on with the relative_position still (maybe in the extra column?)
I've updated the code on my repository so you can see.
relative_position = torch.tensor([[i-window_overlap for i in range(2*window_overlap+1)]])
The relative position tensor is over the window_overlap (128), and not the attention_window (512)
For an attention_window = 512, the relative positions need to be from -256 to 256. What you have here is -128 to 128. I am not sure how the -128 to 128 works; it will give you a tensor with dimensions that don't fit here: attn_scores += diagonal_mask + position_bias2
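To make the mismatch concrete (a sketch, using one_sided_attn_window_size = attention_window // 2 as in the Longformer code):

import torch

attention_window = 512
w = attention_window // 2                        # one-sided window: 256
attn_scores = torch.zeros(1, 512, 8, 2 * w + 1)  # (batch, seq_len, heads, 513)
bias_128 = torch.zeros(1, 1, 8, 2 * 128 + 1)     # 257 columns: -128 .. 128
bias_256 = torch.zeros(1, 1, 8, 2 * 256 + 1)     # 513 columns: -256 .. 256
_ = attn_scores + bias_256                       # broadcasts cleanly
# attn_scores + bias_128 -> RuntimeError: last dimensions 513 and 257 don't match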
And the test:
I would recommend a unit test with input seqlen < 512, then assert that the hidden states you get from t5-small-long perfectly match those from t5-small. This helps with debugging because if the hidden states don't match, you can step through both models to find the discrepancy.
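Something along these lines (a sketch; "t5-small-long" stands in for however you load the converted model, which may need the custom class from your repo, and atol may need loosening):

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
reference = T5ForConditionalGeneration.from_pretrained("t5-small").eval()
# "t5-small-long" stands in for however the converted long model is loaded.
long_model = T5ForConditionalGeneration.from_pretrained("t5-small-long").eval()

text = "translate English to German: The house is wonderful."  # well under 512 tokens
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    ref_hidden = reference.encoder(**inputs).last_hidden_state
    long_hidden = long_model.encoder(**inputs).last_hidden_state

assert torch.allclose(ref_hidden, long_hidden, atol=1e-4), (ref_hidden - long_hidden).abs().max()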
@ibeltagy, my mistake. Yes, the overlap window is 256, not 128. I meant that the code should refer to window_overlap, which is what made it work. The code you referenced in https://github.com/allenai/longformer/blob/t5/longformer/longformer.py#L144-L157 uses the whole attention_window*2, which would cause issues.
relative_position = torch.tensor([[i-self.attention_window for i in range(2*self.attention_window+1)]])
There are still bugs, so I'll step through the hidden states per your suggestion. Thanks again!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
🚀 Adding Longformer Encoder Decoder support for T5
LED is great for long-form encoder-decoder processing of documents, but it is based only on BART. T5 has certain advantages, such as being designed for multiple tasks (QA, summarization, etc.) and using relative positioning.
T5's relative positioning maps well onto sliding-chunk attention and should not require additional training to learn new relative position buckets. Adding LED support would permit any already-trained T5 model to be used efficiently on long documents.
I've started incorporating LED features into the encoder portion of T5 but have some questions about the position_bias and the implementation details of T5 and LED. With some help understanding how the sliding-window multiplication works in LED and how the relative position is organized, I think I can finish the implementation.
In particular, T5 passes a position_bias that, along with the mask, is added in each layer. This bias is added to each score before the softmax.
I've surmised that I can add the position_bias to the mask in the Longformer self-attention, and the result should then mostly be the same as the original T5 self-attention.
T5's position_bias has the shape (batch_size, n_heads, seq_length, key_length). But the mask used for LED has the shape (batch_size, seq_length), which is then expanded over n_heads and passed through the sliding-chunk multiplication to stack the mask. I permute the position_bias and run it through the sliding-chunk multiplication as well to stack the bias, so that the position bias can be added to the mask.
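My mental model of the dense computation being reproduced is roughly the following (shapes only, placeholder sizes; the bias and mask are just additive terms before the softmax):

import torch

# Placeholder sizes; in T5 the extended mask is 0 for kept positions and a large
# negative number for masked ones, and it is added together with the relative
# position bias before the softmax.
batch_size, n_heads, seq_len = 2, 8, 512
scores = torch.randn(batch_size, n_heads, seq_len, seq_len)  # query @ key^T
position_bias = torch.randn(1, n_heads, seq_len, seq_len)    # relative-position bias
mask = torch.zeros(batch_size, 1, 1, seq_len)                # additive mask (0 / -inf)
attn_weights = torch.softmax(scores + position_bias + mask, dim=-1)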
I tried a test with an attention_window of 512 and exactly 512 tokens, which should make it equivalent to T5 self-attention, but something seems to be off.
The encoder produces a tensor that, surprisingly, can be decoded by the decoder, which is encouraging, but it's not producing an answer for QA, for example.
I noticed that T5 doesn't use sqrt(key_value_proj_dim) normalization and has an extra mapping through the o tensor. I tried with and without the sqrt, but no luck either way.
Am I getting something mixed up with the position_bias?
@ibeltagy @patrickvonplaten @sgugger any help would be much appreciated. Happy to contribute this as a PR when completed.
Current code: https://github.com/ontocord/t5_led/blob/main/t5_ext.py
relevant portion: