So it looks like using the sliding-chunk matmul is not the way to go. I can't figure out what's happening to attn_scores and how it ends up shaped, so I can't see how to apply the position bias to it.
# POSITION_BIAS here: stack 2*one_sided_attn_window_size+1 worth of bias in the last dimension
position_bias2 = self._sliding_chunks_query_key_matmul(
position_bias.new_ones(size=position_bias.size()), position_bias, self.one_sided_attn_window_size
)
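As far as I can tell from the Longformer code, attn_scores coming out of _sliding_chunks_query_key_matmul is banded: shape (batch, seq_len, num_heads, 2*w + 1), where column j of query i's row holds the score against key i - w + j. Here is a slow dense sketch of that layout, for intuition only; it is not the actual chunked implementation, and the -inf fill for out-of-window positions is just illustrative:

import torch

def banded_scores_reference(query, key, w):
    # query, key: (batch, seq_len, num_heads, head_dim)
    # Returns (batch, seq_len, num_heads, 2*w + 1): row i holds query i's scores
    # against keys i-w .. i+w; out-of-sequence positions are filled with -inf
    # here purely for illustration.
    bsz, seq_len, num_heads, head_dim = query.shape
    full = torch.einsum("bqhd,bkhd->bhqk", query, key)  # dense (bsz, heads, q, k) scores
    banded = full.new_full((bsz, num_heads, seq_len, 2 * w + 1), float("-inf"))
    for i in range(seq_len):
        lo, hi = max(0, i - w), min(seq_len, i + w + 1)
        banded[:, :, i, lo - i + w : hi - i + w] = full[:, :, i, lo:hi]
    return banded.permute(0, 2, 1, 3)  # (bsz, seq_len, heads, 2*w + 1)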
Thanks, @ontocord! It would be great if we can get an LED based on T5. We gave it a try but the PR is still WIP. Check here: https://github.com/allenai/longformer/pull/149 IIRC, the key idea is in this function: https://github.com/allenai/longformer/blob/t5/longformer/longformer.py#L144-L157 If this is not helpful enough, please let me know and I can explain it in more detail later.
@ibeltagy, what do you think of something like this? I think it works!! The relative position tensor is over the window_overlap (128), and not the attention_window (512)
relative_position = torch.tensor([[i-window_overlap for i in range(2*window_overlap+1)]])
relative_position_bucket = self._relative_position_bucket(
    relative_position,  # shape (1, 2*window_overlap+1)
    bidirectional=True,
    num_buckets=self.relative_attention_num_buckets,
)
relative_position_bucket = relative_position_bucket.to(self.relative_attention_bias.weight.device)
values = self.relative_attention_bias(relative_position_bucket)  # shape (1, 2*window_overlap+1, num_heads)
position_bias = values.permute([0, 2, 1]).unsqueeze(0)  # shape (1, 1, num_heads, 2*window_overlap+1)
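With a single-row relative_position like this, the resulting bias has shape (1, 1, num_heads, 2*window_overlap+1), so it should broadcast over batch and sequence length when added to the banded attention scores. A minimal shape check (tiny placeholder sizes, just to illustrate the broadcast):

import torch

# Tiny placeholder sizes; the real values depend on the model and window.
batch_size, seq_len, num_heads, window_overlap = 2, 16, 8, 4
attn_scores = torch.zeros(batch_size, seq_len, num_heads, 2 * window_overlap + 1)
position_bias = torch.zeros(1, 1, num_heads, 2 * window_overlap + 1)
attn_scores = attn_scores + position_bias  # broadcasts over batch and seq_len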
And the test:
from transformers import AutoTokenizer, T5ForConditionalGeneration, pipelines
model = T5ForConditionalGeneration.from_pretrained('t5-small-long')
tokenizer = AutoTokenizer.from_pretrained("t5-small")
tokenizer.model_max_length=1000000000
#print (tokenizer)
p = pipelines.pipeline("text2text-generation", model=model, tokenizer=tokenizer, device=0)
print (p("""question: Where was Lincoln born? context:
Abraham Lincoln (/ˈlɪŋkən/; February 12, 1809 – April 15, 1865) was an American statesman and lawyer who served as the 16th president of the United States from 1861 until his assassination in 1865. Lincoln led the nation through the American Civil War, the country's greatest moral, constitutional, and political crisis. He succeeded in preserving the Union, abolishing slavery, bolstering the federal government, and modernizing the U.S. economy.
Lincoln was born into poverty in a log cabin and was raised on the frontier primarily in Indiana. He was self-educated and became a lawyer, Whig Party leader, Illinois state legislator, and U.S. Congressman from Illinois. In 1849, he returned to his law practice but became vexed by the opening of additional lands to slavery as a result of the Kansas–Nebraska Act. He reentered politics in 1854, becoming a leader in the new Republican Party, and he reached a national audience in the 1858 debates against Stephen Douglas. Lincoln ran for President in 1860, sweeping the North in victory. Pro-slavery elements in the South equated his success with the North's rejection of their right to practice slavery, and southern states began seceding from the union. To secure its independence, the new Confederate States fired on Fort Sumter, a U.S. fort in the South, and Lincoln called up forces to suppress the rebellion and restore the Union.
As the leader of moderate Republicans, Lincoln had to navigate a contentious array of factions with friends and opponents on both sides. War Democrats rallied a large faction of former opponents into his moderate camp, but they were countered by Radical Republicans, who demanded harsh treatment of the Southern Confederates. Anti-war Democrats (called "Copperheads") despised him, and irreconcilable pro-Confederate elements plotted his assassination. Lincoln managed the factions by exploiting their mutual enmity, by carefully distributing political patronage, and by appealing to the U.S. people. His Gettysburg Address became a historic clarion call for nationalism, republicanism, equal rights, liberty, democracy and freedom.
"""))
[{'generated_text': 'Indiana'}]
But when I ask "Who hated Lincoln?" with t5-small-long, I get:
[{'generated_text': 'anti-war Democrats (called "Copperheads") despised him, and irre'}]
But asking in t5-small, I get:
[{'generated_text': 'Anti-war Democrats'}]
I think there's something going on with the relative_position still (maybe in the extra column?)
I've updated the code on my repository so you can see.
relative_position = torch.tensor([[i-window_overlap for i in range(2*window_overlap+1)]])
The relative position tensor is over the window_overlap (128), and not the attention_window (512)
For an attention_window = 512, the relative positions need to be from -256 to 256. What you have here is -128 to 128. I am not sure how the -128 to 128 works; it will give you a tensor with dimensions that don't fit here: attn_scores += diagonal_mask + position_bias2
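To make the mismatch concrete (a sketch, using one_sided_attn_window_size = attention_window // 2 as in the Longformer code):

import torch

attention_window = 512
w = attention_window // 2                        # one-sided window: 256
attn_scores = torch.zeros(1, 512, 8, 2 * w + 1)  # (batch, seq_len, heads, 513)
bias_128 = torch.zeros(1, 1, 8, 2 * 128 + 1)     # 257 columns: -128 .. 128
bias_256 = torch.zeros(1, 1, 8, 2 * 256 + 1)     # 513 columns: -256 .. 256
_ = attn_scores + bias_256                       # broadcasts cleanly
# attn_scores + bias_128 -> RuntimeError: last dimensions 513 and 257 don't match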
And the test:
I would recommend a unit test with input seqlen < 512, then assert that the hidden states you get from t5-small-long perfectly match those from t5-small. This helps with debugging because if the hidden states don't match, you can step through both models to find the discrepancy.
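Something along these lines (a sketch; "t5-small-long" stands in for however you load the converted model, which may need the custom class from your repo, and atol may need loosening):

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
reference = T5ForConditionalGeneration.from_pretrained("t5-small").eval()
# "t5-small-long" stands in for however the converted long model is loaded.
long_model = T5ForConditionalGeneration.from_pretrained("t5-small-long").eval()

text = "translate English to German: The house is wonderful."  # well under 512 tokens
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    ref_hidden = reference.encoder(**inputs).last_hidden_state
    long_hidden = long_model.encoder(**inputs).last_hidden_state

assert torch.allclose(ref_hidden, long_hidden, atol=1e-4), (ref_hidden - long_hidden).abs().max()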
@ibeltagy, my mistake. Yes, the overlap window is 256, not 128. I meant that the code should refer to window_overlap, which is what made it work. The code you referenced in https://github.com/allenai/longformer/blob/t5/longformer/longformer.py#L144-L157 uses the whole attention_window*2, which would cause issues.
relative_position = torch.tensor([[i-self.attention_window for i in range(2*self.attention_window+1)]])
There are still bugs, so I'll step through the hidden states per your suggestion. Thanks again!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
🚀 Adding Longformer Encoder Decoder support for T5
LED is great for long-form encoder-decoder processing of documents, but it is based only on BART. T5 has certain advantages, such as being designed for multiple tasks (QA, summarization, etc.) and using relative positioning.
T5's relative positioning maps well onto sliding-chunk attention and should not require additional training to learn new relative position buckets. Adding LED support would permit any already-trained T5 model to be used efficiently on long documents.
I've started incorporating LED features into the encoder portion of T5 but have some questions about the position_bias and the implementation details of T5 and LED. With some help understanding how the sliding-window multiplication works in LED and how the relative position is organized, I think I can finish the implementation.
In particular, T5 passes a position_bias that, along with the mask, is added in each layer. This bias is added to each score before the softmax.
I've surmised that I can add the position_bias to the mask in the Longformer self-attention, and the result should then mostly be the same as the original T5 self-attention.
T5's position_bias has the shape (batch_size, n_heads, seq_length, key_length). But the mask used for LED has the shape (batch_size, seq_length), which is then expanded over n_heads and passed through the sliding-chunk multiplication to stack the mask. I permute the position_bias and run it through the sliding-chunk multiplication as well to stack the bias, so that the position bias can be added to the mask.
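My mental model of the dense computation being reproduced is roughly the following (shapes only, placeholder sizes; the bias and mask are just additive terms before the softmax):

import torch

# Placeholder sizes; in T5 the extended mask is 0 for kept positions and a large
# negative number for masked ones, and it is added together with the relative
# position bias before the softmax.
batch_size, n_heads, seq_len = 2, 8, 512
scores = torch.randn(batch_size, n_heads, seq_len, seq_len)  # query @ key^T
position_bias = torch.randn(1, n_heads, seq_len, seq_len)    # relative-position bias
mask = torch.zeros(batch_size, 1, 1, seq_len)                # additive mask (0 / -inf)
attn_weights = torch.softmax(scores + position_bias + mask, dim=-1)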
I tried a test with an attention_window of 512 and exactly 512 tokens, which should make it equivalent to T5 self-attention, but something seems to be off.
The encoder produces a tensor that, surprisingly, can be decoded by the decoder, which is encouraging, but it's not producing an answer for QA, for example.
I noticed that T5 doesn't use sqrt(key_value_proj_dim) normalization and has an extra mapping through the o tensor. I tried with and without the sqrt, but no luck either way.
Am I getting something mixed up with the position_bias?
@ibeltagy @patrickvonplaten @sgugger any help would be much appreciated. Happy to contribute this as a PR when completed.
Current code: https://github.com/ontocord/t5_led/blob/main/t5_ext.py
relevant portion: