Benetti-Hub opened 1 month ago
I have similar doubts about this problem. I don't know what happens to the causal mask after switching to flash attention:

```python
# att = q @ k^T / sqrt(hs)
# att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
# att = F.softmax(att, dim=-1)
# y = att @ v
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```
Even if flash attention is not used, I think `T` does not need to be strictly equal to `block_size`, since the manual path slices the mask to `[:, :, :T, :T]`.
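For what it's worth, here is a minimal self-contained sketch (illustrative shapes only, not the repo's actual module code) suggesting the two paths agree for any `T <= block_size`: the manual path slices the bias buffer down to `[:T, :T]`, while `is_causal=True` builds the causal mask from the runtime sequence length and never touches `block_size`.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only: batch 2, 4 heads, T=8 tokens, head size 16,
# with a model block_size of 32, so T < block_size.
B, nh, T, hs, block_size = 2, 4, 8, 16, 32
q, k, v = (torch.randn(B, nh, T, hs) for _ in range(3))

# Manual path: the bias buffer is allocated once at block_size but sliced
# down to [:T, :T], so any T <= block_size is handled.
bias = torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size)
att = (q @ k.transpose(-2, -1)) / (hs ** 0.5)
att = att.masked_fill(bias[:, :, :T, :T] == 0, float('-inf'))
att = F.softmax(att, dim=-1)
y_manual = att @ v

# Flash-attention path: is_causal=True derives the causal mask from the
# runtime T, with no reference to block_size at all.
y_flash = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(y_manual, y_flash, atol=1e-5))  # expected: True
```

So, as far as I can tell, causality is preserved under flash attention, and neither path requires `T == block_size`.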
First and foremost, I want to express my appreciation for this tutorial. It's incredibly insightful and well-structured.
I'm submitting this PR because I noticed a potential issue related to `GPTConfig.block_size` not being enforced to match the sequence length `T`. If I understand correctly, this discrepancy could lead to unexpected model behavior during inference if `T` is lower than `GPTConfig.block_size`. (Note that an assertion error is already raised when `T` exceeds `GPTConfig.block_size`, as seen here.)

Thank you for considering this change. Please let me know if any further adjustments are needed.
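For reference, here is a minimal sketch (hypothetical names, not the repo's exact forward code) of the shape constraint in question: the position-embedding table is allocated at `block_size`, and the forward pass asserts `T <= block_size` before reading only the first `T` rows, which is why `T > block_size` must be rejected while `T < block_size` passes through.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the model's embedding tables.
block_size, n_embd, vocab_size = 32, 16, 100
wte = nn.Embedding(vocab_size, n_embd)   # token embeddings
wpe = nn.Embedding(block_size, n_embd)   # position embeddings, sized to block_size

def embed(idx):
    B, T = idx.size()
    # The guard that already fires for the T > block_size case.
    assert T <= block_size, f"Cannot forward sequence of length {T}, block size is only {block_size}"
    pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
    return wte(idx) + wpe(pos)  # only the first T position rows are consumed

x = embed(torch.randint(0, vocab_size, (2, 8)))  # T=8 < block_size: accepted
print(x.shape)  # torch.Size([2, 8, 16])
```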