karpathy / build-nanogpt

Video+code lecture on building nanoGPT from scratch

Ensure Consistency Between GPTConfig.block_size and Sequence Length T #72

Open Benetti-Hub opened 1 month ago

Benetti-Hub commented 1 month ago

First and foremost, I want to express my appreciation for this tutorial. It's incredibly insightful and well-structured.

I'm submitting this PR because I noticed a potential issue related to GPTConfig.block_size not being enforced to match the sequence length T.

If I understand correctly, this discrepancy could lead to unexpected model behavior during inference when T is smaller than GPTConfig.block_size. (Note that an assertion error is already raised when T exceeds GPTConfig.block_size, as seen here.)

Thank you for considering this change. Please let me know if any further adjustments are needed.
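
For concreteness, here is a small standalone sketch of the current behavior (the names `block_size` and `check_len` are illustrative, not taken from the repo): sequences shorter than `block_size` pass through silently, while longer ones trip the existing assertion.

```python
import torch

block_size = 8  # stands in for GPTConfig.block_size

def check_len(idx: torch.Tensor) -> None:
    # mirrors the kind of assertion in the model's forward pass:
    # T may not exceed block_size, since the learned positional
    # embeddings only cover block_size positions
    B, T = idx.size()
    assert T <= block_size, f"sequence length {T} exceeds block_size {block_size}"

check_len(torch.zeros(1, 4, dtype=torch.long))    # T < block_size: allowed today
check_len(torch.zeros(1, 8, dtype=torch.long))    # T == block_size: allowed
# check_len(torch.zeros(1, 9, dtype=torch.long))  # T > block_size: AssertionError
```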

zhaziqwe commented 1 month ago

I have similar doubts about this. I'm not sure what happens to the causal mask after switching to flash attention:

```python
# q @ k^T / sqrt(hs)
att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
# att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
# att = F.softmax(att, dim=-1)
# y = att @ v
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

If that explicit mask (built from `block_size`-sized `self.bias`) is not used, I think T does not need to be strictly equal to block_size.
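
For reference, a small standalone comparison (a sketch, not code from the repo) suggesting that `is_causal=True` derives the causal mask from the actual sequence length T, so it should agree with the commented-out manual path for any T, independent of `block_size`:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, nh, T, hs = 2, 4, 16, 32   # T can be smaller than any block_size
q, k, v = (torch.randn(B, nh, T, hs) for _ in range(3))

# manual path: explicit causal mask built from the actual T, not block_size
att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
att = att.masked_fill(~mask, float('-inf'))
y_manual = F.softmax(att, dim=-1) @ v

# flash-attention path: is_causal=True applies the same T x T causal mask
y_flash = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(y_manual, y_flash, atol=1e-5))  # expected: True (up to numerics)
```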