karpathy / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.
MIT License

Why is there no mask when using flash attention? #451

Closed · bruce2233 closed this 2 months ago

bruce2233 commented 2 months ago

https://github.com/karpathy/nanoGPT/blob/325be85d9be8c81b436728a420e85796c57dba7e/model.py#L61-L71

No explicit attention mask is passed in the flash-attention path linked above. If there is no mask, is GPT still a decoder-only transformer?
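
For reference, the call at those lines looks roughly like this (a paraphrase, not a verbatim copy of model.py; tensor shapes are just illustrative and dropout is omitted):

```python
import torch
import torch.nn.functional as F

B, n_head, T, head_dim = 2, 4, 8, 16          # illustrative sizes
q, k, v = (torch.randn(B, n_head, T, head_dim) for _ in range(3))

# Flash path: no explicit attn_mask is passed, only is_causal=True.
y = F.scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=True)
```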

muerghq commented 2 months ago

is_causal=True applies the causal mask. The attn_mask argument lets you pass a user-defined attention mask instead. You use one or the other; passing both raises an error.

What this repo implements is a decoder-only transformer with causal masking. If no mask were applied at all, it would behave like an encoder-only (bidirectional) transformer.
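
A minimal sketch of that point (assumes PyTorch >= 2.0; names are illustrative): is_causal=True gives the same result as passing an explicit lower-triangular mask, which is why the flash path doesn't need attn_mask.

```python
import torch
import torch.nn.functional as F

B, H, T, D = 2, 4, 8, 16                       # batch, heads, seq len, head dim
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))

# Path 1: let the kernel build the causal mask internally.
y_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Path 2: pass an explicit boolean mask (True = may attend, False = masked).
causal_mask = torch.tril(torch.ones(T, T)).bool()
y_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=causal_mask)

print(torch.allclose(y_causal, y_masked, atol=1e-5))  # expected: True
```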

bruce2233 commented 2 months ago

> is_causal=True applies the causal mask. The attn_mask argument lets you pass a user-defined attention mask instead. You use one or the other; passing both raises an error.
>
> What this repo implements is a decoder-only transformer with causal masking. If no mask were applied at all, it would behave like an encoder-only (bidirectional) transformer.

Thanks, I get it. https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html#
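
And for the "error if both are set" part of the answer above, a quick check (the exact exception message may vary by PyTorch version):

```python
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 1, 4, 8)
mask = torch.tril(torch.ones(4, 4)).bool()

try:
    # Passing an explicit attn_mask together with is_causal=True is documented
    # to raise an error.
    F.scaled_dot_product_attention(q, k, v, attn_mask=mask, is_causal=True)
except Exception as e:
    print(type(e).__name__, e)
```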
