salesforce / CodeGen

CodeGen is a family of open-source models for program synthesis. Trained on TPU-v4. Competitive with OpenAI Codex.
Apache License 2.0

Mismatch in attention weights for causal masked tokens vs attention masked tokens #49

Open LakshyAAAgrawal opened 1 year ago

LakshyAAAgrawal commented 1 year ago

Attention scores for tokens that are masked out using attention_mask get a value of -1e4, as per https://github.com/salesforce/CodeGen/blob/main/jaxformer/hf/codegen/modeling_codegen.py#L439, whereas attention scores masked out using causal_mask get a value of -1*1e9. As a result, the pre-softmax attention scores differ between causally masked tokens and padded tokens. This causes inference outputs for individual sequences to differ from the outputs for the same sequences when they are run in a padded batch.
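To make the gap concrete, here is a minimal pure-Python sketch of the two masking values applied to one row of attention scores. It is not the code from modeling_codegen.py: the helper names (`mask_scores`, `softmax`), the raw scores, and the assumption that the causal mask replaces a score while the padding mask is added to it are all illustrative. It only shows that the two kinds of masked positions end up with different pre-softmax values, and that in a row where every key is masked, the softmax weight goes entirely to the padded keys.

```python
import math

# Values taken from the issue text; everything else here is hypothetical.
PAD_MASK_VALUE = -1e4     # added to scores of keys masked via attention_mask
CAUSAL_MASK_VALUE = -1e9  # assigned to scores of keys masked via causal_mask


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mask_scores(raw, is_pad, is_future):
    """Apply both masks to one row of raw attention scores.

    Assumption: future keys have their score replaced by CAUSAL_MASK_VALUE,
    and padded keys have PAD_MASK_VALUE added to their score.
    """
    out = []
    for score, pad, future in zip(raw, is_pad, is_future):
        if future:
            score = CAUSAL_MASK_VALUE
        if pad:
            score = score + PAD_MASK_VALUE
        out.append(score)
    return out


# One query row in a left-padded sequence: keys 0 and 1 are padding,
# keys 2 and 3 lie in the future, so every key is masked one way or the other.
raw = [0.5, 1.0, 2.0, 3.0]
is_pad = [True, True, False, False]
is_future = [False, False, True, True]

masked = mask_scores(raw, is_pad, is_future)
weights = softmax(masked)

print("pre-softmax scores:", masked)
print("attention weights: ", weights)
```

In this toy row the padded keys sit near -1e4 and the future keys at -1e9, so the softmax puts all of its weight on the padded keys and none on the future ones. Using a single mask value for both cases would remove that asymmetry. Whether this is the exact mechanism behind the mismatch reported here would need to be checked against the real model.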