Attention scores for tokens masked out via attention_mask get a value of -1e4 (see https://github.com/salesforce/CodeGen/blob/main/jaxformer/hf/codegen/modeling_codegen.py#L439), whereas scores masked out via causal_mask get a value of -1e9. The pre-softmax attention scores for causally masked tokens and padded tokens therefore differ, and as a result, running a sequence on its own can produce different inference outputs than running the same sequence inside a padded batch.
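A minimal, self-contained sketch (not the CodeGen code itself, and with made-up score values) of the two masking paths, showing that the same key position ends up with a different pre-softmax score depending on which mask zeroed it out:

```python
import torch

# One row of hypothetical pre-softmax attention scores over four key positions.
scores = torch.tensor([3.0, 1.0, 0.5, 0.5])

CAUSAL_FILL = -1e9   # value substituted via causal_mask
PADDING_FILL = -1e4  # value added via attention_mask (modeling_codegen.py#L439)

# Causal path: the score is replaced outright, as in
# torch.where(causal_mask, attn_weights, masked_bias).
causal_masked = scores.clone()
causal_masked[3] = CAUSAL_FILL

# Padding path: the (already rescaled) mask value is *added* to the score,
# as in attn_weights + attention_mask.
padding_masked = scores.clone()
padding_masked[3] = padding_masked[3] + PADDING_FILL

print(causal_masked[3].item(), padding_masked[3].item())  # -1e9 vs -9999.5
print(torch.softmax(causal_masked, dim=-1))
print(torch.softmax(padding_masked, dim=-1))
```

In plain fp32 with small scores, both fills underflow to ~0 after the softmax, but the two paths are not guaranteed to behave identically once score magnitudes grow or reduced precision (e.g. fp16, where -1e9 is not representable but -1e4 is) comes into play, which is one way a padded batch can fail to reproduce the single-sequence result.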