I just found that this mask replaces all the zeros in the attention scores cache (the positions past the current token) with a large negative value (DEFAULT_MAX_VALUE). This blows up
interaction_strength = mx.mean(mx.abs(attention_scores), axis=(1, 2, 3))
to inf.
Another note: this mask is not present in the main entropix repo, so I think it's safe to remove it.
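
For anyone who wants to reproduce the effect without the model, here is a minimal sketch. It uses NumPy instead of MLX so it runs anywhere, and it assumes the mask value is a float32 constant close to the float32 maximum (here -0.7 * float32 max, a hypothetical stand-in; the real DEFAULT_MAX_VALUE may differ). The toy tensor shape and the masking rule (zeros replaced by the mask value) are also my assumptions based on the behaviour described above, not the actual code.

```python
import numpy as np

# Assumption: a large negative float32 constant stands in for DEFAULT_MAX_VALUE.
MASK_VALUE = np.float32(-0.7) * np.finfo(np.float32).max

def interaction_strength(attention_scores):
    # NumPy counterpart of mx.mean(mx.abs(attention_scores), axis=(1, 2, 3)).
    # The float32 sum overflows on masked input, so silence the warning here.
    with np.errstate(over="ignore"):
        return np.mean(np.abs(attention_scores), axis=(1, 2, 3))

# Toy scores shaped (batch, heads, query_len, cache_len):
# the first 4 cache slots are filled, the last 4 are still zero.
scores = np.zeros((1, 2, 1, 8), dtype=np.float32)
scores[..., :4] = 0.5

# Mimic the mask: every zero entry becomes the large negative value.
masked = np.where(scores == 0, MASK_VALUE, scores)

unmasked_strength = interaction_strength(scores)
masked_strength = interaction_strength(masked)
print("without mask:", unmasked_strength)
print("with mask:   ", masked_strength)
```

In this sketch the absolute values of two or more masked entries already sum past the float32 maximum, so the mean comes out as inf once the mask is applied, while the unmasked scores give a finite value.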