ibm-granite / granite-code-models

Granite Code Models: A Family of Open Foundation Models for Code Intelligence
https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330
Apache License 2.0

Is softmax scaling optional? #6

Closed: turboderp closed this issue 6 months ago

turboderp commented 6 months ago

Can you elaborate on the significance of the softmax scaling? I can't find it referenced in the paper, and it seems to be applied differently for each of the three attention methods in the HF implementation.

Presumably the models are trained with flash-attn, so is this just not actually relevant?

mayank31398 commented 6 months ago

Passing `None` as the scale in SDPA or Flash Attention is the same as 1 / sqrt(d). `scale_attn_weights` is the parameter used to decide whether to use 1 / sqrt(d) (when set) or 1 otherwise. The fp32 arguments are just for stability during training and honestly shouldn't be needed at inference.
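
A minimal sketch of what this means in PyTorch (an illustration only, not the Granite/HF attention code; it assumes PyTorch >= 2.1, where `scaled_dot_product_attention` accepts a `scale` argument):

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, heads, seq_len, head_dim = 1, 4, 8, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# scale=None (the default) makes SDPA use 1 / sqrt(head_dim) internally,
# so it matches passing that value explicitly.
out_default = F.scaled_dot_product_attention(q, k, v)
out_explicit = F.scaled_dot_product_attention(q, k, v, scale=1.0 / math.sqrt(head_dim))
print(torch.allclose(out_default, out_explicit))  # True

# A scale_attn_weights-style switch: 1 / sqrt(d) when enabled, 1.0 otherwise.
scale_attn_weights = True
scale = 1.0 / math.sqrt(head_dim) if scale_attn_weights else 1.0
out = F.scaled_dot_product_attention(q, k, v, scale=scale)

# The fp32 point above would correspond to upcasting the attention scores
# before the softmax in an eager implementation, e.g.:
#   probs = torch.softmax(scores.float(), dim=-1).to(q.dtype)
```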

turboderp commented 6 months ago

Thank you. :+1: