Previously, the dtype of the scale dictated the output dtype after dequant. This made it impossible to execute in low precision (e.g. fp16) while keeping the rescale computation in high precision (e.g. fp32). The latter is needed to avoid overflow in the integer->float conversion after integer arithmetic. (There are other ways to factor this that avoid higher-precision scales, but this is simple/standard.)
This also enables quantized bias, since it avoids the overflow, and fixes an unnecessarily large eps.
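A minimal NumPy sketch of the overflow described above, not the code in this change. The function names `dequant_scale_dtype` and `dequant_decoupled` are hypothetical, and the "old" path is only one reading of "the dtype of the scale dictated the output dtype".

```python
import numpy as np

def dequant_scale_dtype(acc, scale):
    # Assumed old behavior: the scale's dtype decides everything, so an
    # fp16 scale forces the int -> float conversion to happen in fp16.
    with np.errstate(all="ignore"):  # silence the expected overflow warning
        return acc.astype(scale.dtype) * scale

def dequant_decoupled(acc, scale, out_dtype):
    # Sketch of the decoupled form: rescale in fp32, then cast the
    # (much smaller) result down to the requested output dtype.
    rescaled = acc.astype(np.float32) * scale.astype(np.float32)
    return rescaled.astype(out_dtype)

# An int32 accumulator that exceeds the fp16 maximum (65504).
acc = np.array([131072], dtype=np.int32)

# Old path: fp16 scale, so the accumulator overflows to inf before scaling.
old = dequant_scale_dtype(acc, np.array([2.0 ** -10], dtype=np.float16))

# New path: fp32 scale, fp16 output; the result stays finite.
new = dequant_decoupled(acc, np.array([2.0 ** -10], dtype=np.float32), np.float16)

print(old, old.dtype)
print(new, new.dtype)
```

The output is fp16 in both cases. The difference is where the integer->float conversion happens: in fp32 the accumulator is representable, and only the already-rescaled value is narrowed to fp16.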