nod-ai / sharktank

SHARK Inference Modeling and Serving
Apache License 2.0

[punet] Update quantizer to allow explicit mixed precision rescale. #51

Closed: stellaraccident closed this 1 month ago

stellaraccident commented 1 month ago

Previously, the dtype of the scale dictated the output dtype after dequant. This made it impossible to execute in low precision (e.g. fp16) while preserving the rescale computation in high precision (e.g. fp32). The latter is needed to avoid integer->float overflow after integer arithmetic. (There are other ways to factor this to avoid higher-precision scales, but this one is simple and standard.)
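For illustration, here is a minimal sketch of the overflow this avoids. This is not the sharktank API; `dequant_rescale` and its arguments are made-up names, and the numbers are chosen only to trip fp16's range:

```python
import torch

def dequant_rescale(accum_i32: torch.Tensor,
                    scale: torch.Tensor,
                    out_dtype: torch.dtype = torch.float16) -> torch.Tensor:
    # Old behavior: the scale's dtype dictated the output dtype, so an fp16
    # scale forced the whole rescale into fp16. An int32 accumulator from an
    # int8 matmul routinely exceeds fp16's max (~65504) and overflows to inf
    # before the scale can shrink it.
    # New behavior: do the rescale in fp32, then narrow to the output dtype.
    rescaled = accum_i32.to(torch.float32) * scale.to(torch.float32)
    return rescaled.to(out_dtype)

accum = torch.tensor([1_500_000], dtype=torch.int32)
scale = torch.tensor([1e-4], dtype=torch.float16)

print(accum.to(torch.float16) * scale)  # inf: overflowed before rescaling
print(dequant_rescale(accum, scale))    # ~150.0 in fp16, computed in fp32
```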

Also enables quantized bias, since the higher-precision rescale avoids the overflow, and fixes an unnecessarily large eps.
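For context, the standard way a quantized bias fits into such a pipeline (the exact sharktank scheme may differ; `quantize_bias` is a hypothetical helper) is to quantize it to int32 at scale `input_scale * weight_scale`, so it can be added inside the int32 accumulator and covered by the single fp32 rescale at the end:

```python
import torch

def quantize_bias(bias_fp: torch.Tensor,
                  input_scale: float,
                  weight_scale: torch.Tensor) -> torch.Tensor:
    # With per-channel weight scales, the bias scale is also per-channel.
    bias_scale = input_scale * weight_scale.to(torch.float32)
    # Quantize into the accumulator's domain: accum += q_bias happens in
    # int32, before the fp32 rescale converts everything back to float.
    return torch.round(bias_fp.to(torch.float32) / bias_scale).to(torch.int32)

bias = torch.tensor([0.5, -1.25])
w_scale = torch.tensor([1e-3, 2e-3])
print(quantize_bias(bias, input_scale=1e-2, weight_scale=w_scale))
# tensor([ 50000, -62500], dtype=torch.int32)
```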