kuhar closed 1 year ago
This is an initial tiled implementation that could use integer dot product instructions (depending on the driver compiler).
It achieves ~190 GFLOP/s, compared to ~230 with the i8->i32 outer product and ~345 with the i8->f32->i32 outer product.
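
For reference, the dot product formulation can be sketched on the CPU as below. This is only an illustrative sketch, not the shader from this change: the names `dot4_i8` and `matmul_i8_tiled`, the 2x2 tile size, and the transposed RHS layout are all assumptions made for the example. It reduces along K in groups of four i8 values and accumulates into a wider integer, which is the pattern an integer dot product instruction would typically cover.

```python
def dot4_i8(a, b, acc):
    """Emulate a 4-element i8 dot product accumulated into a wider integer.

    Hypothetical helper: stands in for a hardware integer dot product.
    Python ints are unbounded, so i32 overflow is not modeled here.
    """
    assert len(a) == 4 and len(b) == 4
    return acc + sum(x * y for x, y in zip(a, b))


def matmul_i8_tiled(lhs, rhs_t, tile_m=2, tile_n=2):
    """Multiply lhs (M x K) by rhs (K x N), with rhs passed transposed (N x K).

    Assumes K is a multiple of 4 so every reduction step is one dot4_i8 call.
    The tile sizes are arbitrary illustration values.
    """
    m, k, n = len(lhs), len(lhs[0]), len(rhs_t)
    assert k % 4 == 0
    out = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile_m):          # tile rows of the output
        for j0 in range(0, n, tile_n):      # tile columns of the output
            for k0 in range(0, k, 4):       # reduce along K, 4 elements at a time
                for i in range(i0, min(i0 + tile_m, m)):
                    a = lhs[i][k0:k0 + 4]
                    for j in range(j0, min(j0 + tile_n, n)):
                        b = rhs_t[j][k0:k0 + 4]
                        out[i][j] = dot4_i8(a, b, out[i][j])
    return out
```

Passing the RHS transposed keeps each 4-element group contiguous for both operands. An outer product variant would instead keep the K loop outermost within a tile and update every accumulator in the tile from one LHS column and one RHS row.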