tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Host Tilizer Optimizations for BFP8/4/2 data-formats #8621

Open cfjchu opened 5 months ago

cfjchu commented 5 months ago

We currently have a very unoptimized host tilizer implementation that is significantly slower than the single-core on-device tilizer. This issue enumerates a few items to improve its performance. The host and device tilizers do not behave identically: the host tilizer offers more flexibility in configuring rounding schemes and can tilize offline without a device present.

  1. multi-threading
  2. CPU intrinsics
  3. removing redundant copies and intermediate conversions to FLOAT32
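
To make items 1 and 3 concrete, here is a minimal sketch, not the tt-metal implementation. It assumes 32x32 tiles stored row-major inside each tile and FLOAT32 input, and it only reorders data: BFP8/4/2 packing and any face ordering within a tile are left out. `tilize_multithreaded` and `kTileDim` are hypothetical names made up for this illustration. Each worker thread takes a contiguous range of tile rows and writes straight into the final buffer, so there is no intermediate copy and no locking.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed tile edge for illustration only (hypothetical constant).
constexpr std::size_t kTileDim = 32;

// Reorders a row-major matrix into tile-major order: tiles are laid out
// left to right, top to bottom, and each tile is stored row-major.
// Tile rows are split across worker threads; every thread writes a
// disjoint slice of `out`, so no synchronization is needed.
std::vector<float> tilize_multithreaded(const std::vector<float>& in,
                                        std::size_t rows, std::size_t cols,
                                        unsigned num_threads) {
    assert(rows % kTileDim == 0 && cols % kTileDim == 0);
    assert(in.size() == rows * cols);
    const std::size_t tile_rows = rows / kTileDim;
    const std::size_t tile_cols = cols / kTileDim;
    std::vector<float> out(in.size());

    auto worker = [&](std::size_t tr_begin, std::size_t tr_end) {
        for (std::size_t tr = tr_begin; tr < tr_end; ++tr) {
            for (std::size_t tc = 0; tc < tile_cols; ++tc) {
                // Destination: start of tile (tr, tc) in the output.
                float* dst = out.data() + (tr * tile_cols + tc) * kTileDim * kTileDim;
                for (std::size_t r = 0; r < kTileDim; ++r) {
                    // Source: one 32-element row segment of the input.
                    const float* src = in.data() + (tr * kTileDim + r) * cols + tc * kTileDim;
                    std::copy(src, src + kTileDim, dst + r * kTileDim);
                }
            }
        }
    };

    num_threads = std::max(1u, num_threads);
    // Ceiling division so every tile row is assigned to some thread.
    const std::size_t chunk = (tile_rows + num_threads - 1) / num_threads;
    std::vector<std::thread> threads;
    for (std::size_t begin = 0; begin < tile_rows; begin += chunk) {
        threads.emplace_back(worker, begin, std::min(begin + chunk, tile_rows));
    }
    for (auto& t : threads) t.join();
    return out;
}
```

Depending on the toolchain this may need `-pthread` at link time. The inner `std::copy` over a contiguous 32-element row is also the natural place to try item 2, since a data-format conversion there could be vectorized with CPU intrinsics, but that part is not shown.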

We already had some pretty optimized host-tilizer implementations on the BUDA side that we may be able to leverage.

fyi @davorchap @eyonland @yan-zaretskiy @tt-asaigal

davorchap commented 5 months ago

this would be great, thanks!