We currently have a very unoptimized host tilizer implementation that is significantly slower than the single-core on-device tilizer. The host and device tilizers do not have exactly the same behaviour: the host tilizer offers more flexibility in configuring rounding schemes and can tilize offline without a device present. This issue enumerates a few items to improve its performance:
multi-threading
CPU intrinsics
removing redundant copies and intermediate conversions to FLOAT32
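
A minimal sketch of how the first and third items could fit together, not the existing implementation. It assumes 32x32 tiles stored tile-major with no sub-tile (face) ordering and a row-major input whose dimensions are multiples of the tile size. `kTile`, `tilize_tile_rows` and `tilize_parallel` are hypothetical names. Each thread writes a disjoint range of tile rows, so no locking is needed, and the copy works on the element type directly instead of converting through FLOAT32.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <thread>
#include <vector>

// Assumed tile edge length; the issue itself does not state it.
constexpr std::size_t kTile = 32;

// Hypothetical helper: copy tile rows [tr_begin, tr_end) from a row-major
// buffer into a tile-major buffer, one contiguous 32-element row at a time.
template <typename T>
void tilize_tile_rows(const T* src, T* dst, std::size_t cols,
                      std::size_t tr_begin, std::size_t tr_end) {
    const std::size_t tiles_per_row = cols / kTile;
    for (std::size_t tr = tr_begin; tr < tr_end; ++tr) {
        for (std::size_t tc = 0; tc < tiles_per_row; ++tc) {
            T* tile = dst + (tr * tiles_per_row + tc) * kTile * kTile;
            for (std::size_t r = 0; r < kTile; ++r) {
                const T* row = src + (tr * kTile + r) * cols + tc * kTile;
                // Contiguous copy in the native element type: no
                // intermediate FLOAT32 buffer is created.
                std::copy(row, row + kTile, tile + r * kTile);
            }
        }
    }
}

// Hypothetical entry point: split the tile rows across num_threads workers.
template <typename T>
std::vector<T> tilize_parallel(const std::vector<T>& src, std::size_t rows,
                               std::size_t cols, unsigned num_threads) {
    if (rows % kTile != 0 || cols % kTile != 0 || src.size() != rows * cols) {
        throw std::invalid_argument("shape must be a multiple of the tile size");
    }
    std::vector<T> dst(src.size());
    const std::size_t tile_rows = rows / kTile;
    // Never spawn more workers than there are tile rows, and at least one.
    const std::size_t n = std::max<std::size_t>(
        1, std::min<std::size_t>(num_threads, tile_rows));
    const std::size_t chunk = (tile_rows + n - 1) / n;

    std::vector<std::thread> workers;
    workers.reserve(n);
    for (std::size_t t = 0; t < n; ++t) {
        const std::size_t begin = std::min(t * chunk, tile_rows);
        const std::size_t end = std::min(begin + chunk, tile_rows);
        // Each worker owns a disjoint output range, so no synchronization
        // is needed beyond the final join.
        workers.emplace_back([&, begin, end] {
            tilize_tile_rows(src.data(), dst.data(), cols, begin, end);
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return dst;
}
```

Calling `tilize_parallel(data, rows, cols, std::thread::hardware_concurrency())` on a `std::vector<std::uint16_t>` of bfloat16 bit patterns would tilize without any format conversion. The inner `std::copy` over a contiguous row is usually lowered to a vectorized memory copy by the compiler. Explicit intrinsics would mainly matter for paths that also convert data formats or apply rounding.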
We already had some fairly well-optimized host-tilizer implementations on the BUDA side that we may be able to leverage.
fyi @davorchap @eyonland @yan-zaretskiy @tt-asaigal