mywoodstock closed this 2 weeks ago
Since the fold operation essentially comes down to `permute` and `reshape` operations, we will instead look into optimizing `permute` on device and not use the fold op. I will try out the reshape/permute combination in the ResNet model to make sure it is functional.
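
For reference, a minimal host-side sketch of how a fold can be expressed with only reshape and permute. This is an illustration in NumPy, not the device implementation. It assumes that fold takes an NHWC tensor and moves each `stride_h x stride_w` spatial block into the channel dimension; the function name and argument names are hypothetical.

```python
import numpy as np


def fold_via_reshape_permute(x, stride_h, stride_w):
    """Fold an NHWC tensor using only reshape and permute (transpose).

    Assumed semantics: (N, H, W, C) -> (N, H/stride_h, W/stride_w,
    stride_h * stride_w * C). H and W must be divisible by the strides.
    """
    n, h, w, c = x.shape
    assert h % stride_h == 0 and w % stride_w == 0

    # Split H and W into (outer, inner) pairs.
    x = x.reshape(n, h // stride_h, stride_h, w // stride_w, stride_w, c)
    # Bring the two inner (within-block) axes next to the channel axis.
    x = x.transpose(0, 1, 3, 2, 4, 5)
    # Merge the block axes and the channel axis into one dimension.
    return x.reshape(n, h // stride_h, w // stride_w, stride_h * stride_w * c)


if __name__ == "__main__":
    x = np.arange(2 * 4 * 4 * 3).reshape(2, 4, 4, 3)
    out = fold_via_reshape_permute(x, 2, 2)
    print(out.shape)
```

With this layout the only data movement is the single permute in the middle; the two reshapes only reinterpret the shape of a contiguous buffer, which is why optimizing permute covers the fold use case.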
cc: @davorchap
Currently the fold op does not work on WH. We need to get it working so it can be used in RN50. Current perf on GS is 450 ns for half fold (from the unit test).