Closed: Abhishek-Varma closed this 1 week ago
I have tried this on 64x64x64 bf16 -> f32: it works with correct results - here is the e2e IR.
I have tried this on 128x128x256 bf16 -> f32: there are numerical mismatches in some places while most elements are correct - here is the e2e IR (same for 1024x1024x1024).
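For reference, a minimal numpy sketch (not part of this PR) of the kind of comparison that shows whether the mismatches cluster, e.g. along particular rows, columns, or 64-element blocks, which would hint at a tiling bug rather than uniform precision loss. The .npy file names are placeholders for however the inputs and the device output get dumped:

```python
# Minimal sketch: compare a dumped device result against an f32 reference with
# a tolerance loose enough to absorb bf16 rounding, then look at where the
# surviving mismatches sit. File names are placeholders.
import numpy as np

lhs = np.load("lhs_bf16_as_f32.npy")        # 128x256 bf16 inputs upcast to f32
rhs = np.load("rhs_bf16_as_f32.npy")        # 256x128
device_out = np.load("device_out_f32.npy")  # 128x128 result read back from the AIE

ref = lhs @ rhs

# rtol ~2^-6 is generous for bf16-rounded inputs with f32 accumulation.
mismatch = ~np.isclose(device_out, ref, rtol=2**-6, atol=1e-2)
rows, cols = np.nonzero(mismatch)
print(f"{mismatch.sum()} / {mismatch.size} elements mismatch")
print("rows with mismatches:", np.unique(rows))  # clustering by row/column or
print("cols with mismatches:", np.unique(cols))  # by 64-element blocks suggests a tiling bug
```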
NOTE: For now I have added the e2e test of 64x64x64 in run_matmul_test.sh for ease of adding the tests - it can be ported to run.py later.
I speculated that this might be because of bf16 -> bfp16, but then it should have affected all values, whereas it's happening for only some of them - perhaps some tiling issue?
Also, the mismatched values seem off by quite a bit, which wouldn't be expected from bf16 -> bfp16 alone, so there is definitely something wrong here.
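To put a rough number on that: plain bf16 rounding of the inputs (7 explicit mantissa bits) only perturbs the f32-accumulated result on the order of bf16's ~2^-7 relative precision, so elements that are off by a lot can't be explained by input rounding alone. A minimal sketch that measures this baseline - it only models bf16 truncation of the inputs, not the bfp16 block format itself:

```python
# Minimal sketch: how much error does bf16 rounding of the inputs alone inject
# into a 128x128x256 matmul with f32 accumulation? (This does not model the
# bfp16 block format, just plain bf16 truncation of the inputs.)
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 128, 256, 128

def to_bf16(x):
    # Truncate f32 to bf16 by zeroing the low 16 bits (pessimistic vs. round-to-nearest).
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# Positive inputs avoid cancellation, so relative error stays interpretable.
lhs = rng.random((M, K), dtype=np.float32)
rhs = rng.random((K, N), dtype=np.float32)

exact = lhs @ rhs
rounded = to_bf16(lhs) @ to_bf16(rhs)

rel_err = np.abs(rounded - exact) / np.abs(exact)
print("max relative error from bf16 input rounding:", rel_err.max())
# Typically a fraction of a percent here - far too small to explain values that
# are way off, which points at a functional bug rather than rounding.
```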
For Strix, wouldn't a 'simpler' matmul to test be i8 -> i32, given the complexities around the bf16 -> bfp16 -> bf16 conversion?
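One reason i8 -> i32 is 'simpler' numerically: every product and every partial sum is exact in i32 (worst case 256 * 128 * 128 = 2^22, far inside the i32 range), so the device result can be compared bit-for-bit and any mismatch is unambiguously a functional bug. A minimal sketch of such a reference, with the device readback left as a placeholder:

```python
# Minimal sketch: an i8 x i8 -> i32 matmul reference is exact, so no tolerance
# is needed when comparing against the device result, which removes the
# bf16/bfp16 rounding question from debugging entirely.
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 128, 256, 128

lhs = rng.integers(-128, 128, size=(M, K), dtype=np.int8)
rhs = rng.integers(-128, 128, size=(K, N), dtype=np.int8)

# Worst-case |accumulator| <= 256 * 128 * 128 = 2^22, comfortably within i32.
ref = lhs.astype(np.int32) @ rhs.astype(np.int32)

# device_out = ...  # placeholder for the result read back from the AIE
# assert np.array_equal(device_out, ref), "any mismatch is a functional bug"
print(ref.dtype, ref.shape)
```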
@Abhishek-Varma I can't see anything obviously wrong/incompatible with the ukernel that would cause numerical errors for 128x128 (but pass for 64x64), so I think trying it through peano might be a good idea.
@Abhishek-Varma I'm fine with merging this for the shape that works, after a rebase + fixing the print statements.
Fine by me too. Please add a comment where the 64x64x64 test is created in run.py noting that failure is expected for the 128x128 case, and that we don't yet understand why.
-- This commit replaces the mm.cc ukernel file with two separate files: mm_npu1.cc for Phoenix and mm_npu4.cc for Strix. During xclbin generation, the corresponding ukernel file is therefore compiled based on the target device.
-- It also makes changes to the lower-to-ukernels pass to adapt to the different target devices.

Signed-off-by: Abhishek Varma abhvarma@amd.com