Found a bug in the test where the reference is computed with fp32 while the CUTLASS result is fp16 (`create_executor` seems to accept fp16 np arrays even if the model expects fp32...)
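A minimal sketch of the kind of fix I mean, assuming the usual Relay `create_executor` flow (the function name and variables are just illustrative, not the exact test code):

```python
import numpy as np
import tvm
from tvm import relay

# Illustrative sketch: cast the numpy inputs to fp32 before handing them
# to create_executor, so the reference really runs in the dtype the model
# declares instead of silently consuming fp16 arrays.
def fp32_reference(mod, np_inputs, target="llvm"):
    dev = tvm.device(target, 0)
    fp32_inputs = [np.asarray(x).astype("float32") for x in np_inputs]
    return (
        relay.create_executor("graph", mod=mod, device=dev, target=target)
        .evaluate()(*fp32_inputs)
        .numpy()
    )
```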
I attempted using Relax with Legalize for the reference computation (possible since https://github.com/tlc-pack/relax/pull/425), but it looks like there are non-trivial accuracy differences between fp16 tensorcore and x86 results. I got max and mean absolute diffs of 0.25 and 0.007496, respectively, while the output matches exactly if we compute the reference using Relay + CUDA (which also uses tensorcore).
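For context, the diff numbers above could be measured along these lines (a sketch with illustrative names, not the code committed in the test):

```python
import numpy as np

# Upcast to fp32 before subtracting so the diff itself isn't affected by
# fp16 rounding/overflow.
def abs_diff_stats(out_cutlass, out_ref):
    diff = np.abs(out_cutlass.astype("float32") - out_ref.astype("float32"))
    return diff.max(), diff.mean()

# e.g. max_diff, mean_diff = abs_diff_stats(cutlass_fp16_out, relax_x86_out)
```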
Also, the recent PR https://github.com/tlc-pack/relax/pull/416 seems to have changed how a Relay op is printed, which broke CUTLASS Relay BYOC. Fixed now.
@vinx13