Found a bug in the test where the reference is computed with fp32 while the CUTLASS result is fp16 (`create_executor` seems to accept fp16 np arrays even if the model expects fp32...)
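A minimal sketch of the kind of fix I mean, assuming the usual Relay `create_executor` flow (the function name and variables are just illustrative, not the exact test code):

```python
import numpy as np
import tvm
from tvm import relay

# Illustrative sketch: cast the numpy inputs to fp32 before handing them
# to create_executor, so the reference really runs in the dtype the model
# declares instead of silently consuming fp16 arrays.
def fp32_reference(mod, np_inputs, target="llvm"):
    dev = tvm.device(target, 0)
    fp32_inputs = [np.asarray(x).astype("float32") for x in np_inputs]
    return (
        relay.create_executor("graph", mod=mod, device=dev, target=target)
        .evaluate()(*fp32_inputs)
        .numpy()
    )
```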
I attempted using Relax with Legalize for the reference computation (possible since https://github.com/tlc-pack/relax/pull/425), but it looks like there are non-trivial accuracy differences between fp16 tensorcore and x86 results. I got max and mean absolute diffs of 0.25 and 0.007496, respectively, while the output matches exactly if we compute the reference using Relay + CUDA (which also uses tensorcore).
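For context, the diff numbers above could be measured along these lines (a sketch with illustrative names, not the code committed in the test):

```python
import numpy as np

# Upcast to fp32 before subtracting so the diff itself isn't affected by
# fp16 rounding/overflow.
def abs_diff_stats(out_cutlass, out_ref):
    diff = np.abs(out_cutlass.astype("float32") - out_ref.astype("float32"))
    return diff.max(), diff.mean()

# e.g. max_diff, mean_diff = abs_diff_stats(cutlass_fp16_out, relax_x86_out)
```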
Also, the recent PR https://github.com/tlc-pack/relax/pull/416 seems to have changed how a Relay op is printed, which broke CUTLASS Relay BYOC. Fixed now.
@vinx13