Fix the order of the weights when quantizing: indices [0:7] are reordered to 0 2 4 6 1 3 5 7.
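The new ordering places the even source indices first, then the odd ones. A minimal sketch of the permutation (the helper names are hypothetical, not from the actual code):

```python
# Sketch of the interleaved quantization order: [0:7] -> 0 2 4 6 1 3 5 7.
# Even source indices come first, then odd ones.
def interleave_order(n=8):
    return list(range(0, n, 2)) + list(range(1, n, 2))

def reorder_weights(weights):
    # weights: flat list whose length is a multiple of 8;
    # each group of 8 is permuted into the interleaved order.
    order = interleave_order(8)
    out = []
    for base in range(0, len(weights), 8):
        out.extend(weights[base + i] for i in order)
    return out

print(interleave_order())  # -> [0, 2, 4, 6, 1, 3, 5, 7]
```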
Fix how the reference implementation accesses the weights so it matches the new ordering.
Fix the unit test in test_op and add another small test that checks the reference implementation is consistent with the CUDA kernel.
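The consistency check boils down to packing weights in the interleaved order and verifying that the reference access pattern recovers them. A self-contained sketch of that round trip (pack8/unpack8 are hypothetical stand-ins, assuming eight 4-bit weights packed into one 32-bit word):

```python
# Sketch: pack eight 4-bit weights into one 32-bit word using the
# interleaved order 0 2 4 6 1 3 5 7, then unpack with the reference
# access pattern and check the round trip is lossless.
ORDER = [0, 2, 4, 6, 1, 3, 5, 7]

def pack8(ws):
    # Slot i of the word holds source weight ORDER[i].
    word = 0
    for slot, src in enumerate(ORDER):
        word |= (ws[src] & 0xF) << (4 * slot)
    return word

def unpack8(word):
    # Invert the packing: read slot i back into position ORDER[i].
    out = [0] * 8
    for slot, src in enumerate(ORDER):
        out[src] = (word >> (4 * slot)) & 0xF
    return out

ws = [3, 7, 1, 15, 0, 9, 4, 12]
assert unpack8(pack8(ws)) == ws  # reference unpack matches the packed layout
```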
Testing:
Unit tests against the reference implementation
Unit test output is consistent with the Intel implementation
Known issue:
End-to-end inference is still not working (tested on a server GPU). This is likely a memory-allocation problem: the demo application uses unexpectedly little GPU memory.