In the MMAv3 case, the number of elements per thread should be independent of the element type; we should only consider kWidth.
TODO: this should also be true for MMAv2, but the logic there is a bit more complicated.
Also enable a larger block_m in the mixed-mode tests to exercise the MMAv3 case.