If your concern is whether `cupy.cublas` is considered public API, I'd say the answer is yes, as it has full test coverage. I just don't know why it's not documented; I can ask the team whether it's an oversight or done on purpose.
If you wonder about functionality, both `cupy.cublas.gemm()` and `cupy.matmul()` (and hence the `@` operator) internally call cuBLAS's GEMM routine. I think for the purpose of efficiency (in both memory and time) either is fine, but it's best to profile with some typical problem sizes first.
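For reference, a minimal profiling sketch could look like this (assuming `cupyx.profiler.benchmark` is available in your CuPy version, and the `cupy.cublas.gemm(transa, transb, a, b)` call signature from cupy/cublas.py; the problem size below is just a placeholder):

```python
import cupy as cp
from cupy import cublas
from cupyx.profiler import benchmark

# Placeholder problem size; substitute your typical shapes.
n = 2048
a = cp.random.random((n, n)).astype(cp.float32)
b = cp.random.random((n, n)).astype(cp.float32)

# Both paths end up in cuBLAS GEMM; measure them on your own sizes.
print(benchmark(cp.matmul, (a, b), n_repeat=20))
print(benchmark(cublas.gemm, ('N', 'N', a, b), n_repeat=20))
```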
It's unclear to me what CUB/cuTENSOR accelerations you're referring to; could you elaborate? CUB does not provide GEMM-like routines, and cuTENSOR is only used in reductions (e.g. `sum()`) and contractions (e.g. `einsum()`).
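For context, here is a rough sketch of the routines those accelerators can affect, enabled via the `CUPY_ACCELERATORS` environment variable (whether they actually kick in depends on your installation):

```python
# Run with, e.g.:  CUPY_ACCELERATORS=cub,cutensor python script.py
import cupy as cp

n = 1024
op = cp.random.random((n, n)) + 1j * cp.random.random((n, n))
v = cp.random.random(n) + 1j * cp.random.random(n)

# Reductions like sum() can be routed through CUB, and contractions
# like einsum() through cuTENSOR; GEMM itself always goes to cuBLAS.
expectation = cp.einsum('i,ij,j->', v.conj(), op, v)  # <v|Op|v>
total = op.sum()
print(expectation, total)
```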
Sorry, I was a bit cryptic.
1) Yes, it is strange it is not documented.
2) Yes, I suppose they are doing the same thing, but `cublas.gemm` exposes `transa` and `transb`, so if I need to do `V* @ Op @ V` I can just pass the argument without creating the adjoint copy of `V` (see the sketch below). But I will lose `StridedBatched`.
3) Sorry, I got confused looking at `tensordot_core`. Is there a reason for not having a cuTENSOR `matmul`?
Using the functions from here, I should be able to implement `gemmBatched` if needed. Maybe it would be good to know if it is already on CuPy's roadmap.
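As a concrete (non-authoritative) sketch of the point in 2), assuming the `cupy.cublas.gemm(transa, transb, a, b, out=None, alpha=1.0, beta=0.0)` signature currently in cupy/cublas.py, with `'H'` mapping to cuBLAS's conjugate-transpose op:

```python
import cupy as cp
from cupy import cublas

n = 1024
op = cp.random.random((n, n)) + 1j * cp.random.random((n, n))
v = cp.random.random((n, n)) + 1j * cp.random.random((n, n))

# Explicit adjoint: v.conj() materializes a conjugated copy of v first.
expected = v.conj().T @ op @ v

# With transa='H', cuBLAS applies the conjugate transpose itself,
# so no adjoint copy of v is created on the Python side.
tmp = cublas.gemm('H', 'N', v, op)      # V* @ Op
result = cublas.gemm('N', 'N', tmp, v)  # (V* @ Op) @ V

assert cp.allclose(result, expected)
```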
Oops sorry I dropped the ball!
- Yes, I suppose they are doing the same thing, but `cublas.gemm` exposes `transa` and `transb`, so if I need to do `V* @ Op @ V` I can just pass the argument without creating the adjoint copy of `V`. But I will lose `StridedBatched`. Using the functions from here, I should be able to implement `gemmBatched` if needed. Maybe it would be good to know if it is already on CuPy's roadmap.
I'd say let's give `cupy.matmul()` a try and not spend too much time on the "infrastructural" part (calling `gemmBatched` yourself). `matmul` already handles the `trans` flags internally for you. As long as the input arrays are contiguous, no extra copies should be incurred. Repeating trivial things all over again is just wasting effort, IMHO.
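For the batched case, a quick sketch, assuming (as in current CuPy) that `cupy.matmul` broadcasts over leading batch dimensions and routes such inputs to cuBLAS's strided-batched GEMM internally:

```python
import cupy as cp

batch, n = 64, 256
ops = cp.random.random((batch, n, n))
vs = cp.random.random((batch, n, n))

# Batched V^T @ Op @ V in one call chain: matmul broadcasts over the
# leading batch axis, so there is no need to call gemmBatched by hand.
out = vs.transpose(0, 2, 1) @ ops @ vs

# The transposed operand is just a strided view here, not a copy.
print(out.shape)  # (64, 256, 256)
```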
- Sorry, I got confused looking at `tensordot_core`. Is there a reason for not having a cuTENSOR `matmul`?
Because cuTENSOR is not designed for such operations. It's for tensor contractions, reductions, and elementwise operations. Linear algebra operations are done in cuBLAS (some of whose functionality is backed by CUTLASS).
It looks like this discussion reached a conclusion, and `matmul` and `expect` are now implemented.
This is mainly a question to @leofang. Is it okay to use `gemm` from cupy/cublas.py, or some modification of it, for calculating expectations and matmul? Or is there some better abstraction/function? My hunch is that since `gemm` lets you multiply by the transpose/adjoint without explicitly transposing or adjointing, it will be more time- and memory-efficient, so I would also expect calling `gemm` to have some advantage over a raw `@`. But by using the cuBLAS handle directly we may be losing out on CUB/cuTENSOR accelerations.