If your concern is whether `cupy.cublas` is considered public API, I'd say the answer is yes, as it has full test coverage. I just don't know why it's not documented; I can ask the team whether it's an oversight or done on purpose.
If you wonder about functionality, both `cupy.cublas.gemm()` and `cupy.matmul()` (and hence the `@` operator) internally call cuBLAS's GEMM routine. I think for the purpose of efficiency (in both memory and time) either is fine, but it's best to profile with some typical problem sizes first.
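For reference, a minimal profiling sketch could look like this (assuming `cupyx.profiler.benchmark` is available in your CuPy version, and the `cupy.cublas.gemm(transa, transb, a, b)` call signature from cupy/cublas.py; the problem size below is just a placeholder):

```python
import cupy as cp
from cupy import cublas
from cupyx.profiler import benchmark

# Placeholder problem size; substitute your typical shapes.
n = 2048
a = cp.random.random((n, n)).astype(cp.float32)
b = cp.random.random((n, n)).astype(cp.float32)

# Both paths end up in cuBLAS GEMM; measure them on your own sizes.
print(benchmark(cp.matmul, (a, b), n_repeat=20))
print(benchmark(cublas.gemm, ('N', 'N', a, b), n_repeat=20))
```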
It's unclear to me what CUB/cuTENSOR accelerations you're referring to; could you elaborate? CUB does not provide GEMM-like routines, and cuTENSOR is only used in reductions (e.g. `sum()`) and contractions (e.g. `einsum()`).
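For context, here is a rough sketch of the routines those accelerators can affect, enabled via the `CUPY_ACCELERATORS` environment variable (whether they actually kick in depends on your installation):

```python
# Run with, e.g.:  CUPY_ACCELERATORS=cub,cutensor python script.py
import cupy as cp

n = 1024
op = cp.random.random((n, n)) + 1j * cp.random.random((n, n))
v = cp.random.random(n) + 1j * cp.random.random(n)

# Reductions like sum() can be routed through CUB, and contractions
# like einsum() through cuTENSOR; GEMM itself always goes to cuBLAS.
expectation = cp.einsum('i,ij,j->', v.conj(), op, v)  # <v|Op|v>
total = op.sum()
print(expectation, total)
```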
Sorry, I was a bit cryptic.
1) Yes, it is strange it is not documented.
2) Yes, I suppose they are doing the same thing, but `cublas.gemm` exposes `transa` and `transb`, so if I need to do `V* @ Op @ V` I can just pass the argument without creating the adjoint copy of `V` (see the sketch below). But I will lose `StridedBatched`.
3) Sorry, I got confused looking at `tensordot_core`. Is there a reason for not having a cuTENSOR `matmul`?
Using the functions from here, I should be able to implement `gemmBatched` if needed. Maybe it would be good to know if it is already on CuPy's roadmap.
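As a concrete (non-authoritative) sketch of the point in 2), assuming the `cupy.cublas.gemm(transa, transb, a, b, out=None, alpha=1.0, beta=0.0)` signature currently in cupy/cublas.py, with `'H'` mapping to cuBLAS's conjugate-transpose op:

```python
import cupy as cp
from cupy import cublas

n = 1024
op = cp.random.random((n, n)) + 1j * cp.random.random((n, n))
v = cp.random.random((n, n)) + 1j * cp.random.random((n, n))

# Explicit adjoint: v.conj() materializes a conjugated copy of v first.
expected = v.conj().T @ op @ v

# With transa='H', cuBLAS applies the conjugate transpose itself,
# so no adjoint copy of v is created on the Python side.
tmp = cublas.gemm('H', 'N', v, op)      # V* @ Op
result = cublas.gemm('N', 'N', tmp, v)  # (V* @ Op) @ V

assert cp.allclose(result, expected)
```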
Oops sorry I dropped the ball!
- Yes, I suppose they are doing the same thing, but `cublas.gemm` exposes `transa` and `transb`, so if I need to do `V* @ Op @ V` I can just pass the argument without creating the adjoint copy of `V`. But I will lose `StridedBatched`. Using the functions from here, I should be able to implement `gemmBatched` if needed. Maybe it would be good to know if it is already on CuPy's roadmap.
I'd say let's give `cupy.matmul()` a try and not spend too much time on the "infrastructural" part (calling `gemmBatched` yourself). `matmul` already handles the `trans` flags internally for you. As long as the input arrays are contiguous, no extra copies should be incurred. Repeating trivial things all over again is just wasting effort, IMHO.
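For the batched case, a quick sketch, assuming (as in current CuPy) that `cupy.matmul` broadcasts over leading batch dimensions and routes such inputs to cuBLAS's strided-batched GEMM internally:

```python
import cupy as cp

batch, n = 64, 256
ops = cp.random.random((batch, n, n))
vs = cp.random.random((batch, n, n))

# Batched V^T @ Op @ V in one call chain: matmul broadcasts over the
# leading batch axis, so there is no need to call gemmBatched by hand.
out = vs.transpose(0, 2, 1) @ ops @ vs

# The transposed operand is just a strided view here, not a copy.
print(out.shape)  # (64, 256, 256)
```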
- Sorry, I got confused looking at `tensordot_core`. Is there a reason for not having a cuTENSOR `matmul`?
Because cuTENSOR is not designed for such operations. It's for tensor contractions, reductions, and elementwise operations. Linear algebra operations are done in cuBLAS (some of whose functionality is backed by CUTLASS).
It looks like this discussion reached a conclusion, and `matmul` and `expect` are now implemented.
This is mainly a question to @leofang. Is it okay to use `gemm` from cupy/cublas.py, or some modification of it, for calculating expectations and matmul? Or is there some better abstraction/function? My hunch is that since `gemm` lets you multiply by the transpose/adjoint without explicitly transposing or adjointing, it will be more time- and memory-efficient, so I would also expect calling `gemm` to have some advantage over a raw `@`. But by using the cuBLAS handle directly we may be losing out on CUB/cuTENSOR accelerations.