qikunxun opened 1 year ago
Unfortunately this is not as simple as changing which `CUSPARSE_SPGEMM_ALG*` flag is used.
Here are a few notes if anyone wants to pick this up:
Comparing the example using the default algorithm to the example using a new one: the preamble has to change, with the new algorithm requiring some additional buffer allocations, calls to the work-estimation API, and calls to the memory-estimation API. Interestingly, you no longer have to call `cusparseSpGEMM_compute` twice, which is nice! A rough sketch of the new call sequence is below.
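For illustration, here is roughly what that preamble looks like, adapted from the memory-optimized SpGEMM sample in CUDALibrarySamples. Error checking is omitted, and the compute type (`CUDA_R_32F`), the `chunk_fraction` value, and the `spgemm_alg3` helper name are my own choices, not anything prescribed by the library:

```c
#include <cuda_runtime.h>
#include <cusparse.h>

/* Computes C = A * B with the memory-optimized ALG3 path (CUDA >= 12).
   matA/matB/matC are pre-created CSR descriptors; as with the default
   algorithm, matC's row/col/val pointers are set after the output size
   is known, via cusparseSpMatGetSize + cusparseCsrSetPointers. */
void spgemm_alg3(cusparseHandle_t handle,
                 cusparseSpMatDescr_t matA,
                 cusparseSpMatDescr_t matB,
                 cusparseSpMatDescr_t matC)
{
    float alpha = 1.0f, beta = 0.0f;
    cusparseOperation_t op = CUSPARSE_OPERATION_NON_TRANSPOSE;
    cusparseSpGEMMDescr_t spgemmDesc;
    cusparseSpGEMM_createDescr(&spgemmDesc);

    /* 1) Work estimation: query the buffer size, then run it for real. */
    size_t bufferSize1 = 0;
    void  *dBuffer1 = NULL;
    cusparseSpGEMM_workEstimation(handle, op, op, &alpha, matA, matB,
                                  &beta, matC, CUDA_R_32F,
                                  CUSPARSE_SPGEMM_ALG3, spgemmDesc,
                                  &bufferSize1, NULL);
    cudaMalloc(&dBuffer1, bufferSize1);
    cusparseSpGEMM_workEstimation(handle, op, op, &alpha, matA, matB,
                                  &beta, matC, CUDA_R_32F,
                                  CUSPARSE_SPGEMM_ALG3, spgemmDesc,
                                  &bufferSize1, dBuffer1);

    /* 2) Memory estimation: the extra step the default algorithm lacks.
       chunk_fraction is the fraction of intermediate products processed
       per chunk; smaller values lower peak memory at some cost in speed. */
    float  chunk_fraction = 0.2f;
    size_t bufferSize2 = 0, bufferSize3 = 0;
    void  *dBuffer2 = NULL, *dBuffer3 = NULL;
    cusparseSpGEMM_estimateMemory(handle, op, op, &alpha, matA, matB,
                                  &beta, matC, CUDA_R_32F,
                                  CUSPARSE_SPGEMM_ALG3, spgemmDesc,
                                  chunk_fraction, &bufferSize3, NULL, NULL);
    cudaMalloc(&dBuffer3, bufferSize3);
    cusparseSpGEMM_estimateMemory(handle, op, op, &alpha, matA, matB,
                                  &beta, matC, CUDA_R_32F,
                                  CUSPARSE_SPGEMM_ALG3, spgemmDesc,
                                  chunk_fraction, &bufferSize3, dBuffer3,
                                  &bufferSize2);
    cudaFree(dBuffer3);                 /* only needed during estimation */
    cudaMalloc(&dBuffer2, bufferSize2);

    /* 3) A single compute call, instead of the two the default
       algorithm requires. */
    cusparseSpGEMM_compute(handle, op, op, &alpha, matA, matB, &beta,
                           matC, CUDA_R_32F, CUSPARSE_SPGEMM_ALG3,
                           spgemmDesc, &bufferSize2, dBuffer2);

    /* ... then query C's size, set its pointers, and call
       cusparseSpGEMM_copy, exactly as in the default-algorithm example. */
    cudaFree(dBuffer1);
    cudaFree(dBuffer2);
    cusparseSpGEMM_destroyDescr(spgemmDesc);
}
```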
The example using the new algorithm specifically uses `CSR_ALG3`, which does not support batched computation and does not support all data types. `CSR_ALG2` is also supposed to alleviate the excessive resource requirements, but without an example or any documentation on what its setup steps need to be, there would be some trial and error to figure out whether that setup is closer to the default or to the `ALG3` option.
While CUDA 11 is within our support envelope, the code path using the current algorithm has to be maintained for those builds, as the newer algorithms are not available there; one way the version split could be gated is sketched below.
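A minimal sketch of a compile-time gate, assuming the usual `CUDART_VERSION` macro (the `SPGEMM_ALG` macro name is hypothetical; a real integration would also need runtime checks for the dtype and batching limitations noted above):

```c
#include <cuda_runtime_api.h>  /* defines CUDART_VERSION */
#include <cusparse.h>

#if defined(CUDART_VERSION) && CUDART_VERSION >= 12000
/* CUDA 12+: the memory-optimized algorithm is available. */
#define SPGEMM_ALG CUSPARSE_SPGEMM_ALG3
#else
/* CUDA 11: keep the existing default-algorithm code path. */
#define SPGEMM_ALG CUSPARSE_SPGEMM_DEFAULT
#endif
```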
It would also be nice to have a performance comparison. I would assume that if the algorithm is more memory-efficient it must sacrifice performance in some way, which would make the heuristic for when to activate it more complicated.
🚀 The feature, motivation and pitch
The SpGEMM algorithm in CUDA 11.x requires a large amount of memory for sparse computation. In CUDA 12, two new SpGEMM algorithms have been introduced to resolve this problem. I really hope the new algorithms can be integrated into PyTorch (providing a way for users to opt into the new algorithms would also be exciting :) ). Thanks. Please see https://github.com/NVIDIA/CUDALibrarySamples/issues/38.
Alternatives
No response
Additional context
No response
cc @alexsamardzic @nikitaved @pearu @cpuhrsch @amjames @bhosmer @ptrblck