qikunxun opened 1 year ago
Unfortunately this is not as simple as changing which `CUSPARSE_SPGEMM_ALG*` flag is used.
Here are a few notes if anyone wants to pick this up:
Comparing the example using the default algorithm to the example using a new one: the preamble has to change, with the new algorithm requiring some additional buffer allocations, calls to the work-estimation API, and calls to the memory-estimation API. Interestingly, you no longer have to call `cusparseSpGEMM_compute` twice, which is nice! A rough sketch of the new call sequence is below.
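For illustration, here is roughly what that preamble looks like, adapted from the memory-optimized SpGEMM sample in CUDALibrarySamples. Error checking is omitted, and the compute type (`CUDA_R_32F`), the `chunk_fraction` value, and the `spgemm_alg3` helper name are my own choices, not anything prescribed by the library:

```c
#include <cuda_runtime.h>
#include <cusparse.h>

/* Computes C = A * B with the memory-optimized ALG3 path (CUDA >= 12).
   matA/matB/matC are pre-created CSR descriptors; as with the default
   algorithm, matC's row/col/val pointers are set after the output size
   is known, via cusparseSpMatGetSize + cusparseCsrSetPointers. */
void spgemm_alg3(cusparseHandle_t handle,
                 cusparseSpMatDescr_t matA,
                 cusparseSpMatDescr_t matB,
                 cusparseSpMatDescr_t matC)
{
    float alpha = 1.0f, beta = 0.0f;
    cusparseOperation_t op = CUSPARSE_OPERATION_NON_TRANSPOSE;
    cusparseSpGEMMDescr_t spgemmDesc;
    cusparseSpGEMM_createDescr(&spgemmDesc);

    /* 1) Work estimation: query the buffer size, then run it for real. */
    size_t bufferSize1 = 0;
    void  *dBuffer1 = NULL;
    cusparseSpGEMM_workEstimation(handle, op, op, &alpha, matA, matB,
                                  &beta, matC, CUDA_R_32F,
                                  CUSPARSE_SPGEMM_ALG3, spgemmDesc,
                                  &bufferSize1, NULL);
    cudaMalloc(&dBuffer1, bufferSize1);
    cusparseSpGEMM_workEstimation(handle, op, op, &alpha, matA, matB,
                                  &beta, matC, CUDA_R_32F,
                                  CUSPARSE_SPGEMM_ALG3, spgemmDesc,
                                  &bufferSize1, dBuffer1);

    /* 2) Memory estimation: the extra step the default algorithm lacks.
       chunk_fraction is the fraction of intermediate products processed
       per chunk; smaller values lower peak memory at some cost in speed. */
    float  chunk_fraction = 0.2f;
    size_t bufferSize2 = 0, bufferSize3 = 0;
    void  *dBuffer2 = NULL, *dBuffer3 = NULL;
    cusparseSpGEMM_estimateMemory(handle, op, op, &alpha, matA, matB,
                                  &beta, matC, CUDA_R_32F,
                                  CUSPARSE_SPGEMM_ALG3, spgemmDesc,
                                  chunk_fraction, &bufferSize3, NULL, NULL);
    cudaMalloc(&dBuffer3, bufferSize3);
    cusparseSpGEMM_estimateMemory(handle, op, op, &alpha, matA, matB,
                                  &beta, matC, CUDA_R_32F,
                                  CUSPARSE_SPGEMM_ALG3, spgemmDesc,
                                  chunk_fraction, &bufferSize3, dBuffer3,
                                  &bufferSize2);
    cudaFree(dBuffer3);                 /* only needed during estimation */
    cudaMalloc(&dBuffer2, bufferSize2);

    /* 3) A single compute call, instead of the two the default
       algorithm requires. */
    cusparseSpGEMM_compute(handle, op, op, &alpha, matA, matB, &beta,
                           matC, CUDA_R_32F, CUSPARSE_SPGEMM_ALG3,
                           spgemmDesc, &bufferSize2, dBuffer2);

    /* ... then query C's size, set its pointers, and call
       cusparseSpGEMM_copy, exactly as in the default-algorithm example. */
    cudaFree(dBuffer1);
    cudaFree(dBuffer2);
    cusparseSpGEMM_destroyDescr(spgemmDesc);
}
```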
The example using the new algorithm specifically uses `CSR_ALG3`, which does not support batched computation and does not support all data types. `CSR_ALG2` is also supposed to alleviate the excessive resource requirements, but without an example or any documentation on what its setup steps need to be, there would be some trial and error to figure out whether that setup is closer to the default or to the `ALG3` option.
While CUDA 11 is within our support envelope, the code path using the current algorithm has to be maintained for those builds, as the newer algorithms are not available there; one way the version split could be gated is sketched below.
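A minimal sketch of a compile-time gate, assuming the usual `CUDART_VERSION` macro (the `SPGEMM_ALG` macro name is hypothetical; a real integration would also need runtime checks for the dtype and batching limitations noted above):

```c
#include <cuda_runtime_api.h>  /* defines CUDART_VERSION */
#include <cusparse.h>

#if defined(CUDART_VERSION) && CUDART_VERSION >= 12000
/* CUDA 12+: the memory-optimized algorithm is available. */
#define SPGEMM_ALG CUSPARSE_SPGEMM_ALG3
#else
/* CUDA 11: keep the existing default-algorithm code path. */
#define SPGEMM_ALG CUSPARSE_SPGEMM_DEFAULT
#endif
```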
It would also be nice to have a performance comparison. I would assume that if the algorithm is more memory-efficient it must sacrifice performance in some way, which would make the heuristic for when to activate it more complicated.
🚀 The feature, motivation and pitch
The SpGEMM algorithm in CUDA 11.x requires a large amount of memory for sparse computation. In CUDA 12, two new SpGEMM algorithms have been introduced to resolve this problem. I really hope the new algorithms can be integrated into PyTorch (providing a way for users to opt into the new algorithms would also be exciting :) ). Thanks. Please see https://github.com/NVIDIA/CUDALibrarySamples/issues/38.
Alternatives
No response
Additional context
No response
cc @alexsamardzic @nikitaved @pearu @cpuhrsch @amjames @bhosmer @ptrblck