helmholtz-analytics / heat

Distributed tensors and Machine Learning framework with GPU and MPI acceleration in Python
https://heat.readthedocs.io/
MIT License

Warning: Process-local linalg 32-bit only #1112

Closed ClaudiaComito closed 1 year ago

ClaudiaComito commented 1 year ago

This is a heads-up for potential future problems and a place to track related developments.

As @mrfh92 has experienced (and reported) while working on his distributed SVD experiments, PyTorch uses 32-bit BLAS libraries.

That is, for the foreseeable future, Heat can perform distributed linear algebra on very large matrices only as long as the process-local slices don't contain more elements than a signed 32-bit integer can represent (2**31 - 1, roughly 2.1 billion).
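To make the constraint concrete, here is a minimal sketch of a per-process check against the signed 32-bit index range. The helper name and the idea of checking the slice shape up front are illustrative assumptions, not part of Heat's API; as discussed below, the actual failure point also depends on the operation.

```python
import math

# Largest value a signed 32-bit integer can represent; PyTorch's bundled
# BLAS/LAPACK routines use 32-bit indices, so this bounds the workload size.
INT32_MAX = 2**31 - 1


def local_slice_within_int32(local_shape):
    """Return True if the number of elements in a process-local slice
    is still indexable with a signed 32-bit integer.

    Hypothetical helper for illustration only.
    """
    return math.prod(local_shape) <= INT32_MAX
```

For example, a process-local slice of shape `(50_000, 50_000)` has 2.5e9 elements and already exceeds the limit, while `(40_000, 40_000)` (1.6e9 elements) does not.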

To be discussed:

mrfh92 commented 1 year ago

I had a further look into this problem: it seems that it is actually not the number of elements in the local array that has to be bounded by the maximum 32-bit integer, but rather the size of the operation's potential workload. Hence, I guess that routines requiring comparatively much memory (e.g. SVD) will fail earlier than those requiring less (e.g. matmul). In particular, printing a warning will be difficult, because failure depends not only on the size of the local arrays but also on the operations we want to perform on them.
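The point that the workload, not the local element count, is what matters can be illustrated with a back-of-the-envelope element count per operation. The formulas below are rough assumptions for illustration (inputs plus outputs plus an assumed LAPACK-style workspace), not PyTorch's actual workspace requirements:

```python
INT32_MAX = 2**31 - 1  # signed 32-bit index limit


def rough_workload_elements(op, m, n):
    """Very rough count of elements `op` touches on an (m, n) local slice.

    Illustrative assumptions only; real workspace sizes differ.
    """
    if op == "matmul":
        # inputs (m, n) and (n, m) plus output (m, m)
        return 2 * m * n + m * m
    if op == "svd":
        # input (m, n), factors U (m, m), S (k,), V (n, n),
        # plus an assumed workspace on the order of the input
        k = min(m, n)
        return m * n + m * m + k + n * n + m * n
    raise ValueError(f"unknown op: {op}")
```

For the same square slice, the SVD estimate is larger than the matmul estimate and therefore crosses `INT32_MAX` at smaller slice sizes, which matches the observation that memory-hungry routines fail earlier.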

So, from my point of view the actions to be taken are:

mrfh92 commented 1 year ago

There is currently no way to circumvent this issue (see the linked PyTorch issue above), and no way to generate a reasonable warning either, since the actual workload size is not well documented.

Therefore, closing via #1109.