helmholtz-analytics / heat

Distributed tensors and Machine Learning framework with GPU and MPI acceleration in Python
https://heat.readthedocs.io/
MIT License

Warning: Process-local linalg 32-bit only #1112

Closed ClaudiaComito closed 1 year ago

ClaudiaComito commented 1 year ago

This is a heads-up for potential future problems and a place to track related developments.

As @mrfh92 has experienced (and reported) while working on his distributed SVD experiments, PyTorch uses 32-bit BLAS libraries.

That is, for the foreseeable future, Heat can perform distributed linear algebra on very large matrices only as long as the process-local slices don't contain more elements than a signed 32-bit integer can represent (2**31 - 1, roughly 2.1 billion).
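To make the constraint concrete, here is a minimal sketch of a per-process check against the signed 32-bit index range. The helper name and the idea of checking the slice shape up front are illustrative assumptions, not part of Heat's API; as discussed below, the actual failure point also depends on the operation.

```python
import math

# Largest value a signed 32-bit integer can represent; PyTorch's bundled
# BLAS/LAPACK routines use 32-bit indices, so this bounds the workload size.
INT32_MAX = 2**31 - 1


def local_slice_within_int32(local_shape):
    """Return True if the number of elements in a process-local slice
    is still indexable with a signed 32-bit integer.

    Hypothetical helper for illustration only.
    """
    return math.prod(local_shape) <= INT32_MAX
```

For example, a process-local slice of shape `(50_000, 50_000)` has 2.5e9 elements and already exceeds the limit, while `(40_000, 40_000)` (1.6e9 elements) does not.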

To be discussed:

mrfh92 commented 1 year ago

I had a further look into this problem: it seems that it is actually not the number of elements in the local array that has to be bounded by the maximum 32-bit integer, but rather the size of the operation's potential workload. Hence, I guess that routines requiring comparatively much memory (e.g. SVD) will fail earlier than those requiring less (e.g. matmul). In particular, printing a warning will be difficult, because failure depends not only on the size of the local arrays but also on the operations we want to perform on them.
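The point that the workload, not the local element count, is what matters can be illustrated with a back-of-the-envelope element count per operation. The formulas below are rough assumptions for illustration (inputs plus outputs plus an assumed LAPACK-style workspace), not PyTorch's actual workspace requirements:

```python
INT32_MAX = 2**31 - 1  # signed 32-bit index limit


def rough_workload_elements(op, m, n):
    """Very rough count of elements `op` touches on an (m, n) local slice.

    Illustrative assumptions only; real workspace sizes differ.
    """
    if op == "matmul":
        # inputs (m, n) and (n, m) plus output (m, m)
        return 2 * m * n + m * m
    if op == "svd":
        # input (m, n), factors U (m, m), S (k,), V (n, n),
        # plus an assumed workspace on the order of the input
        k = min(m, n)
        return m * n + m * m + k + n * n + m * n
    raise ValueError(f"unknown op: {op}")
```

For the same square slice, the SVD estimate is larger than the matmul estimate and therefore crosses `INT32_MAX` at smaller slice sizes, which matches the observation that memory-hungry routines fail earlier.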

So, from my point of view the actions to be taken are:

mrfh92 commented 1 year ago

There is currently no way to circumvent this issue (see the linked PyTorch issue above), and no way to generate a reasonable warning either, since the actual workload size is not well documented.

Therefore, closing via #1109.