Our code doesn't speed up enough to be competitive with the SciPy solver. There are two problems:
1.- The code doesn't speed up: the problem is the cache memory limitation, which arises because we are using multiprocessing. See the last comment (mine) on this post:
http://stackoverflow.com/questions/29358872/inefficient-multiprocessing-of-numpy-based-calculations
Therefore, the code is limited by cache memory. For large problems, every process (worker) needs its own copy of all the variables, so the memory fills up and the problem is no longer CPU-bound (we get no speed-up at all).
How bad the problem gets seems to depend on the function used (some functions presumably use more memory). gmres seems to worsen the cache memory limitation, whereas the sparse solve does not.
Multithreading would solve this problem: all parallelized steps use the same variables (shared memory) and only read them, so no mutual exclusion is needed and a speed-up should be seen.
2.- The Python code is not fast enough: the slowest parts are SciPy's gmres and the _calculateMatrix function. The gmres code contains an internal Python loop, which could be why it is slow. One solution could be to use compiled code (or maybe these parts really can't be made faster).
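
For point 1, the shared-memory idea could look roughly like the sketch below. It is not our actual code: row_block_product and threaded_matvec are hypothetical names, and a dense matrix-vector product stands in for the real parallelized step. The threads all read the same two arrays, so nothing is copied per worker. A speed-up should only be expected where the heavy work happens inside NumPy calls that release the GIL; a pure Python loop would not benefit.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def row_block_product(matrix, vector, start, stop):
    """Multiply one block of rows by the vector (hypothetical stand-in step).

    The arrays are only read here, so the threads need no locking.
    """
    return matrix[start:stop] @ vector


def threaded_matvec(matrix, vector, n_workers=4):
    """Split the rows into n_workers blocks and process them in threads."""
    n_rows = matrix.shape[0]
    # Block boundaries, e.g. [0, 500, 1000, 1500, 2000] for 2000 rows.
    bounds = np.linspace(0, n_rows, n_workers + 1, dtype=int)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [
            pool.submit(row_block_product, matrix, vector, lo, hi)
            for lo, hi in zip(bounds[:-1], bounds[1:])
        ]
        # Collect the partial results in submission order.
        return np.concatenate([f.result() for f in futures])


rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 500))
x = rng.standard_normal(500)

# Mark the shared arrays read-only so an accidental write raises an error.
A.setflags(write=False)
x.setflags(write=False)

y = threaded_matvec(A, x)
```

With multiprocessing, A and x would have to be copied (or pickled) into every worker; here each thread gets a reference to the same memory.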