The inter-node (distributed-memory) parallelism is handled with MPI, but there is room for optimisation within a node, i.e. using multi-threading/OpenMP.
A basic idea of mine: mval[phase] is calculated from data at smaller memory addresses, i.e. to calculate mval[i] we need mval[k1] .. mval[kn] where k1 < .. < kn < i. So if we make the distance between kn and i large (for every i), the entries that lie within that distance of each other do not depend on one another, and we can compute them with multi-threaded parallelism.
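A minimal sketch of that idea in C with OpenMP, under stated assumptions. The recurrence below (each mval[i] reads mval[i - dist] and mval[i - dist - 1]) is only a stand-in for the real computation, and the names fill_serial, fill_blocked and dist are hypothetical. The point is that when every dependency lies at least dist entries behind i, a block of dist consecutive entries can be filled by independent threads.

```c
#include <assert.h>

/* Stand-in recurrence (an assumption, not the real kernel):
 * mval[i] depends only on entries at least `dist` positions behind i.
 * Entries 0..dist are seeds that the caller fills in beforehand. */
void fill_serial(long *mval, long n, long dist) {
    assert(dist > 0);
    for (long i = dist + 1; i < n; i++)
        mval[i] = mval[i - dist] + mval[i - dist - 1];
}

/* Same result, computed block by block. Inside one block of `dist`
 * entries, every read hits an index below the block start, so the
 * inner loop has no loop-carried dependency and can be shared among
 * threads. The blocks themselves are still processed in order. */
void fill_blocked(long *mval, long n, long dist) {
    assert(dist > 0);
    for (long start = dist + 1; start < n; start += dist) {
        long end = (start + dist < n) ? start + dist : n;
        #pragma omp parallel for
        for (long i = start; i < end; i++)
            mval[i] = mval[i - dist] + mval[i - dist - 1];
    }
}
```

Compiled without OpenMP support the pragma is ignored and fill_blocked runs serially, so the two functions can be compared directly. A larger dist means fewer, bigger parallel regions, which is why increasing the distance between kn and i helps.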
Another source of acceleration is of course the material we received from Demmel.