Apply per-node/hybrid optimisations

Inter-node (distributed-memory) parallelism is handled with MPI, but there is room for optimisation within a node, i.e. using multi-threading/OpenMP.

One basic idea of mine: mval[phase] is calculated from data at smaller memory addresses, i.e. to calculate mval[i] we need mval[k1]..mval[kn] where k1 < .. < kn < i. If the distance between kn and i is made large (for every i), then neighbouring entries do not depend on each other and can be computed in parallel by multiple threads.
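A minimal sketch of this idea in C with OpenMP, under stated assumptions: every dependency k of mval[i] satisfies k + gap <= i, so all entries inside a block of `gap` consecutive indices are mutually independent. The function name `compute_blocked`, the CSR-style arrays `dep_ptr`/`dep_idx`, the parameters `first` and `gap`, and the plain sum used as the update rule are all hypothetical placeholders, not the actual mpk code.

```c
#include <assert.h>

/* Hypothetical sketch: fill mval[first..n-1], where the dependencies of
 * mval[i] are the indices dep_idx[dep_ptr[i] .. dep_ptr[i+1]-1].
 * Assumption: every dependency k of i satisfies k + gap <= i.
 * Entries mval[0..first-1] are taken as already computed. */
void compute_blocked(double *mval, long n, long first, long gap,
                     const long *dep_ptr, const long *dep_idx)
{
    assert(gap > 0);
    /* Blocks are processed in order; each block only reads entries
     * from earlier blocks, so its iterations can run concurrently. */
    for (long start = first; start < n; start += gap) {
        long end = (start + gap < n) ? start + gap : n;
        #pragma omp parallel for
        for (long i = start; i < end; i++) {
            double sum = 0.0;  /* placeholder update rule: plain sum */
            for (long j = dep_ptr[i]; j < dep_ptr[i + 1]; j++) {
                assert(dep_idx[j] + gap <= i);  /* the distance assumption */
                sum += mval[dep_idx[j]];
            }
            mval[i] = sum;
        }
    }
}
```

Compiled without OpenMP support the pragma is ignored and the loop runs sequentially with the same result. The larger `gap` is, the more work each parallel region gets, which is why making the distance between kn and i large matters.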

Another source of acceleration is of course the material we received from Demmel.

vatai / mpk
