Rows can be exchanged without modifying the order of the DOFs. Use this property to form data blocks that minimize the number of registers needed per row of the sparse matrix.
For SSE optimizations it might be advantageous to choose different assembly and sparse matrix approaches depending on the number of DOFs per node and the problem's dimension. Some suggestions for systems based on 32-bit floats:
1D systems: These are banded diagonal systems in which an element shares at most 1 point with another element. The number of DOFs only changes the bandwidth of the diagonal system. These properties should be exploited by a special data structure and solver.
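As a minimal sketch of such a special data structure, assume the simplest case of 1 DOF per node and 2-node elements, which gives a tridiagonal system. The names `TridiagonalSystem` and `solveTridiagonal` are hypothetical, and the solver is the standard Thomas algorithm without pivoting, so it assumes a well-conditioned (e.g. diagonally dominant) matrix. More DOFs per node would need a wider band.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical band storage for a tridiagonal system (1 DOF per node assumed).
struct TridiagonalSystem
{
    std::vector<float> lower; // lower[i] = A(i, i-1), lower[0] is unused
    std::vector<float> diag;  // diag[i]  = A(i, i)
    std::vector<float> upper; // upper[i] = A(i, i+1), upper[n-1] is unused
};

// Thomas algorithm: forward elimination followed by back substitution.
// No pivoting is performed, so all intermediate pivots must be non-zero.
inline std::vector<float> solveTridiagonal(const TridiagonalSystem& A, std::vector<float> rhs)
{
    const std::size_t n = A.diag.size();
    assert(n > 0 && A.lower.size() == n && A.upper.size() == n && rhs.size() == n);

    std::vector<float> c(n); // modified upper diagonal
    c[0] = A.upper[0] / A.diag[0];
    rhs[0] /= A.diag[0];

    for (std::size_t i = 1; i < n; ++i)
    {
        const float m = A.diag[i] - A.lower[i] * c[i - 1];
        c[i] = A.upper[i] / m;
        rhs[i] = (rhs[i] - A.lower[i] * rhs[i - 1]) / m;
    }

    // Back substitution, starting at the second to last row.
    for (std::size_t i = n - 1; i-- > 0;)
        rhs[i] -= c[i] * rhs[i + 1];

    return rhs;
}
```

Only three arrays of length n are stored instead of a general sparse matrix, and the solve is a single forward and backward sweep.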
2 DOFs per node: The data of 2 nodes fits into one 128-bit register. If the global system keeps the DOFs of each node together (x0, y0, x1, y1, x2, y2, etc.), the number of different shuffle operations is minimized.
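A possible sketch of how the interleaved layout avoids shuffles: each node occupies one 64-bit half of a register, so two arbitrary nodes can be loaded and stored with half-register moves. The function names and the flat `dofs` array are assumptions made for illustration, not part of any existing code.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h> // SSE intrinsics (x86 only)

// Loads the DOFs of two nodes into one register: (xA, yA, xB, yB).
// Assumes the global vector is stored interleaved as x0, y0, x1, y1, ...
inline __m128 loadTwoNodes(const float* dofs, std::size_t nodeA, std::size_t nodeB)
{
    __m128 r = _mm_setzero_ps();
    r = _mm_loadl_pi(r, reinterpret_cast<const __m64*>(dofs + 2 * nodeA)); // low half
    r = _mm_loadh_pi(r, reinterpret_cast<const __m64*>(dofs + 2 * nodeB)); // high half
    return r;
}

// Writes (xA, yA, xB, yB) back to the two nodes.
inline void storeTwoNodes(float* dofs, std::size_t nodeA, std::size_t nodeB, __m128 v)
{
    _mm_storel_pi(reinterpret_cast<__m64*>(dofs + 2 * nodeA), v);
    _mm_storeh_pi(reinterpret_cast<__m64*>(dofs + 2 * nodeB), v);
}

// Adds an element contribution (fxA, fyA, fxB, fyB) to the global vector
// with a single vector addition and no shuffle.
inline void addElementContribution(float* dofs, std::size_t nodeA, std::size_t nodeB,
                                   const float contribution[4])
{
    assert(nodeA != nodeB); // identical nodes would overwrite each other on store
    const __m128 sum = _mm_add_ps(loadTwoNodes(dofs, nodeA, nodeB),
                                  _mm_loadu_ps(contribution));
    storeTwoNodes(dofs, nodeA, nodeB, sum);
}
```

If the two nodes are adjacent in memory, the two half loads could be replaced by a single full-register load.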
3 DOFs per node: The best solution seems to be to keep the nodal DOFs together inside one 128-bit lane and to ignore the 4th value during a first assembly step. After the contributions of each element have been applied to the DOFs, the registers are shuffled to remove the zeros and maximize solving efficiency. Every 4 registers are compacted into just 3: (1,2,3,0), (4,5,6,0), (7,8,9,0), (10,11,12,0) ---> (1,2,3,4), (5,6,7,8), (9,10,11,12). This still needs some adjustments if constraints are involved.
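The 4-to-3 compaction could look roughly like the following sketch, which uses only `_mm_shuffle_ps` from plain SSE. The function name `packPaddedTriples` and the array-based interface are hypothetical. Constraints are not handled here.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE intrinsics (x86 only)

// Compacts four padded registers (a0,a1,a2,_), (b0,b1,b2,_), (c0,c1,c2,_), (d0,d1,d2,_)
// into three dense ones: (a0,a1,a2,b0), (b1,b2,c0,c1), (c2,d0,d1,d2).
// 'in' holds 16 floats (every 4th value is padding), 'out' receives 12 floats.
inline void packPaddedTriples(const float* in, float* out)
{
    assert(in != nullptr && out != nullptr);

    const __m128 a = _mm_loadu_ps(in);
    const __m128 b = _mm_loadu_ps(in + 4);
    const __m128 c = _mm_loadu_ps(in + 8);
    const __m128 d = _mm_loadu_ps(in + 12);

    // _mm_shuffle_ps(x, y, _MM_SHUFFLE(i3, i2, i1, i0)) yields (x[i0], x[i1], y[i2], y[i3]).

    // (a2, a2, b0, b0) -> (a0, a1, a2, b0)
    const __m128 a2b0 = _mm_shuffle_ps(a, b, _MM_SHUFFLE(0, 0, 2, 2));
    const __m128 r0 = _mm_shuffle_ps(a, a2b0, _MM_SHUFFLE(2, 0, 1, 0));

    // (b1, b2, c0, c1) needs a single shuffle
    const __m128 r1 = _mm_shuffle_ps(b, c, _MM_SHUFFLE(1, 0, 2, 1));

    // (c2, c2, d0, d0) -> (c2, d0, d1, d2)
    const __m128 c2d0 = _mm_shuffle_ps(c, d, _MM_SHUFFLE(0, 0, 2, 2));
    const __m128 r2 = _mm_shuffle_ps(c2d0, d, _MM_SHUFFLE(2, 1, 2, 0));

    _mm_storeu_ps(out, r0);
    _mm_storeu_ps(out + 4, r1);
    _mm_storeu_ps(out + 8, r2);
}
```

This sketch needs five shuffles per group of four registers. The padding values are simply dropped, so they do not have to be zero at this point.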
** Solver
Idea: When rows need to be exchanged across register boundaries during the pivoting step, move the current pivot element to the top of its register and swap the registers instead of exchanging the exact rows. This won't work in general, since it moves rows that are not yet fully solved (fully solved: 1 on the main diagonal, 0 everywhere else) above the row of the current pivot element. It can only be used if the currently processed row is the first element of a register (so it can still be used as a special case).
During the pivoting step: Check whether it is faster to compare whole registers and use the result to update an index register and a max value register (using a blend function) instead of a serial comparison as in the Mat4 version. Afterwards, the final index must still be determined by serially processing the max value register.
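A rough sketch of the register-wise pivot search, assuming the candidate pivot values are available contiguously in memory and their count is a multiple of 4 (both are assumptions, as is the name `findPivotIndex`). The blend is emulated with and/andnot/or so that only SSE2 is required. With SSE4.1, `_mm_blendv_ps` could be used instead.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h> // SSE2 intrinsics (x86 only)

// Returns the index of the entry with the largest absolute value in col[0..n).
// Assumes n >= 4 and n % 4 == 0. On ties the smallest index wins.
inline std::size_t findPivotIndex(const float* col, std::size_t n)
{
    assert(n >= 4 && n % 4 == 0);

    // Clearing the sign bit gives the absolute value.
    const __m128 absMask = _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff));
    const __m128i step = _mm_set1_epi32(4);

    __m128 maxVal = _mm_and_ps(_mm_loadu_ps(col), absMask); // per-lane maximum
    __m128i curIdx = _mm_setr_epi32(0, 1, 2, 3);            // indices of current values
    __m128i maxIdx = curIdx;                                // per-lane index of maximum

    for (std::size_t i = 4; i < n; i += 4)
    {
        curIdx = _mm_add_epi32(curIdx, step);
        const __m128 val = _mm_and_ps(_mm_loadu_ps(col + i), absMask);
        const __m128 gt = _mm_cmpgt_ps(val, maxVal); // all-ones where val is larger
        const __m128i gtI = _mm_castps_si128(gt);

        // Emulated blend: take the new value/index where gt is set, keep the old otherwise.
        maxVal = _mm_or_ps(_mm_and_ps(gt, val), _mm_andnot_ps(gt, maxVal));
        maxIdx = _mm_or_si128(_mm_and_si128(gtI, curIdx), _mm_andnot_si128(gtI, maxIdx));
    }

    // Final serial step over the 4 lanes of the max value register.
    alignas(16) float vals[4];
    alignas(16) int idx[4];
    _mm_store_ps(vals, maxVal);
    _mm_store_si128(reinterpret_cast<__m128i*>(idx), maxIdx);

    int best = 0;
    for (int lane = 1; lane < 4; ++lane)
    {
        if (vals[lane] > vals[best] || (vals[lane] == vals[best] && idx[lane] < idx[best]))
            best = lane;
    }
    return static_cast<std::size_t>(idx[best]);
}
```

Whether this beats the serial comparison depends on the number of rows. For very short columns the final serial reduction may dominate, so it should be measured.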