topology-tool-kit / ttk

TTK - Topological Data Analysis and Visualization - Source Code
https://topology-tool-kit.github.io/

[MPI] Distributed global order array computation #994

Closed eve-le-guillou closed 6 months ago

eve-le-guillou commented 7 months ago

Hi all,

In this PR, I implemented another method for computing the global order array in a distributed-memory context.

Here is how the method works: first, a new vector is created, composed of structs with the following attributes: the scalar value of a vertex, its global identifier and the rank of its owner. This vector is sorted using the distributed sort psort (see the file psort.h for more details), which means that the structs move across processes until all the smallest values are located on process 0, then process 1, and so on. The global order of each sorted vertex can then be computed from the local offset of its current process (equal to the number of sorted vertices held by all processes of lower rank) and the local order of the vertex. The global identifiers and global orders of the vertices are then sent back to their owner processes. Finally, the global orders of the ghost vertices are exchanged between processes using the method exchangeGhostVertices.
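For illustration, here is a minimal C++ sketch of this step; the struct layout, names and comparator below are assumptions made for the example, not the exact code of the PR:

// Hypothetical layout of one entry of the vector handed to the distributed sort.
struct SortedVertex {
  double scalarValue; // scalar value of the vertex
  long long globalId; // global identifier of the vertex
  int ownerRank;      // rank of the process owning the vertex
};

// Assumed comparator for the sort: scalar value first, global id as tie-break.
bool operator<(const SortedVertex &a, const SortedVertex &b) {
  return a.scalarValue < b.scalarValue
         || (a.scalarValue == b.scalarValue && a.globalId < b.globalId);
}

// After the distributed sort, process r holds a contiguous block of the
// globally sorted sequence. The global order of the vertex at local position
// i in that block is the local offset of the process (number of sorted
// vertices held by all processes of rank lower than r) plus i.
long long globalOrder(long long localOffset, long long localPosition) {
  return localOffset + localPosition;
}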

This method can become costly in terms of memory when each process owns several hundred million vertices. To ensure the computation can still run, an option ChunkSize was introduced: it forces the computation of the global order and the subsequent communications to be performed on segments of the sorted vertices, which limits memory use. This slows down the execution but ensures the code works with limited memory. ChunkSize is the number of sorted vertices processed per segment, as sketched below.
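A rough sketch of this chunked processing (processChunk is a hypothetical placeholder for the per-segment global order computation and communications, not an actual function of the PR):

#include <algorithm>
#include <cstddef>
#include <functional>

// Process the locally sorted vertices in segments of at most chunkSize
// entries, so that only one segment's worth of buffers is live at a time.
// Assumes chunkSize > 0.
void processInChunks(std::size_t nSortedVertices, std::size_t chunkSize,
                     const std::function<void(std::size_t, std::size_t)> &processChunk) {
  for(std::size_t begin = 0; begin < nSortedVertices; begin += chunkSize) {
    const std::size_t end = std::min(begin + chunkSize, nSortedVertices);
    processChunk(begin, end); // compute global orders for [begin, end) and send them back
  }
}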

A new compilation variable was added: TTK_ENABLE_MPI_RANK_ID_TIME. When it is set to true, the variable storing process ranks in the ArrayPreconditioning algorithm is of type int; otherwise it is of type char.
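In other words, the switch boils down to something like the following (a sketch based on the description above, not the exact code of the PR):

#ifdef TTK_ENABLE_MPI_RANK_ID_TIME
using RankId = int;  // full-width rank identifiers
#else
using RankId = char; // compact rank identifiers to reduce the memory footprint
#endif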

The time performance of this method is good: for the Backpack, Wavelet, Elevation, Isabel and Random datasets (resampled to $512^3$), on 16 processes with 24 threads each, the execution takes less than a second, speeding up the computation by up to a factor of 50, with most of the time spent in the distributed sort, as expected.

When executing on the AT dataset resampled to $2048^3$, this method takes 13.4 seconds on 64 processes with 24 threads each, whereas the previous method took 21 minutes on 32 processes for the same dataset.

Example of use:

There are two ways to trigger a global order computation: from within a TTK filter, using the usual GetOrderArray and setting its argument getGlobalOrder to true, or by calling the ArrayPreconditioning filter within a pipeline, as shown in the snippet below for the Wavelet dataset:

from paraview.simple import *  # ParaView Python API; the TTK plugin must be loaded

data = Wavelet(registrationName="wavelet")

tTKArrayPreconditioning1 = TTKArrayPreconditioning(registrationName='TTKArrayPreconditioning1', Input=data)
tTKArrayPreconditioning1.PointDataArrays = ['RTData']
tTKArrayPreconditioning1.GlobalOrderArray = 1
tTKArrayPreconditioning1.ChunkSize = 1000000000  # process 10^9 sorted vertices per segment

The MPI documentation has also been updated to state which level of thread support is required (instead of just giving its numerical value). Two OpenMP statements were also added to the ScalarFieldSmoother filter for better efficiency.

Thanks for any feedback, Eve

julien-tierny commented 6 months ago

it's all good, thanks!