mpip / pfft

Parallel fast Fourier transforms
GNU General Public License v3.0

New many core systems. #6

Open rainwoodman opened 9 years ago

rainwoodman commented 9 years ago

The next generation of Intel processors (Knights Landing) will have something like 70+ cores; mass deployment to major computing facilities will start next year.

It would be a good use case if PFFT could both scale out (across nodes) and scale in (within a node).

mpip commented 9 years ago

You are right. At the moment we are trying to figure out how well the OpenMP support of FFTW works. We have some problems with the scaling of the threaded matrix transposition that is part of the FFTW guru interface. It seems like this step does not scale at all, and we have to find a way to fix it.
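For anyone who wants to reproduce the scaling tests, here is a minimal sketch (not PFFT code, with an arbitrary array size and planner flag) of how FFTW's OpenMP threading is typically switched on:

```c
/* Minimal sketch (not PFFT code): enabling FFTW's OpenMP threading.
 * Build with something like: gcc -fopenmp sketch.c -lfftw3_omp -lfftw3 -lm */
#include <string.h>
#include <fftw3.h>
#include <omp.h>

int main(void)
{
    const int n = 1024;                              /* arbitrary demo size */

    fftw_init_threads();                             /* once, before any plans */
    fftw_plan_with_nthreads(omp_get_max_threads());  /* affects plans created below */

    fftw_complex *data = fftw_alloc_complex((size_t)n * n);
    memset(data, 0, (size_t)n * n * sizeof(fftw_complex));

    fftw_plan p = fftw_plan_dft_2d(n, n, data, data, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p);

    fftw_destroy_plan(p);
    fftw_free(data);
    fftw_cleanup_threads();
    return 0;
}
```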

rainwoodman commented 9 years ago

Could you point me to where this happens? I could only find a few references to the guru interface in kernel/sertrafo.c, but none of them seems to be a transpose.

mpip commented 9 years ago

kernel/sertrafo.c is the right place. The local data transpositions are hidden in the calls to fftw_plan_guru64...; those happen in plan_remap and plan_trafo. The guru interface takes input strides and output strides as parameters, and if they differ, a local transposition happens. A better explanation can be found in the FFTW user manual.
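Roughly, such a transposing plan looks like the following through the guru64 interface. This is only an illustrative sketch based on the rank-0 trick described in the FFTW FAQ, not the exact code in plan_remap/plan_trafo:

```c
/* Illustrative sketch, not PFFT's actual code: transpose a row-major N0 x N1
 * complex array into a row-major N1 x N0 array via a rank-0 guru64 "transform".
 * Since rank is 0, no FFT is computed; the differing input/output strides turn
 * the plan into a strided copy, i.e. a transposition. */
#include <fftw3.h>

static fftw_plan plan_local_transpose(ptrdiff_t N0, ptrdiff_t N1,
                                      fftw_complex *in, fftw_complex *out)
{
    fftw_iodim64 howmany[2];

    howmany[0].n = N0;  howmany[0].is = N1;  howmany[0].os = 1;   /* row index */
    howmany[1].n = N1;  howmany[1].is = 1;   howmany[1].os = N0;  /* column index */

    return fftw_plan_guru64_dft(/* rank */ 0, NULL,
                                /* howmany_rank */ 2, howmany,
                                in, out, FFTW_FORWARD, FFTW_ESTIMATE);
}
```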

LadaF commented 9 years ago

Is there any news about this, or any suggestions? I may have time to do some work on this in my new project.

rainwoodman commented 9 years ago

Hi, it's great you can look into this.

I don't have any news. My impression was that the FFTW parts that are threaded are almost embarrassingly parallel. Last time I checked, the transpose in FFTW did not seem to use threads at all. (Please verify this.)

mpip commented 9 years ago

PFFT uses FFTW for the computation of local FFTs, local memory transpositions (which are necessary to ensure the right memory order for MPI_Alltoall) and global (parallel) memory transpositions. The master branch of PFFT already supports OpenMP builds via the configure option --enable-openmp. This switches on the OpenMP support of FFTW in all steps. However, FFTW gives almost no speedup in the local and global transposition steps. In many cases the communication part is very time consuming. In addition, there are some special features of PFFT that do not have OpenMP support at the moment. For example:

For pruned FFTs, we have to reorder a contiguous array into a strided array. I'm not sure if this can be done efficiently with OpenMP (a rough sketch follows after this list).

For the ghost cell send we have to do the same. In addition, the MPI communication will not benefit from more threads.
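To make the first point concrete, here is a rough sketch (my own illustration with made-up names, not PFFT code) of what an OpenMP version of the contiguous-to-strided reorder could look like; whether it scales is exactly the open question, since it is purely memory bound:

```c
/* Hypothetical sketch, not PFFT code: copy "howmany" contiguous blocks of
 * length n from a packed input into a strided output (stride s >= n),
 * distributing the blocks over OpenMP threads.
 * Build with: gcc -fopenmp -std=c99 -c reorder.c */
#include <stddef.h>
#include <string.h>

void reorder_contiguous_to_strided(const double *in, double *out,
                                   ptrdiff_t howmany, ptrdiff_t n, ptrdiff_t s)
{
    #pragma omp parallel for
    for (ptrdiff_t k = 0; k < howmany; ++k)
        memcpy(out + k * s, in + k * n, (size_t)n * sizeof(double));
}
```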

mpip commented 9 years ago

Does anybody know if a local matrix transposition can benefit from shared memory parallelism at all? I suspect that the memory bandwidth is the main bottleneck and that we cannot gain any speedup from additional computing cores.

mpip commented 9 years ago

The only benefit that we get from more threads is that we have fewer MPI processes at a constant count of computing cores. Therefore, the MPI communication is less fragmented. Correct me if I am wrong.

rainwoodman commented 9 years ago

I do not know whether a local matrix transposition can benefit from shared memory parallelism.

On your second point, I think it makes sense. We are still gaining from a less fragmented communication pattern.

LadaF commented 9 years ago

I think the local transposition can still benefit even if it is memory bound, as modern CPUs have more than one memory bus (and we often have two sockets per node). Anyway, I am testing this now for cases where I do my own transpositions using MPI_Alltoallv instead of PFFT (that happens when I have a sequence of different kinds of DFTs). I will probably get to the 3D PFFT cases in a couple of days or weeks.

LadaF commented 9 years ago

So the non-threaded portions are in transpose_chunks and transpose_toms513 (or whatever size is applicable), at least for the 3D r2r transform I tried. The time is spent mostly in calls to memcpy and MPI_Sendrecv. With just two MPI processes on my workstation these do not take much time, but that will probably change when there are more of them.

Whether it is worth doing anything about them, I am still not sure.

rainwoodman commented 9 years ago

Wow. You have access to great tools.

My experience with memcpy is that if the data chunk is large, threading can make it close to linearly faster. It may not have been a clean experiment, though (i.e., likely not going through memcpy_sse2_unaligned).
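The kind of threaded copy I mean is something like the following sketch (my own illustration, not code from FFTW or PFFT); for large buffers each thread streams its own chunk and the copy can approach the node's aggregate memory bandwidth:

```c
/* Sketch: split one large memcpy into per-thread chunks with OpenMP. */
#include <stddef.h>
#include <string.h>
#include <omp.h>

static void parallel_memcpy(void *dst, const void *src, size_t nbytes)
{
    #pragma omp parallel
    {
        int    nthreads = omp_get_num_threads();
        int    tid      = omp_get_thread_num();
        size_t chunk    = (nbytes + nthreads - 1) / nthreads;
        size_t begin    = (size_t)tid * chunk;

        if (begin < nbytes) {
            size_t len = nbytes - begin < chunk ? nbytes - begin : chunk;
            memcpy((char *)dst + begin, (const char *)src + begin, len);
        }
    }
}
```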

The MPI_Sendrecv may benefit from being replaced by several non-blocking operations if there are many threads per core. I do not have any intuition on that; it depends on the hardware.
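What I have in mind is roughly the following sketch (assumed buffer names and tag, not PFFT code): post the receive and the send as non-blocking operations and wait for both, so that several such exchanges could in principle be in flight at once. Whether they actually overlap depends on the MPI implementation and the hardware:

```c
/* Sketch: a blocking MPI_Sendrecv replaced by Irecv/Isend + Waitall. */
#include <mpi.h>

void exchange_nonblocking(const double *sendbuf, int sendcount, int dest,
                          double *recvbuf, int recvcount, int source,
                          MPI_Comm comm)
{
    MPI_Request req[2];

    MPI_Irecv(recvbuf, recvcount, MPI_DOUBLE, source, 0, comm, &req[0]);
    MPI_Isend(sendbuf, sendcount, MPI_DOUBLE, dest,   0, comm, &req[1]);

    /* further exchanges (or local work) could be posted here */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}
```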

If it is useful, I have access to regular (CentOS 6 / x86_64) machines with 64 cores. But how do I compile the code to produce a log file that you can use?

Is it possible for you to compile and send me the binary to run?

LadaF commented 9 years ago

Hi, the tool is the free Oracle Solaris Studio Performance Analyzer. I like it for its nice graphical output, even when you use different compilers. It is compatible with Open MPI, at least on my openSUSE machine. CentOS might even be among the officially supported systems.

The only issue was that I also had to recompile FFTW with CFLAGS="-ggdb" to get any stack traces from it.

I don't know if sending you the binary will work because of possible incompatibilities in glibc or MPI, but we could give it a try.

One then just calls the collect command from Solaris Studio, e.g. `OMP_NUM_THREADS=8 collect -p 1 mpirun -n 8 ./a.out`.

LadaF commented 9 years ago

What would be useful, once you have the profiler going and FFTW built with -ggdb, would be to try pfft/tests/openmp/simple_check_c2c_omp.c with a larger array size (I used 512) and an appropriate number of MPI processes and OpenMP threads.

For me the compile

mpicc -fopenmp -std=c99 simple_check_c2c_omp.c -lpfft -lpfft_omp -lfftw3 -lfftw3_mpi -lfftw3_omp

and run

OMP_NUM_THREADS=2 collect -p 1 mpirun -n 2 ./a.out -pfft_omp_threads 2 -pfft_runs 1

produced usable input for the analyzer.

The annoying thing about Solaris Studio is that the GUI requires Oracle's Java, but I think you don't need it for the collect command.

The collect command produces a directory named test.n.er, which contains the profiling data and can be examined with the analyzer. There is also a command line tool, er_print, but I don't know how to restrict the data to only the relevant part (disregarding the planning phase is important).

rainwoodman commented 9 years ago

Let me give this a try.

Also, I looked up the MPI specs, and they do not guarantee any concurrency of non-blocking send/recv; it is quite possible they are put into a queue and run one after another. It may be worthwhile to look into replacing MPI_Sendrecv with MPI_Win and friends.
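By "MPI_Win and friends" I mean one-sided communication, roughly like this sketch (illustrative only, with assumed buffer names; a real replacement would need a proper synchronization strategy instead of plain fences):

```c
/* Sketch: one-sided exchange. Every rank exposes its receive buffer in a
 * window; the sender pushes its data with MPI_Put between two fences. */
#include <mpi.h>

void exchange_one_sided(const double *sendbuf, int count, int target,
                        double *recvbuf, MPI_Comm comm)
{
    MPI_Win win;

    /* collective over comm: expose recvbuf for remote access */
    MPI_Win_create(recvbuf, (MPI_Aint)count * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, comm, &win);

    MPI_Win_fence(0, win);
    MPI_Put(sendbuf, count, MPI_DOUBLE, target, 0, count, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);   /* completes all puts targeting this window */

    MPI_Win_free(&win);
}
```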