Question: Tpetra: Appropriate data structure for block data

ctanis commented 6 years ago

I am getting terrible performance with Experimental::BlockCrsMatrix and MultiVec (BlockCrsVector doesn't compile). What is currently the best way to deal with multiple degrees of freedom per mesh point?

@trilinos/tpetra

Expectations

Experimental::BlockCrsMatrix.apply should be faster in parallel, or I should be doing this in a different way.

Current Behavior

Using MPI (np = 2), BlockCrsMatrix.apply is taking approximately 100 times longer than in serial (still getting the correct answer, I think).

This is applying a block matrix with 1 degree of freedom to a MultiVector (not block) with 1 vector.

Attempting to compile a simple example with BlockVector does not compile. compilation_error.txt

Motivation and Context

I understand these are "experimental" data structures, so I would love to know the most appropriate way to go forward.

Also note: I was using 1 degree of freedom per mesh point to make sure I was solving the problem correctly. My imminent need is to have 4 or more per mesh point. Potentially many more!

I'm going to attempt to add a 'question' label after submitting this. My github interface is somehow missing the label and assignee dropdowns.

mhoemmen commented 6 years ago

@ctanis What version of Trilinos are you using? Also, are you using OpenMP?

mhoemmen commented 6 years ago

@trilinos/tpetra

mhoemmen commented 6 years ago

My github interface is somehow missing the label and assignee dropdowns.

The Trilinos repository restricts assignment of labels and assignees to members.

mhoemmen commented 6 years ago

@ctanis Our experience is that BlockCrsMatrix historically was not super fast but not awful. kokkos-kernels folks are working on it.

kyungjoo-kim commented 6 years ago

@ctanis We have an example code to measure performance of block crs matrix vector multiplication. After you compile Trilinos, you can find out an executable packages/tpetra/core/example/BlockCrs/TpetraCore_BlockCrsPerfTest.exe. Testing the code with

7 point 3D stencil on a 32x32x32 block mesh
my workstation xeon 2x8 intel sandy bridge
OMP_NUM_THREADS=1 (mpi only),
blocksize =5. I have the following result. Could you verify that you also get the same performance as I have seen below ? The code compares the block crs apply method against the point crs apply method.

[kyukim @bread] BlockCrs > mpirun -np 1 ./TpetraCore_BlockCrsPerfTest.exe --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 --num-procs-i=1 --num-procs-j=1 --num-procs-k=1
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
Kokkos::OpenMP thread_pool_topology[ 1 x 1 x 1 ]
Column = 0  Error norm = 0
====================================================================================

                        TimeMonitor results over 1 processor

Timer Name                                               Global time (num calls)
------------------------------------------------------------------------------------
0) LocalGraphConstruction                                0.04422 (1)
1) GlobalGraphConstruction                               0.002842 (1)
2) LocalBlockCrsFill                                     0.3604 (1)
3) BlockCrsMatrix FillComplete - currently do nothing    9.537e-07 (1)
4) MultiVectorFill                                       0.006302 (1)
5) BlockCrs Apply                                        7.918 (100)
6) Conversion from BlockCrs to PointCrs                  0.155 (1)
7) PointCrs Apply                                        10.02 (100)
8) Export MatrixMarket                                   21.14 (1)
X) Global                                                39.71 (1)
====================================================================================
[kyukim @bread] BlockCrs > mpirun -np 2 ./TpetraCore_BlockCrsPerfTest.exe --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 --num-procs-i=2 --num-procs-j=1 --num-procs-k=1 
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set                                                                                                        
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
Kokkos::OpenMP thread_pool_topology[ 1 x 1 x 1 ]
Column = 0  Error norm = 0
==================================================================================================================================

                                              TimeMonitor results over 2 processors

Timer Name                                               MinOverProcs     MeanOverProcs    MaxOverProcs     MeanOverCallCounts
----------------------------------------------------------------------------------------------------------------------------------
0) LocalGraphConstruction                                0.0214 (1)       0.02141 (1)      0.02143 (1)      0.02141 (1)
1) GlobalGraphConstruction                               0.006952 (1)     0.006968 (1)     0.006985 (1)     0.006968 (1)
2) LocalBlockCrsFill                                     0.1801 (1)       0.1804 (1)       0.1807 (1)       0.1804 (1)
3) BlockCrsMatrix FillComplete - currently do nothing    9.537e-07 (1)    1.073e-06 (1)    1.192e-06 (1)    1.073e-06 (1)
4) MultiVectorFill                                       0.003153 (1)     0.003159 (1)     0.003164 (1)     0.003159 (1)
5) BlockCrs Apply                                        4.955 (100)      4.955 (100)      4.955 (100)      0.04955 (100)
6) Conversion from BlockCrs to PointCrs                  0.122 (1)        0.1221 (1)       0.1221 (1)       0.1221 (1)
7) PointCrs Apply                                        5.039 (100)      5.039 (100)      5.039 (100)      0.05039 (100)
8) Export MatrixMarket                                   1.883 (1)        11.92 (1)        21.96 (1)        11.92 (1)
X) Global                                                12.24 (1)        22.28 (1)        32.32 (1)        22.28 (1)
==================================================================================================================================

BTW, if you use --blocksize=1, then I get the following poor performance. I think that the single DOF per block is not an appropriate use case with block crs.

[kyukim @bread] BlockCrs > mpirun -np 2 ./TpetraCore_BlockCrsPerfTest.exe --num-elements-i=32 --num-elements-j=32 --num-elements-k=32 --num-procs-i=2 --num-procs-j=1 --num-procs-k=1 --blocksize=1
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
Column = 0  Error norm = 0
==================================================================================================================================

                                              TimeMonitor results over 2 processors

Timer Name                                               MinOverProcs     MeanOverProcs    MaxOverProcs     MeanOverCallCounts
----------------------------------------------------------------------------------------------------------------------------------
0) LocalGraphConstruction                                0.0214 (1)       0.02147 (1)      0.02154 (1)      0.02147 (1)
1) GlobalGraphConstruction                               0.006705 (1)     0.006769 (1)     0.006833 (1)     0.006769 (1)
2) LocalBlockCrsFill                                     0.0374 (1)       0.0375 (1)       0.0376 (1)       0.0375 (1)
3) BlockCrsMatrix FillComplete - currently do nothing    0 (1)            4.768e-07 (1)    9.537e-07 (1)    4.768e-07 (1)
4) MultiVectorFill                                       0.0006309 (1)    0.000634 (1)     0.0006371 (1)    0.000634 (1)
5) BlockCrs Apply                                        3.347 (100)      3.347 (100)      3.347 (100)      0.03347 (100)
6) Conversion from BlockCrs to PointCrs                  0.01295 (1)      0.01301 (1)      0.01307 (1)      0.01301 (1)
7) PointCrs Apply                                        0.3093 (100)     0.3093 (100)     0.3094 (100)     0.003093 (100)
8) Export MatrixMarket                                   0.2525 (1)       0.6842 (1)       1.116 (1)        0.6842 (1)
X) Global                                                3.995 (1)        4.426 (1)        4.858 (1)        4.426 (1)
==================================================================================================================================

ctanis commented 6 years ago

@mhoemmen I'm using a master head from the weekend.. (commit c6810b7889a4e7e8988c0dac587ccc6307acf7f9).

It's been configured with this script:
do-configure.txt

I am using openmp and cuda at various times (as well as MPI, obviously). Attempting to use 1 MPI rank per compute node, and then OpenMP and Cuda intra-node.

ctanis commented 6 years ago

@kyungjoo-kim I cannot find this perf test executable or source in my clone of the git repository (Trilinos) or in the build directory. Do I need to do something special to build this?

kyungjoo-kim commented 6 years ago

@ctanis You can find out the code here,

https://github.com/trilinos/Trilinos/tree/master/packages/tpetra/core/example/BlockCrs

please set EXAMPLES=ON in your do-configure.txt.

This performance test is develpped to identify performance issues and set a baseline performance when we develop a new block crs data structure and an improved apply method. It simply compares block crs and point crs apply methods.

mhoemmen commented 6 years ago

To clarify: Tpetra_ENABLE_EXAMPLES=ON

mhoemmen commented 6 years ago

@ctanis You set Trilinos_ENABLE_DEBUG:BOOL=ON in the script. This enables a bunch of debug checking that will slow down the code quite a bit.

mhoemmen commented 6 years ago

@ctanis You normally don't need to set that option at all; just setting CMAKE_BUILD_TYPE will automatically set an appropriate default for Trilinos_ENABLE_DEBUG.

ctanis commented 6 years ago

@mhoemmen I have a non-debug build that shows similar timing. I'm going to test the Perf script mentioned above tomorrow when I return to work. did you happen to see the compilation error I got with the BlockVector? Any ideas about that? Presumably I can't use a regular MultiVector when NQ>1 (I got multiple issues here.. thanks for your help!)

mhoemmen commented 6 years ago

@ctanis wrote:

did you happen to see the compilation error I got with the BlockVector?

Yes. The obscureness of the errors suggests to me that you might be trying to use NVCC directly as the compiler, rather than nvcc_wrapper. Please refer to the Tpetra FAQ:

https://github.com/trilinos/Trilinos/blob/master/packages/tpetra/doc/FAQ.txt

ctanis commented 6 years ago

@mhoemmen

Yes. The obscureness of the errors suggests to me that you might be trying to use NVCC directly as the compiler, rather than nvcc_wrapper. Please refer to the Tpetra FAQ:

I'm definitely using nvcc_wrapper, as without any reference to BlockMultiVector everything compiles and runs on the GPU as appropriate. The only issue was the parallel performance of the BlockCrsMatrix.apply()

mhoemmen commented 6 years ago

@ctanis What version of CUDA are you using? Also, you may need to supply a KOKKOS_ARCH CMake option appropriate to your GPU. Otherwise Kokkos may not perform well. See Kokkos documentation for details.

ctanis commented 6 years ago

@mhoemmen This is CUDA 8 on a P100. I had not set KOKKOS_ARCH, so I'm rebuilding Trilinos now. I'll keep you posted. In the meantime, @kyungjoo-kim : my 2 processor version of your performance test is taking about 250 times longer than the 1 processor. see attached. I'll try again when the new version finishes building. blockmult-2.txt blockmult-1.txt

kyungjoo-kim commented 6 years ago

@ctanis Thank you for running the test.

It says that something clearly goes wrong in the cuda version block crs apply method. My test is limited on the openmp space. We need to dig further about this performance issue and it would take some time. Does this block your work ? or how long can you wait for the improvement ?

ctanis commented 6 years ago

@kyungjoo-kim this is pretty important to my work, but as I said I appreciate that the block structures are "experimental". Is there a way to force the linear solve (and underlying BlockCrs apply) to use OpenMP, but allow me to continue to use Cuda for my matrix assembly? I have a deadline for a paper coming up.

I ran this code on a profiler, and the time is dominated by Kokkos::deep_copy and a Cuda fence()-- I was surprised by this, since I thought Kokkos aggressively opted for UVM. Maybe I can force UVM space and see if that helps?

mhoemmen commented 6 years ago

@ctanis Tpetra has an option to let users control whether MPI tries to talk to CUDA allocations.

Could you search for Tpetra_ASSUME_CUDA_AWARE_MPI in your CMakeCache.txt file (in your Trilinos build directory) and tell me whether it's ON or OFF? If it's ON, you can set the following environment variable (case sensitive) to turn it off:

TPETRA_ASSUME_CUDA_AWARE_MPI=OFF

kyungjoo-kim commented 6 years ago

@ctanis In my personal opinion, there is no easy way to use openmp space for linear solve while the tpetra object still lives in cuda uvm space.

The block crs matrix has templarate parameters scalar_type, local_ordinal, global_ordinal, and node_type. So, it might be possible by giving the user provided non default node_type. Probably it does not work in that way.

The other possible way is to construct two object (one for cuda and the ohter for openmp). Then, you can use cuda object for your FE assembly (I believe that this is your core work for publication) and convert the cuda object to the other one with openmp for linear solve to finish in time.

@mhoemmen Am I saying something crazy ? or feasible ?

mhoemmen commented 6 years ago

@kyungjoo-kim That's reasonable.

@ctanis We've heard that the P100 has some issues with CUDA-aware MPI. I would recommend setting the environment variable (see my above comment) to force Tpetra to use host memory for MPI communication.

ctanis commented 6 years ago

@mhoemmen TPETRA_ASSUME_CUDA_AWARE_MPI is OFF in my CmakeCache.txt. I'm going to try creating a second non-cuda block matrix for now. I'm doing a number of other things inefficiently right now, so what's one more! Please let me know if I can provide any more information.

mhoemmen commented 6 years ago

@ctanis Please don't that we don't make any grand promises about BlockCrs performance at the moment. @kyungjoo-kim and his colleagues are working on OpenMP and CUDA performance actively at the moment.

ctanis commented 6 years ago

@mhoemmen Oh, I totally understand! Is there a good way to accomplish this with the non-Block matrices and vectors?

mhoemmen commented 6 years ago

@ctanis You could convert your block matrix to a non-block matrix (plain old Tpetra::CrsMatrix) directly. Just include Tpetra_Experimental_BlockCrsMatrix_Helpers.hpp and use the relevant function (look in Tpetra_Experimental_BlockCrsMatrix_Helpers_decl.hpp for the declaration and documentation).

ctanis commented 6 years ago

@mhoemmen in case I'm missing something, I only see a way to convert from CrsMatrix to BlockCrsMatrix (the other direction) in this file?

mhoemmen commented 6 years ago

@ctanis Oops, sorry, that's right.

ctanis commented 6 years ago

I've modified all the data structures involved in the linear system to be templated on Kokkos::Compat::SerialWrapperNode ( I tried KokkosOpenMPWrapperNode as well). In all cases, the multiple process versions perform worse than the single processor. When I profile, it says that a call to cudaDeviceSynchronize() is dominating the calculation.

Should this have worked?

typedef typename Kokkos::Compat::KokkosSerialWrapperNode LSNode;
//typedef typename ::Kokkos::Compat::KokkosCudaWrapperNode LSNode;

typedef typename Tpetra::Map<LO, GO, LSNode> LSMapType;
typedef typename Tpetra::Operator<Scalar, LO, GO, LSNode> OP;
typedef typename Tpetra::CrsGraph<LO,GO, LSNode> GraphType;
typedef typename Tpetra::BlockCrsMatrix<Scalar, LO, GO, LSNode> MatrixType;
typedef typename Tpetra::MultiVector<Scalar,LO,GO, LSNode> MultiVecType;

mhoemmen commented 6 years ago

What does "modified the data structures to be templated on SerialWrapperNode" mean? Does it just mean that you're using Tpetra with the Serial execution space, instead of the Cuda execution space?

jhux2 commented 6 years ago

in case I'm missing something, I only see a way to convert from CrsMatrix to BlockCrsMatrix (the other direction) in this file?

Fwiw, you could write the BlockCRS matrix to file and read it back in as a point CRS matrix, using the Operator write method.

ctanis commented 6 years ago

@mhoemmen I used the LSNode typedef in my code snippet as the node_type in the declarations of the various linear algebra objects used in the solve. I guess I mean that applied the templates in this way -- sorry for the confusing word choice!

ctanis commented 6 years ago

Sorry for reviving this old thread, but we finished our paper and I'm back trying to get reasonable Tpetra performance with the block structures. Specifically I'm trying to use Cuda to build my system and then (for now) use OpenMP to parallelize Tpetra+Belos. Recall: block matrix-vector multiplication using MPI+Cuda is extraordinarily slow on our system.

Does Tpetra+Belos scale well with MPI+OpenMP? I'm seeing one process using 100% of the CPU, and then many processors that barely register on top (floating around 2%) when I'm running on two MPI processes. On one MPI process, there's one 100% process and then many with 10-15% CPU. I have set all the recommended OMP_PLACES, OMP_BIND, etc.

I'm starting to be suspicious that my Map data structures are not optimal (though I do get all the right answers).

(This is the second version of this comment -- the initial one seemed to be inaccurate after more tests).

mhoemmen commented 6 years ago

@jjellio can help with pinning tips. I do appreciate your patience; we're a bit swamped these days.

jjellio commented 6 years ago

The usual is OMP_PLACES=threads, OMP_PROC_BIND=spread, and OMP_NUM_THREADS The OpenMP things will not help your MPI process mapping though.

Is this with OpenMPI? Or SLURM? (I don't know the mapping for Intel MPI..)

with OpenMPI, you want something like mpirun --map-by ppr::NUMA:pe= Or, mpirun --map-by ppr::socket (you can specify pe= if you want).

For example, on my dual socket, dual NUMA domain system:

mpirun --map-by ppr:1:NUMA --report-bindings hostname
[host:96231] MCW rank 1 bound to socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]], socket 1[core 20[hwt 0-1]], socket 1[core 21[hwt 0-1]], socket 1[core 22[hwt 0-1]], socket 1[core 23[hwt 0-1]], socket 1[core 24[hwt 0-1]], socket 1[core 25[hwt 0-1]], socket 1[core 26[hwt 0-1]], socket 1[core 27[hwt 0-1]]: [../../../../../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
[host:96231] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../../../..]
host
host

The piece you want to look for, is:

[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../../../..]
[../../../../../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]

That is an ASCII visualization of the CPUMASK applied to each process, i.e., which Cores will the MPI process be allowed to run on.

You may notice, I didn't specify an -np, and OpenMPI simply filled things up following the mapping. You can map-by quite a few things. (check the Manpage). The natural ones are socket or NUMA, if you run on Intel Phi with subnuma clustering, the map-by NUMA feature is very nice. For OpenMP on traditional things, map-by socket or core may be better.

If your MPI Run doesn't report bindings correctly, it probably wasn't compiled with hwloc correctly... (this happens alot unfortunately).

You can verify the MPI reported bindings like:

mpirun --map-by ppr:1:NUMA --display-map bash -c 'echo $OMPI_COMM_WORLD_RANK $(cat /proc/$$/status | grep Cpus_allowed_list)'

0 Cpus_allowed_list: 0-13,28-41
1 Cpus_allowed_list: 14-27,42-55

That's some ugly (probably not safe) bash execution, but what it will do is use mpirun to spawn off bash processes. The processes then inspect their own /proc/PID/status, and report the CPU list of themselves.... there are other ways to do it.

There are also a number of OMPI_* env variables, can you see them by mpirun -np 1 env, or maybe in the MPI man pages (somewhere buried most likely).

Compare the output of the CPUMASK stuff to tools like:

numactl -H

hwloc-ls

If you are on a desktop with X11, I highly recommend,

lstopo

(Part of Hwloc), it will print a nice picture showing your machines topology.

ctanis commented 6 years ago

@jjellio Thanks for this.. I'm seeing something a little different here. If I don't use a --bind-to clause, it binds each rank to a single core. Have you seen this? Am I going to have ill effects from using bind-to? I'm going to test the performance myself, but I'm having to rebuild Trilinos with a different version of MPI (You were right.. I was using one with a broken hwloc)

os-hn:~$ mpirun --map-by ppr:1:numa -machinefile /scr/ctanis/kkfe/machine1.txt  bash -c 'echo $OMPI_COMM_WORLD_RANK $(cat /proc/$$/status | grep Cpus_allowed_list)'
0 Cpus_allowed_list: 0
1 Cpus_allowed_list: 1
os-hn:~$ mpirun --map-by ppr:1:numa --bind-to numa -machinefile /scr/ctanis/kkfe/machine1.txt  bash -c 'echo $OMPI_COMM_WORLD_RANK $(cat /proc/$$/status | grep Cpus_allowed_list)'
0 Cpus_allowed_list: 0,2,4,6,8,10,12,14,16,18,20,22,24,26
1 Cpus_allowed_list: 1,3,5,7,9,11,13,15,17,19,21,23,25,27

jjellio commented 6 years ago

Yes, you will have issues.

You don't want your process masks to look like the bind-to NUMA one. That mapping is giving you a weird looking mapping of CPUids. Normally, the cpuids will follow the pattern: core0_ht0, core1_ht0, ....,coreN_ht0, core0_ht1, ...

I can't think of a reason you would want MPI processes sharing a core.

what is the output of hwloc-ls on a single node. That will tell me what the mapping should be. (or numactl -H)

ctanis commented 6 years ago

On this node, the cpu ids are staggered on each NUMA node:

numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26
node 0 size: 65440 MB
node 0 free: 50317 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27
node 1 size: 65536 MB
node 1 free: 59180 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10

jjellio commented 6 years ago

That is very odd. Usually an entire socket will get lumped into a NUMA region (since the CPU effectively has the memory controller), and cpuids are usually numbered by socket. Some AMD chips pack two CPUs (dies) onto a single socket, and the OS sees two memory regions.

I'd be skeptical about numactl on that machine. It could correct. Either way,

set OMP_DISPLAY_ENV=verbose, (and whatever you want for threads and places, proc-bind=spread)

I'd run two processes on that node, and you will get two burps of OMP output at the start. You want to see what OMP_PLACES looks like in that output. With GNU OpenMP, PLACES will get expanded to sets of CPU ids e.g., {0,10}, {1,11}, etc... if you have PLACES=cores, and each core has two hardware threads.

ctanis commented 6 years ago

It looks like I'm distributing threads properly.. I can get 100% CPU utilization across all cores with a dummy program that grinds away with a compute-intensive Kokkos::parallel_for

With the same launch settings, this Tpetra solve reverts to the described situation where one core is pegged, and the others do nothing. As an experimetn, I set OMP_WAIT_POLICY to active, and all the cores are firing at 99% . This, of course, does not speed up the solution, but is it a clue perhaps about the slow solve?

Thanks

mhoemmen commented 6 years ago

@ctanis I should ask the obvious question: What's the Node type? Are you using Tpetra's default Node type? What is it? (Try printing typeid(Tpetra::Map<>::device_type::execution_space).name(), or typeid(Node::device_type::execution_space).name() if you are using a different Node type.

mhoemmen commented 6 years ago

@ctanis You could also try some simpler Tpetra benchmarks, like the Vector benchmark:

https://github.com/trilinos/Trilinos/blob/develop/packages/tpetra/core/example/advanced/Benchmarks/vector.cpp

This should illustrate whether threads are taking any effect, possibly after adjusting the command-line arguments a bit.

jjellio commented 6 years ago

Setting OMP_WAIT_POLICY=active, and seeing all your cores fired up, simply means OpenMP started. The utilization is simply the OpenMP threads busy waiting.

If you have hardware (hyper/SMT/..) threads then 'active' is usually a bad choice, because it keeps all hardware threads on the core at 100%. Then, they fight each other to busy wait. (degrades performance terribly on many core architectures. If you don't have your OpenMP threads bound to multiple HTs on a core, then active may give a small boost to performance)

ctanis commented 6 years ago

@jjellio I suggested it as a hint because when I am not careful with the map-to and bind settings, OMP_WAIT_POLICY=active only lights up one core. I can see all 28 cores peg now, so that suggests that I'm at least launching the MPI job properly.

I am even seeing reasonable thread scaling with a pure OpenMP build now, but when I build a hybrid that uses Kokkos::Cuda for building my linear system and then a Kokkos::OpenMP to solve it, the thread bindings seem to be affected by the presence of Cuda. To answer your question, @mhoemmen , in this case the execution_space.name()'s are N6Kokkos4CudaE and N6Kokkos6OpenMPE respectively (using CudaUVMSpace as a memory space). Could there be a problem using CudaUVMSpace from OpenMP inside Tpetra? I am doing this in my own code and it behaves as expected.

I'm going to try to see how well this works with the traditional (non block) Tpetra structures.

jjellio commented 6 years ago

I don't think this is a binding issue.

I've used code like the following to inspect the objects I get at runtime. You can dump your solvers/datatypes into Teuchos::demangleName(typeid( ).name()) and get a feel for what types you have hanging around. My guess is you have a mismatch. But with mixed Execution spaces I'm not really sure.

  // These are the template parameters used to instantiate some objects
  ParameterList configPL ("Configuration");
  ParameterList& ptypePL = configPL.sublist("Primitive Types", false, "Primitive Type Information");
  ptypePL.set("Scalar", Teuchos::demangleName(typeid(SC).name()));
  ptypePL.set("LocalOrdinal", Teuchos::demangleName(typeid(LO).name()));
  ptypePL.set("GlobalOrdinal", Teuchos::demangleName(typeid(GO).name()));
  ptypePL.set("Node", Teuchos::demangleName(typeid(NO).name()));

  // these are types, e.g. MultiVector<SC,LO,GO,NO> 
  ParameterList& typePL = configPL.sublist("Type Information", false, "Type Information");
  typePL.set("MV", Teuchos::demangleName(typeid(MV).name()));
  typePL.set("Map", Teuchos::demangleName(typeid(map_type).name()));

configPL.print(<some ostream>)

jjellio commented 6 years ago

Another debug option: If you want to eliminate MPI,

(copy one list of CPUs from a numa domain)

cpu_list="0 2 4 6 8 10 12 14 16 18 20 22 24 26"
numa_domain=0
export OMP_NUM_THREADS=$(echo "${cpu_list}" | wc -w)
... set places, proc bind
numactl -C $(echo "${cpu_list}" | tr ' ' ',') --membind=${numa_domain} ./your_binary --your-args ...

That will bind your process to the cpus in the list, and memory bind your process to specific NUMA domain. I doubt this will make a difference. But it could eliminate MPI and NUMA effects as possible problems.

Also, I assume you are setting: export CUDA_LAUNCH_BLOCKING=1 export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1

Kokkos will whine if you don't.

mhoemmen commented 6 years ago

@ctanis wrote:

Could there be a problem using CudaUVMSpace from OpenMP inside Tpetra?

CudaUVMSpace allocations live on the GPU. If you try to access them from multiple CPU threads, you probably won't get the memory bandwidth that you would normally get with Kokkos CPU allocations.

mhoemmen commented 6 years ago

@ctanis wrote:

but when I build a hybrid that uses Kokkos::Cuda for building my linear system and then a Kokkos::OpenMP to solve it

This only makes sense if using a host-only solver, say domain decomposition with a sparse direct subdomain solver.

ctanis commented 6 years ago

@mhoemmen

CudaUVMSpace allocations live on the GPU

I thought they lived in both? In fact I just did an experiment that used OpenMP+HostSpace kernel and it worked on CudaUVMSpace Views without any warnings or problems.

I'm highly suspicious now of the distributed graph / maps I'm using for my matrix. Recall that the performance is amazingly bad when MPI is involved. Do you have any tips for verifying that my distributed objects are optimal? I build a final graph from my own compressed row indices and use that for the BlockCsrMatrix. The number of nonzeros on each row is fixed, etc.

This only makes sense if using a host-only solver, say domain decomposition with a sparse direct subdomain solver.

Yeah the hybrid approach here is only because the MPI performance of the linear solve is so terrible when Cuda is involved.

mhoemmen commented 6 years ago

@ctanis wrote:

Do you have any tips for verifying that my distributed objects are optimal?

How do you build the graph's row Map?

@ctanis wrote:

@mhoemmen wrote:

CudaUVMSpace allocations live on the GPU I thought they lived in both? In fact I just did an experiment that used OpenMP+HostSpace kernel and it worked on CudaUVMSpace Views without any warnings or problems.

CudaUVMSpace allocations live on the GPU, but are accessible from host.

What actually happens is that the CUDA run-time library interacts with virtual memory, to push memory pages back and forth between CPU memory and GPU memory, depending on where they are accessed. It's incorrect (and may cause a segfault and/or incorrect results) to access UVM allocations on the CPU, while a GPU kernel is concurrently writing to those allocations.

ctanis commented 6 years ago

@mhoemmen

How do you build the graph's row Map?

I'm glad you asked this, as I've been revisiting this part of my code today. I am hoping that part of my problem is that I've misunderstood something and used a naive column map in my graph as well. The row map is a contiguous map based on the range of global ids owned by each rank. I used a column map to map local indices to phantom nodes owned by other processes, but I think this is causing a lot more communication than necessary during the matrix-vector multiply.

Logically I have everything that I need on each MPI rank, but the matrix needs to know about vector data on other elements. How should this be done? I couldn't find a good example of this with an unstructured problem.

CudaUVMSpace allocations live on the GPU, but are accessible from host.

How does this work with GPU page faults where UVM can be larger than GPU memory? Your point on simultaneous access is well-taken!

Thanks for all your help.

trilinos / Trilinos