kokkos / kokkos-kernels

Kokkos C++ Performance Portability Programming Ecosystem: Math Kernels - Provides BLAS, Sparse BLAS and Graph Kernels

Parallelizing pcgsolve with openmp #1154

Open zLi90 opened 2 years ago

zLi90 commented 2 years ago

We are using the provided pcg solver in kokkoskernels for our hydrological model. We found that when we activate openmp, the scaling does not look good.

We simply call run_pcg<Kokkos::OpenMP>(domsub.nCellDomain, statesub, domsub, timers); and in run_pcg, we call:

KokkosKernels::Experimental::Example::pcgsolve(kh, matA, vecB, vecX, cg_iteration_limit, cg_iteration_tolerance, &cg_result, true, clusterSize, useSequential); Kokkos::fence();

The pcgsolve takes about 90% of the computational time and it doesn't scale well when we add threads. But since it is provided by kokkoskernels, it is difficult for us to find out what might be the reason. We were wondering if there are additional steps we are missing that can enhance scaling performance of the pcg solver?

We also want to ask if you have any recommendations on choosing a linear system solver (perhaps also a nonlinear solver) that is compatible with Kokkos and scales well with OpenMP and MPI?

I have attached a figure that shows the speedup of our model (the blue dots) using pcgsolve. Thank you so much in advance!

[Figure: Screen Shot 2021-10-26 at 8 42 24 PM]

srajama1 commented 2 years ago

@jennloe

lucbv commented 2 years ago

Actually a good start would be to re-run this with finer timers, for instance using the Kokkos-tools simple-kernel-timer. You just need to build the tool and then set:

export KOKKOS_PROFILE_LIBRARY=${HOME}/kokkos-tools/profiling/simple-kernel-timer/kp_kernel_timer.so

jennloe commented 2 years ago

I will start looking at this...

jennloe commented 2 years ago

@zLi90 What size problem were you running on to get these timings? How sparse is the matrix? What was the convergence tolerance? How many iterations were you running? And what size of problem do you ultimately want to be able to run?

zLi90 commented 2 years ago

@jennloe We are running a 3D transient subsurface flow model. This test case has 40x24x32=30720 grid cells, resulting in a 7-diagonal matrix. The tolerance is 1e-8 and it takes (more or less) 18 pcg iterations for each time step (we ran a total of 900 seconds with a time step dt=60s). We hope to be able to run large-scale hydrological simulations that contain 1 million grid cells or more. Thanks!
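To put the numbers above in perspective, here is a rough C++ sketch that estimates the size of this test case. It assumes the 7-diagonal matrix comes from a standard 7-point stencil with no periodic wrap-around, stored in CSR with double values and 32-bit indices. None of that is stated in the thread, and the function names are hypothetical helpers, not part of Kokkos Kernels.

```cpp
#include <cstdint>

// Nonzeros of a 7-point stencil on an nx*ny*nz grid.
// Assumption: each cell couples to itself and its six face neighbors,
// and cells on a boundary face simply lack the neighbor outside the grid.
std::int64_t stencil_nnz(std::int64_t nx, std::int64_t ny, std::int64_t nz) {
  const std::int64_t cells = nx * ny * nz;
  // Each of the six boundary faces removes one entry per cell on that face.
  const std::int64_t missing = 2 * (nx * ny + ny * nz + nx * nz);
  return 7 * cells - missing;
}

// Approximate CSR storage in bytes.
// Assumption: 8-byte values, 4-byte column indices, 4-byte row offsets.
std::int64_t csr_bytes(std::int64_t rows, std::int64_t nnz) {
  const std::int64_t value_bytes = 8;
  const std::int64_t index_bytes = 4;
  return nnz * (value_bytes + index_bytes) + (rows + 1) * index_bytes;
}

// Total PCG iterations over a run, assuming a fixed iteration count per
// time step and a time step that divides the simulated time evenly.
std::int64_t total_pcg_iterations(std::int64_t total_seconds,
                                  std::int64_t dt_seconds,
                                  std::int64_t iterations_per_step) {
  return (total_seconds / dt_seconds) * iterations_per_step;
}
```

Under these assumptions, stencil_nnz(40, 24, 32) gives 209024 nonzeros, csr_bytes(30720, 209024) gives about 2.6 MB, and total_pcg_iterations(900, 60, 18) gives 270 iterations for the whole run.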

zLi90 commented 2 years ago

Hi @jennloe , just wondering if you have figured out something? Do you have any suggestions on what we could try next? Thanks!

jennloe commented 2 years ago

I'm sorry; I have not had more time to look at this, and I will not have time in the next few weeks. Thank you for your patience. @brian-kelley would you be able to look into this scaling?

srajama1 commented 2 years ago

I like Luc's comment above. @zLi90 can you enable the finer timers please?

zLi90 commented 2 years ago

@srajama1 @lucbv Thanks for the suggestion! I have tried the simple-kernel-timer but I am not sure how to decode the output file (or perhaps I did something wrong?)

For example, I got an output file with the first few lines like this:

[Binary output, not human-readable. Recognizable strings: Kokkos::Impl::host_space_deepcopy_double, followed by Kokkos::View::initialization entries labeled [Diag for Seq SOR], [RHS], [Vvoid], [cg::Ap], [cg::p], [cg::r], [data], [dgw0], [dgw1], [dgw2], [dh], [dsw0]]

zLi90 commented 2 years ago

Ok I found the reader provided. This is useful indeed! Here are the top 5 time-consuming sections:

with 1 thread:

with 2 threads:

It seems that the last two items (Axpby and dot) don't scale well (from 0.2s with 1 thread to 0.14s with 2 threads)? I believe they belong to the pcgsolve of KokkosKernels.

srajama1 commented 2 years ago

@zLi90 Do you compile with any BLAS enabled? We have implementations of the BLAS routines in Kokkos Kernels. However, one should use a vendor BLAS when available. What is the platform and compiler combination?

zLi90 commented 2 years ago

@srajama1 I am using Mac+gcc. I just use the default settings when building kokkos kernels. Do you mean I should add -DKokkosKernels_ENABLE_TPL_BLAS=ON when building?