ternaus / quest-qmc

Automatically exported from code.google.com/p/quest-qmc

Usage for OpenMP parallel #59

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Three steps to enable OpenMP parallel code:

1. Check the number of cores in your machine:
$lscpu

2. Export the number of threads for OpenMP usage
$export OMP_NUM_THREADS=X

3. Run ggeom with "-p" to turn the parallel TDM (time-dependent measurement) on
$./ggeom in -p
-------------------------------------------------------------------------------
For example, on my machine,
(1) I type
$lscpu
to get CPU information like:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
CPU(s):                12
Thread(s) per core:    1
Core(s) per socket:    6
CPU socket(s):         2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Stepping:              2
CPU MHz:               1600.000
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K

(2) From the "lscpu" output I know there are 12 cores in this machine
(2 sockets x 6 cores per socket, 1 thread per core).
Say I want to run the program with 8 threads (less than or equal to the number
of cores); I type:
$export OMP_NUM_THREADS=8

(3) Steps 1 and 2 only set up the OpenMP parallel environment.
To use the parallel code, run ggeom as
$./ggeom in -p
The additional "-p" turns the parallel implementation on.
Note that "-p" must come after "in"; commands like:
$./ggeom -p in
or
$./ggeom -in
or
$./ggeom -in -x
will run the program as the original serial version.
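To make the ordering rule concrete, here is a hypothetical sketch, not the
actual ggeom source, of positional parsing that behaves this way: the first
argument is always taken as the input file, and only the remaining arguments
are scanned for flags.

----
! Hypothetical sketch, not the actual ggeom code: illustrates why
! "-p" must come after the input file argument.
program args_sketch
  implicit none
  character(len=256) :: arg
  logical :: parallel
  integer :: i, nargs
  parallel = .false.
  nargs = command_argument_count()
  ! The first argument is always interpreted as the input file,
  ! so "./ggeom -p in" would treat "-p" as the file name.
  call get_command_argument(1, arg)
  print *, 'input file: ', trim(arg)
  ! Only the remaining arguments are scanned for the "-p" flag.
  do i = 2, nargs
     call get_command_argument(i, arg)
     if (trim(arg) == '-p') parallel = .true.
  end do
  print *, 'parallel TDM enabled: ', parallel
end program args_sketch
----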

By the way, the memory problem with OpenBLAS is still unsolved.
I am thinking about switching back to the previous BLAS/LAPACK, or to MKL if
possible.

FYI.

Original issue reported on code.google.com by cmji...@ucdavis.edu on 28 Oct 2014 at 10:04

GoogleCodeExporter commented 9 years ago
Can we make steps 1 and 2 automatic?

Original comment by iglovi...@gmail.com on 29 Oct 2014 at 5:43

GoogleCodeExporter commented 9 years ago
Yes, I have modified the code.
Now you only need one step: add "-p" to the command to turn OpenMP
multithreading on.
e.g. $./ggeom -in -p
The program will check the number of cores on your machine and set the number
of threads to that maximum.
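For illustration, a minimal sketch of how the maximum can be detected and
applied with the standard OpenMP runtime routines (the wrapper name is
illustrative, not the actual QUEST code):

----
! Illustrative sketch: detect the available cores and use them all.
subroutine set_max_threads()
  use omp_lib
  implicit none
  integer :: ncores
  ncores = omp_get_num_procs()      ! number of available processors
  call omp_set_num_threads(ncores)  ! run OpenMP regions on all of them
end subroutine set_max_threads
----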

Original comment by cmji...@ucdavis.edu on 29 Oct 2014 at 7:15

GoogleCodeExporter commented 9 years ago
[1] Makefile is broken.

----
make
Makefile:62: *** extraneous `endif'.  Stop.
----

[2] What is the memory problem in OpenBLAS?

I am against switching to BLAS/LAPACK, first of all because it is 3 times
slower: https://code.google.com/p/quest-qmc/wiki/Benchmark

About multithreading... It is nice to have in theory, but we do not really use
it here in the Physics department. We could start using it, but it would make
data generation slower, not faster as one might expect. In 99% of the cases we
need data for different sets of input parameters. The simulation for each set
can be started in parallel using operating-system/grid mechanisms, which is
much more efficient than using MPI or OpenMP.

So, in my understanding, switching back to BLAS/LAPACK is a bad idea.

Original comment by iglovi...@gmail.com on 29 Oct 2014 at 7:58

GoogleCodeExporter commented 9 years ago
[1] The bug in Makefile is fixed.

[2] The memory problem with OpenBLAS is that when you increase the number of
threads used to execute ggeom, it aborts with the error:
BLAS : Program is Terminated. Because you tried to allocate too many memory 
regions.
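From what I have read, this message appears when more threads call into
OpenBLAS than the buffer pool it was compiled for (its NUM_THREADS build
option); the workarounds suggested by OpenBLAS users are to set
OPENBLAS_NUM_THREADS=1 so that OpenBLAS stays single-threaded inside our
OpenMP regions, or to rebuild OpenBLAS with a larger NUM_THREADS. I have not
verified either for our case.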

I agree that OpenBLAS has better performance,
so switching to MKL would be the perfect solution.
If that is not possible, we need to find a way to fix the error described in
[2].

Original comment by cmji...@ucdavis.edu on 29 Oct 2014 at 9:01

GoogleCodeExporter commented 9 years ago
If it is an OpenBLAS bug, it should be reported to
https://github.com/xianyi/OpenBLAS

But I do not think that it is a bug. More likely it is our inability to figure
out how to set the parameters up properly.

Original comment by iglovi...@gmail.com on 29 Oct 2014 at 9:09

GoogleCodeExporter commented 9 years ago
[1] Another thing I do not like is that the computer decides how many threads
to use.

I would prefer syntax similar to that of the Linux command "make":

----
make => 1 thread

make -j => all threads

make -j n => n threads
----

[2] The second question: there was a flag -D_QMC_MPI that was somehow related
to multithreading. How is this modern implementation related to that old one?

Original comment by iglovi...@gmail.com on 29 Oct 2014 at 9:19

GoogleCodeExporter commented 9 years ago
[1] The number of threads is specified through the subroutine called at the
beginning:
call DQMC_OMP_Init(nproc)
There are three modes:
1. max threads:
$./ggeom in -p
2. n threads, where 1 <= n <= max is a count you specify:
$./ggeom in -p n
3. original single thread:
$./ggeom in
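
For reference, a minimal sketch of what DQMC_OMP_Init could look like,
assuming nproc <= 0 is taken to mean "use all cores" (the actual QUEST routine
may differ):

----
! Minimal sketch; the real DQMC_OMP_Init may differ.
subroutine DQMC_OMP_Init(nproc)
  use omp_lib
  implicit none
  integer, intent(in) :: nproc
  if (nproc <= 0) then
     ! "-p" with no count: use every available core
     call omp_set_num_threads(omp_get_num_procs())
  else
     ! "-p n": use exactly n threads
     call omp_set_num_threads(nproc)
  end if
end subroutine DQMC_OMP_Init
----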

[2] No idea. I did not do anything about MPI.

Original comment by cmji...@ucdavis.edu on 29 Oct 2014 at 11:10

GoogleCodeExporter commented 9 years ago
[1] Looks really good to me.

Could you please write a wiki page describing why we need this, how much we
gain from running in parallel (some plots from your presentation), and how to
start a simulation in the parallel regime. Please also mention that at the
moment parallel computation works only with Intel MKL.

[2] Chia-Chen, could you please tell us something about MPI in QUEST, if you
know?

In my understanding, we should either absorb the old multithreading ideas and
implement them using OpenMP, or get rid of the old parallel code.

Original comment by iglovi...@gmail.com on 29 Oct 2014 at 11:22

GoogleCodeExporter commented 9 years ago
As far as I know, MPI is used in the part where statistical errors are
estimated from the simulation data.
So QUEST's MPI has nothing to do with the kernel, i.e. the Green's function
and the Monte Carlo sampling.

Original comment by cxc639 on 29 Oct 2014 at 11:39

GoogleCodeExporter commented 9 years ago
For details and the conclusions of the discussion in this issue, please refer
to the wiki page:
https://code.google.com/p/quest-qmc/wiki/Parallel_Time_Dependent_Measurement

Original comment by cmji...@ucdavis.edu on 30 Oct 2014 at 9:47