
gprMax is open source software that simulates electromagnetic wave propagation using the Finite-Difference Time-Domain (FDTD) method for numerical modelling of Ground Penetrating Radar (GPR)
https://www.gprmax.com
GNU General Public License v3.0

OpenMP - core utilisation on multiple CPU #94

Closed craig-warren closed 6 years ago

craig-warren commented 7 years ago

From @Mark-Dunscomb:

gprMax doesn't seem to utilize cores on a second processor when it runs. I.e. I have a computer with dual Xeon E5 8-core processors. When I run a model, all 8 cores on one of the processors are fully utilized, but the other 8 on the other processor aren't touched. I've looked into possible configuration-related issues with the processors but think I have them set to be incorporated together. Could the coding be involved here? It seems to me that both processors should be treated together as a single cluster, and therefore there should be no need to run additional MPI scripts as in an HPC environment, but... maybe there is?

Mark-Dunscomb commented 7 years ago

FYI, here is an image of core utilization while running a model (right side of image). Core 15 was used only for snipping the image.

[screenshot: core utilisation during a model run]

craig-warren commented 7 years ago

@Mark-Dunscomb can you give some more details of your machine, i.e. the motherboard? Does gprMax detect all 2x8=16 physical cores when it prints the info on your system?

Mark-Dunscomb commented 7 years ago

@craig-warren Sure thing, here it is: [screenshot: system information]

Here's the header text from a model run showing it found all 16 cores in the Host



```
=== Electromagnetic modelling software based on the Finite-Difference Time-Domain (FDTD) method =======================

    www.gprmax.com   __  __
     __ _ _ __  _ __|  \/  | __ ___  __
    / _` | '_ \| '__| |\/| |/ _` \ \/ /
   | (_| | |_) | |  | |  | | (_| |>  <
    \__, | .__/|_|  |_|  |_|\__,_/_/\_\
    |___/|_|
                       v3.0.17 (Bowmore)

 Copyright (C) 2015-2017: The University of Edinburgh
 Authors: Craig Warren and Antonis Giannopoulos

 gprMax is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as
  published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
 gprMax is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty
  of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.
 You should have received a copy of the GNU General Public License along with gprMax.  If not, see
  www.gnu.org/licenses.

Host: Supermicro Super Server; Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (16 cores); 128GiB RAM; Windows 7 (64-bit)
```
agianno commented 7 years ago

@Mark-Dunscomb Could you try setting the KMP_AFFINITY=disabled environmental variable?
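OpenMP runtimes read environment variables such as KMP_AFFINITY at process startup, so on Windows the variable can be set in the shell before running, or from Python before launching the solver. A minimal sketch (the model filename is a placeholder, not from this thread):

```python
import os
import subprocess
import sys

# Copy the current environment and disable Intel OpenMP thread pinning;
# the variable must be set before the solver process is created.
env = os.environ.copy()
env["KMP_AFFINITY"] = "disabled"

# Launch gprMax in a child process that inherits the modified environment.
# ("my_model.in" is an illustrative input file name.)
# subprocess.run([sys.executable, "-m", "gprMax", "my_model.in"], env=env)
```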

craig-warren commented 7 years ago

@Mark-Dunscomb what is the utility you are using on Windows to monitor CPU usage? Are you sure it is showing both CPUs? And not just the 8 physical cores and 8 hyper-threads for a single CPU?
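One quick cross-check from Python: the standard library reports logical processors (hyper-threads counted separately), which on the dual 8-core machine in this thread would be 32. Physical-core counts need a helper outside the standard library:

```python
import os

# os.cpu_count() returns the number of logical processors, i.e. each
# hyper-thread is counted separately (32 on this thread's machine).
logical = os.cpu_count()
print(f"Logical processors reported by the OS: {logical}")

# Physical cores are not exposed by the standard library; the third-party
# psutil package can report them:
#   import psutil
#   physical = psutil.cpu_count(logical=False)
```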

Mark-Dunscomb commented 7 years ago

@craig-warren I set KMP_AFFINITY=disabled. A list of the environment variables is below; confirm that is what you intended. The result was no change to how the CPUs are used: cores on one CPU are maxed out during modelling, while cores on the other CPU are only used at modelling startup/finish to save and read files. The CPU chosen to run the modelling is not always the same; I assume the choice depends on what other processes are active at the moment the model is started. Regardless, one CPU is always maxed out and the other is only used during the read/write portion of each iteration.

The utility I'm using is called "Core Temp". http://www.alcpu.com/CoreTemp/

Here is the list of environmental variables. See KMP on 4th and 5th line.


```
Python 3.5.2 |Anaconda custom (64-bit)| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ
environ({'WINDIR': 'C:\\Windows', 'HOMEPATH': '\\Users\\mdunscomb', 'OS': 'Windows_NT', 'COMMONPROGRAMFILES': 'C:\\Progr
am Files\\Common Files', 'PROCESSOR_ARCHITECTURE': 'AMD64', 'COMMONPROGRAMFILES(X86)': 'C:\\Program Files (x86)\\Common
Files', 'SESSIONNAME': 'Console', 'USERNAME': 'mdunscomb', 'PROCESSOR_REVISION': '3f02', 'PSMODULEPATH': 'C:\\Windows\\s
ystem32\\WindowsPowerShell\\v1.0\\Modules\\', 'ALLUSERSPROFILE': 'C:\\ProgramData', 'FP_NO_HOST_CHECK': 'NO', 'KMP_AFFIN
ITY': 'disabled', 'WINDOWS_TRACING_LOGFILE': 'C:\\BVTBin\\Tests\\installpackage\\csilogfile.log', 'COMSPEC': 'C:\\Window
s\\system32\\cmd.exe', 'PROGRAMDATA': 'C:\\ProgramData', 'PROCESSOR_LEVEL': '6', 'PROMPT': '$P$G', 'SYSTEMROOT': 'C:\\Wi
ndows', 'CONDA_PS1_BACKUP': '$P$G', 'PATHEXT': '.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC', 'USERDNSDOMAIN':
 'SEA.NET', 'PROGRAMFILES(X86)': 'C:\\Program Files (x86)', 'WINDOWS_TRACING_FLAGS': '3', 'USERDOMAIN': 'SEA', 'PROGRAMW
6432': 'C:\\Program Files', 'LOGONSERVER': '\\\\WCHS-FS', 'PUBLIC': 'C:\\Users\\Public', 'SYSTEMDRIVE': 'C:', 'LOCALAPPD
ATA': 'C:\\Users\\mdunscomb\\AppData\\Local', 'NUMBER_OF_PROCESSORS': '32', 'APPDATA': 'C:\\Users\\mdunscomb\\AppData\\R
oaming', 'PATH': 'C:\\Anaconda\\Library\\bin;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\
System32\\WindowsPowerShell\\v1.0\\;C:\\Program Files (x86)\\Windows Kits\\8.1\\Windows Performance Toolkit\\;C:\\Anacon
da;C:\\Anaconda\\Scripts;C:\\Anaconda\\Library\\bin;', 'COMPUTERNAME': '00-1510-003', 'COMMONPROGRAMW6432': 'C:\\Program
 Files\\Common Files', 'TMP': 'C:\\Users\\MDUNSC~1\\AppData\\Local\\Temp', 'HOMEDRIVE': 'C:', 'VS140COMNTOOLS': 'C:\\Pro
gram Files (x86)\\Microsoft Visual Studio 14.0\\Common7\\Tools\\', 'USERPROFILE': 'C:\\Users\\mdunscomb', 'TEMP': 'C:\\U
sers\\MDUNSC~1\\AppData\\Local\\Temp', 'PROGRAMFILES': 'C:\\Program Files', 'PROCESSOR_IDENTIFIER': 'Intel64 Family 6 Mo
del 63 Stepping 2, GenuineIntel'})
>>>
```
Mark-Dunscomb commented 7 years ago

@craig-warren I'm pretty sure it's monitoring both CPUs and not just hyper-threads. To confirm, here's an image of the Windows monitor, in which all threads are shown. In this case CPU No. 1 (lower row of usage graphs) is being used and CPU No. 0 (upper row) is mostly idle. Overall CPU usage is around 50%.

[screenshot: Windows performance monitor showing all threads]

craig-warren commented 7 years ago

@Mark-Dunscomb Could you install hwloc from https://www.open-mpi.org/software/hwloc/v1.11/downloads/hwloc-win64-build-1.11.6.zip and run lstopo.exe mark.xml at the command prompt from the bin directory of hwloc? If you can then attach that xml file, it should provide a detailed picture of how the CPU cores are laid out on your machine.

This is an example from my single socket Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz. topo-iMac15,1.pdf

Mark-Dunscomb commented 7 years ago

@craig-warren Here it is: mark.zip

agianno commented 7 years ago

@Mark-Dunscomb I think it may be an OpenMP setup problem. In trials with the new Windows Subsystem for Linux (WSL) I have found that, without disabling KMP_AFFINITY, OpenMP will crash. The problem is that we only have Windows VMs, which are not real multi-CPU installations. We will get to the bottom of this.

Mark-Dunscomb commented 7 years ago

@agianno Thanks.

craig-warren commented 7 years ago

@Mark-Dunscomb Attached is a diagram of the topology of your machine and also of our Linux box (haggis). You can see that the core numbers (PU P#) are ordered differently. On haggis, sequential core numbers (up to the number of physical cores minus one) will get one thread per core (and not use Hyper-Threading), whereas the same sequence on your machine would use 2 threads per core, i.e. Hyper-Threading. If you can pull the latest code, I'd like you to run a couple of tests:

  1. Run code as is and see if there is any change from previous behaviour.
  2. In the module input_cmds_singleuse.py, comment line 70 and uncomment line 71.

Thanks, Craig

topo-haggis.pdf topto-mark.pdf

Mark-Dunscomb commented 7 years ago

@craig-warren Interesting, I see what you mean regarding the naming convention.
I pulled the new code and ran several tests with Line 70 active and then with Line 71.

Short story: there's no difference, unfortunately.

Longer story: the gprMax header states it found 16 cores and 32 hyper-threads, but it lists "Number of CPU (OpenMP) threads: 16" when the model starts. 8 cores are used during modelling and the other 8 are only used during read/write operations. I ran tests with a Taguchi optimization, the sample B-scan over a cylinder model with 60 runs, and the 100x100x100 benchmark. It's harder to see what's happening while running the B-scan because it moves pretty quickly (~1.25 seconds per run), but only 8 cores are being used. I've included some screen captures from the Taguchi run, which is probably most telling, and from the benchmarking.

Capture gprMax Header.pdf Capture Line 70 active.pdf Capture Line 71 active.pdf Line 70 Active - supermicro_super_server;_intel(r)_xeon(r)_cpu_e5-2630v3@_2.40ghz;_windows7(64-bit).pdf Line 71 Active -supermicro_super_server;_intel(r)_xeon(r)_cpu_e5-2630v3@_2.40ghz;_windows7(64-bit).pdf

craig-warren commented 7 years ago

@Mark-Dunscomb thanks for all the detailed feedback.

The reported 16 cores, and 32 with Hyper-Threading, are the totals for your system, which is correct. I would then expect gprMax to use 8 physical cores on each CPU, which is why Number of CPU (OpenMP) threads: 16 is reported. Can you do a couple more things so we can try to be sure of what is going on:

Thanks,

Craig

Mark-Dunscomb commented 7 years ago

@craig-warren I've attached the benchmarking results along with a group of screenshots. I couldn't get all the threads on the same screen within Resource Monitor but could see them all within the Performance view, so there are images from both. It's interesting that Resource Monitor showed that half of the threads are parked. I'm working on removing that option and will let you know if that helps. Images.zip benchmarking Files.zip

Mark-Dunscomb commented 7 years ago

@craig-warren Disabling core parking unfortunately does not make a difference. My hope was that core parking was preventing gprMax from using those cores, but they seem to be parked because they are not being used (i.e. the software is not accessing them).

craig-warren commented 7 years ago

@Mark-Dunscomb Core parking certainly seems a bit odd, especially for a desktop machine - it sounds like a power-saving feature for mobile chips. Can you add the line #num_threads: 32 to your input file? The performance will not be that great, but at least we might see if we can get all the cores (physical + HT) running and shown in Task Manager or Resource Monitor.

Mark-Dunscomb commented 7 years ago

@craig-warren That produced an interesting warning in the model run header, but it did get all 32 threads going at 100%.

num_threads 32.zip

craig-warren commented 7 years ago

@Mark-Dunscomb Good news that all 32 threads got going. I would have expected that message, since you have 16 physical cores + 16 HT cores. I need to think some more about why requesting 16 physical cores does not appear to work as expected on your machine.

An interesting experiment would be to test the performance with Hyper-Threading on your machine. If you edit the gprMax.py module at line 206, change maxthreads = hostinfo['physicalcores'] to maxthreads = hostinfo['logicalcores'], and then run the 200x200x200 benchmark, let's see what the speedup graph looks like. Based on HT on other machines, I would expect the performance to degrade when you go above the number of physical cores, i.e. 16.
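The suggested edit amounts to switching which host-info field seeds the OpenMP thread count. A hypothetical sketch (the hostinfo values mirror the machine in this thread rather than a live query):

```python
# Hypothetical host info mirroring the machine discussed here:
# 16 physical cores, 32 logical cores with Hyper-Threading enabled.
hostinfo = {"physicalcores": 16, "logicalcores": 32}

# Default behaviour: one OpenMP thread per physical core.
maxthreads = hostinfo["physicalcores"]

# Experiment suggested above: allow one thread per logical core instead.
# maxthreads = hostinfo["logicalcores"]

print(f"Number of CPU (OpenMP) threads: {maxthreads}")
```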

Mark-Dunscomb commented 7 years ago

@craig-warren You called it. This leads me to the obvious question: is this a fruitless venture? I.e. is the overhead of using that many threads so significant that the overall process is slower? Would the same be true when modelling a B-scan on a large model space?

benchmarking force all threads.zip

zhoufeng617 commented 6 years ago

So, how is this going? I am also wondering whether it is worth buying a second CPU for my tower workstation for running gprMax.

In principle, dual CPUs with shared memory should be able to utilize all the cores on both with OpenMP. Does it work in practice?

Additionally, I enjoy the fast computation of the GPU. If I install a second GPU in my tower, is it more like distributed memory, as with MPI? I see in the manual: "(gprMax)$ python -m gprMax user_models/cylinder_Bscan_2D.in -n 60 -mpi 5 -gpu Note: The argument given with -mpi is number of MPI tasks, i.e. master + workers, for MPI task farm. So in this case, 1 master (CPU) and 4 workers (GPU cards)." Does that mean GPUs, even on the same machine, cannot be treated as a shared-memory architecture, and are more like multiple nodes in a cluster?

craig-warren commented 6 years ago

@zhoufeng617 the CPU-based solver which is parallelised using OpenMP can use all available CPU cores, but the speedup is not linear. Take a look at some of the plots in the performance section of the docs for examples on desktop, server, and HPC - http://docs.gprmax.com/en/latest/benchmarking.html
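The non-linear scaling can be illustrated with Amdahl's law, which bounds speedup by the serial fraction of the work. A sketch assuming, purely for illustration, that 95% of the solver time parallelises:

```python
def amdahl_speedup(parallel_fraction: float, n_threads: int) -> float:
    """Theoretical speedup when only part of the workload parallelises."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_threads)

# With 95% parallel work, going from 8 to 16 threads gains relatively
# little, and the curve flattens as thread count rises further.
for n in (1, 8, 16, 32):
    print(f"{n:2d} threads -> {amdahl_speedup(0.95, n):.2f}x")
```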

The use of MPI with GPUs is the same as for CPU, i.e. for task farming - distributing models to different GPUs or nodes in the case of CPU. Multiple GPUs are not treated as shared memory, i.e. you cannot distribute a large model among multiple GPUs. There are ways to mitigate this somewhat using things like NVLink - http://www.nvidia.com/object/nvlink.html