trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Cori KNL: best practices #1727

Closed jhux2 closed 3 years ago

jhux2 commented 7 years ago

I am opening this issue to record best practices for achieving good performance on the Cori KNL partition. @jjellio Could I ask you to document your recommendations in this ticket? I've had a number of questions regarding this. Thanks!

@sayerhs @aprokop @alanw0 @spdomin

aprokop commented 7 years ago

Per @jjellio :

My experience with Cray machines has been:

1) Restrict jobs so that they span as few switches as possible. For slurm, this is done via --switches=N@<max-wait>. N specifies the switch count, and max-wait is how long you are willing to wait for that layout. It accepts hh:mm:ss and days-hh, as well as many other formats (man sbatch for details).

To figure out the number of switches (N), you need to understand the machine's layout. On Cori, two cabinets share a switch, so compute the number of nodes per cabinet, multiply by 2, and the minimum number of switches is the ceiling of your desired node count divided by that magic number. E.g., on Cori this works out to 1 switch per 392 nodes.
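That arithmetic is easy to sketch in shell. The 392-nodes-per-switch figure is the Cori value quoted above (check your machine's layout before reusing it), and the sbatch line with its 2-hour max wait is a hypothetical example:

```shell
#!/bin/sh
# Minimum switch count for a job: ceil(nodes / nodes_per_switch).
# 392 nodes per switch is the Cori figure quoted above.
NODES_PER_SWITCH=392

min_switches() {
  nodes=$1
  # Integer ceiling division.
  echo $(( (nodes + NODES_PER_SWITCH - 1) / NODES_PER_SWITCH ))
}

# Hypothetical submission requesting that layout, waiting up to 2 hours:
#   sbatch --switches=$(min_switches 800)@02:00:00 job.sh
min_switches 800   # -> 3 (two full switches plus part of a third)
```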

2) Compile with hugepage support. You can use whichever page size you like, and unless your code somehow depends on the page size (hopefully it doesn't), you can then use whichever page size you want at runtime. (But you must compile with hugepages; this modifies the link line and sets important system variables that define the page size.)

2.b) When you run a code compiled with hugepage support and you wish to use hugepages, you need to load the hugepage module before running. (This is also the point where you choose a page size.) On Cray systems, the module system sets up environment variables that control the hugepage library that was linked. This is odd, since you now need to load a module from build time at runtime.

3) Make sure the KNL system admins have installed Intel's zonesort kernel module. If they have not (or will not), you should reboot the node before use. KNL performance in Cache mode can degrade over time due to ordering (and availability) of free memory pages. Zonesort compacts and sorts the list of free pages so that future memory allocations will not cause unnecessary HBM cache conflicts.

Rebooting may be the preferred option, though. If hugepages are used, available memory can become fragmented to the point that the OS is unable to obtain a hugepage; in that case the OS falls back to providing 4KB pages again. When this happens, you observe performance variation between nodes that use all hugepages and nodes that use only a few hugepages plus 4KB pages. This is easy to observe with the largest hugepage sizes (e.g., 512MB).

4) Check the uptime of the system. If the node has been up for days, I would reboot it. It is infuriating that KNL is so fragile w.r.t. system state, but my best runtimes are usually right after the system was down for maintenance (I believe that is because all nodes in my allocation were freshly booted).

That said, rebooting is expensive. Most systems charge your allocation for reboot time, and it can take around 20 minutes for a node to fully power cycle. It is also risky, because the node may not come back. (Power cycling seems to be hard on these systems.)

5) Pin your threads to cores or to hardware threads. I use OMP_PLACES=threads

I am actively looking into ways to control MPI progress threads. You really do not want a thread actively waiting on a KNL core that has other threads bound to it. You are pretty much guaranteed to observe around a 50% performance hit, because the active wait forces the core to share its resources among all contending threads.

6) Don't run on all 68 cores (if on the 68-core variant); leave a few cores free for the OS. (Slurm provides this via --core-spec=N, where N is the number of cores to reserve.) There is at least one paper that explored this; the recommendation was to leave 2 or 4 cores free. (Leaving 4 free makes the node much easier to use, since you then have 64 cores.)
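Points 5 and 6 combine into a simple per-node layout calculation. A shell sketch, where the 8-tasks-per-node choice and the srun line at the end are hypothetical placeholders:

```shell
#!/bin/sh
# Sketch: derive a per-node layout on the 68-core KNL variant,
# reserving 4 cores for the OS so 64 remain, then splitting those
# among MPI tasks.  Numbers follow the recommendations above.
TOTAL_CORES=68
RESERVED=4
USABLE=$(( TOTAL_CORES - RESERVED ))       # 64 cores for the app

TASKS_PER_NODE=8                           # example choice
CORES_PER_TASK=$(( USABLE / TASKS_PER_NODE ))
SRUN_C=$(( 4 * CORES_PER_TASK ))           # 4 HTs per core

echo "$USABLE $CORES_PER_TASK $SRUN_C"     # 64 8 32

# Hypothetical launch using these numbers:
#   sbatch --core-spec=$RESERVED ...
#   srun -c $SRUN_C --cpu_bind=cores ./app
```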

7) (Trilinos specific) I've seen very nice results with 1 task on a KNL node. This results in a SerialComm, and that seems to be a very good thing. E.g., try to use repartitioning that will ultimately pack a level onto a single node; MueLu's level logic should then use SerialComms if only one task is around. My data to support this is from Tpetra's SpMV. Regardless, this seems like a good design decision anyway.

8) You need to do all of the above, all the time. Zonesort, hugepages, and node uptime all work together to make the system more stable.

sayerhs commented 7 years ago

@jjellio Thanks for the comprehensive tips regarding compilation and running on KNL nodes. I had a few follow-up questions:

jjellio commented 7 years ago

No.

Since this thread is Cori specific, the way to use hugepages is:

1) module load craype-hugepages2M. This adds the correct include and linker arguments.

2) Compile like normal. Cray uses static linking plus compiler wrappers, so the mojo from 'module load' is hidden. You can inspect your shell environment to see it, e.g., env | grep HUGE

3) Run the code**: load the same module again (module load craype-hugepages2M) before launching ./run

** I am unsure whether you need to set specific MPICH variables that control hugepage use. You can set export MPICH_ENV_DISPLAY=verbose, which will dump the MPI environment when your code runs (rank 0 dumps it). There are a few HUGEPAGE settings, some of which are deprecated. It is not clear to me whether MPICH defaults to using hugepages if these are not enabled.
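Putting the three steps together, a batch-script sketch of this workflow might look like the following. The executable name ./run, the 2MB page size, and the elided allocation options are placeholders; this is the sequence described above, not a verified recipe:

```shell
#!/bin/bash
#SBATCH ...   # allocation options elided

# 1) Load the same hugepage module at build time and run time; the
#    Cray compiler wrappers pick up the include/link flags from it.
module load craype-hugepages2M

# 2) Build as normal; inspect what the module injected if curious:
#    env | grep HUGE

# 3) Have MPICH print its environment from rank 0, to inspect the
#    HUGEPAGE-related settings mentioned above.
export MPICH_ENV_DISPLAY=verbose

srun ./run   # placeholder executable
```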

The documentation for hugepages (man intro_hugepages) states that by default the hugepage modules force alignment to 512MB boundaries. By doing so, they effectively make it so you can use any hugepage size, since 2MB pages will always align if 512MB alignment is used. This means you should be able to compile with any hugepage module and run with a different one.
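The alignment claim is easy to sanity-check: every smaller power-of-two page size divides 512MB evenly, so a 512MB-aligned segment is also aligned for any of them. A quick shell check over the page sizes that appear in the table below:

```shell
#!/bin/sh
# Verify that 512MB is a multiple of every smaller hugepage size
# (sizes in MB), so 512MB alignment implies alignment for all of them.
ALIGN_MB=512
for size in 2 4 8 16 32 64 128 256; do
  if [ $(( ALIGN_MB % size )) -ne 0 ]; then
    echo "misaligned for ${size}MB"
    exit 1
  fi
done
echo "512MB alignment covers all smaller page sizes"
```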

I performed a test that looked at hugepage size; the goal is to choose the smallest page size that offers the best performance. The reasoning is that if the page size is too large and your app has many processes per node, you can reach a point where there is no contiguous chunk of memory large enough to allocate.

This data is from Cori runs using flat MPI with 64 MPI tasks per KNL node, on one node. I wrote a complicated scoring system that tried to weight times and choose a winner, but I ended up just looking at the data and choosing 2MB. It's rarely the best, but the relative difference between the best and 2MB is usually very small. I'd choose 2, 8, or 16 and not worry about it.

Each subgroup is sorted by the timing.

Truncated Timer Name Huge Page Size (MB) MinOverProcs (seconds)
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total 128 0.3194
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total 256 0.3197
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total 4 0.3197
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total 64 0.3197
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total 8 0.3197
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total 16 0.3201
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total 2 0.3203
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total 32 0.3207
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total 512 0.3213
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total none 0.3213
---------------------------------------------------------------------- -------- ----------
---------------------------------------------------------------------- -------- ----------
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total 4 0.6041
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total 128 0.6045
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total 16 0.6049
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total 256 0.605
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total 64 0.6051
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total 2 0.6052
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total 8 0.6054
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total 32 0.6056
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total 512 0.6075
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total none 0.6086
---------------------------------------------------------------------- -------- ----------
---------------------------------------------------------------------- -------- ----------
MueLu: AggregationPhase3Algorithm: BuildAggregates (total 4 0.01899
MueLu: AggregationPhase3Algorithm: BuildAggregates (total 64 0.01911
MueLu: AggregationPhase3Algorithm: BuildAggregates (total 128 0.01914
MueLu: AggregationPhase3Algorithm: BuildAggregates (total 512 0.01914
MueLu: AggregationPhase3Algorithm: BuildAggregates (total 2 0.01918
MueLu: AggregationPhase3Algorithm: BuildAggregates (total 256 0.01918
MueLu: AggregationPhase3Algorithm: BuildAggregates (total 16 0.01919
MueLu: AggregationPhase3Algorithm: BuildAggregates (total 32 0.01922
MueLu: AggregationPhase3Algorithm: BuildAggregates (total 8 0.01931
MueLu: AggregationPhase3Algorithm: BuildAggregates (total none 0.01968
---------------------------------------------------------------------- -------- ----------
---------------------------------------------------------------------- -------- ----------
MueLu: AmalgamationFactory: Build (total 16 0.006072
MueLu: AmalgamationFactory: Build (total 128 0.006167
MueLu: AmalgamationFactory: Build (total 8 0.006172
MueLu: AmalgamationFactory: Build (total 512 0.006173
MueLu: AmalgamationFactory: Build (total 64 0.006196
MueLu: AmalgamationFactory: Build (total 256 0.006197
MueLu: AmalgamationFactory: Build (total none 0.006198
MueLu: AmalgamationFactory: Build (total 2 0.006238
MueLu: AmalgamationFactory: Build (total 4 0.006259
MueLu: AmalgamationFactory: Build (total 32 0.006287
---------------------------------------------------------------------- -------- ----------
---------------------------------------------------------------------- -------- ----------
MueLu: CoalesceDropFactory: Build (total 16 1.083
MueLu: CoalesceDropFactory: Build (total 64 1.088
MueLu: CoalesceDropFactory: Build (total 8 1.089
MueLu: CoalesceDropFactory: Build (total 4 1.094
MueLu: CoalesceDropFactory: Build (total 2 1.096
MueLu: CoalesceDropFactory: Build (total 32 1.102
MueLu: CoalesceDropFactory: Build (total none 1.102
MueLu: CoalesceDropFactory: Build (total 256 1.119
MueLu: CoalesceDropFactory: Build (total 128 1.122
MueLu: CoalesceDropFactory: Build (total 512 1.153
---------------------------------------------------------------------- -------- ----------
---------------------------------------------------------------------- -------- ----------
MueLu: CoarseMapFactory: Build (total 32 0.01817
MueLu: CoarseMapFactory: Build (total 16 0.01839
MueLu: CoarseMapFactory: Build (total 2 0.01839
MueLu: CoarseMapFactory: Build (total 128 0.01851
MueLu: CoarseMapFactory: Build (total 8 0.01878
MueLu: CoarseMapFactory: Build (total none 0.01883
MueLu: CoarseMapFactory: Build (total 4 0.01886
MueLu: CoarseMapFactory: Build (total 256 0.01921
MueLu: CoarseMapFactory: Build (total 512 0.01925
MueLu: CoarseMapFactory: Build (total 64 0.01925
---------------------------------------------------------------------- -------- ----------
---------------------------------------------------------------------- -------- ----------
MueLu: CoordinatesTransferFactory: Build (total 2 0.1842
MueLu: CoordinatesTransferFactory: Build (total 32 0.1849
MueLu: CoordinatesTransferFactory: Build (total 16 0.1851
MueLu: CoordinatesTransferFactory: Build (total 4 0.1851
MueLu: CoordinatesTransferFactory: Build (total 64 0.1853
MueLu: CoordinatesTransferFactory: Build (total 8 0.186
MueLu: CoordinatesTransferFactory: Build (total 256 0.1861
MueLu: CoordinatesTransferFactory: Build (total 128 0.1869
MueLu: CoordinatesTransferFactory: Build (total 512 0.1909
MueLu: CoordinatesTransferFactory: Build (total none 0.1993
---------------------------------------------------------------------- -------- ----------
---------------------------------------------------------------------- -------- ----------
MueLu: FilteredAFactory: Matrix filtering (total 16 1.098
MueLu: FilteredAFactory: Matrix filtering (total 64 1.103
MueLu: FilteredAFactory: Matrix filtering (total 8 1.104
MueLu: FilteredAFactory: Matrix filtering (total 4 1.109
MueLu: FilteredAFactory: Matrix filtering (total 2 1.111
MueLu: FilteredAFactory: Matrix filtering (total none 1.116
MueLu: FilteredAFactory: Matrix filtering (total 32 1.117
MueLu: FilteredAFactory: Matrix filtering (total 256 1.135
MueLu: FilteredAFactory: Matrix filtering (total 128 1.137
MueLu: FilteredAFactory: Matrix filtering (total 512 1.169
---------------------------------------------------------------------- -------- ----------
---------------------------------------------------------------------- -------- ----------
MueLu: Hierarchy: Setup (total 16 55.7
MueLu: Hierarchy: Setup (total 32 55.73
MueLu: Hierarchy: Setup (total 4 55.73
MueLu: Hierarchy: Setup (total 8 55.81
MueLu: Hierarchy: Setup (total 128 56.18
MueLu: Hierarchy: Setup (total 2 56.19
MueLu: Hierarchy: Setup (total 64 56.23
MueLu: Hierarchy: Setup (total 256 58.17
MueLu: Hierarchy: Setup (total none 59.33
MueLu: Hierarchy: Setup (total 512 60.83
---------------------------------------------------------------------- -------- ----------
---------------------------------------------------------------------- -------- ----------
MueLu: Ifpack2Smoother: Setup Smoother (total 16 9.142
MueLu: Ifpack2Smoother: Setup Smoother (total 2 9.154
MueLu: Ifpack2Smoother: Setup Smoother (total 32 9.186
MueLu: Ifpack2Smoother: Setup Smoother (total 8 9.186
MueLu: Ifpack2Smoother: Setup Smoother (total 64 9.199
MueLu: Ifpack2Smoother: Setup Smoother (total 4 9.206
MueLu: Ifpack2Smoother: Setup Smoother (total 128 9.219
MueLu: Ifpack2Smoother: Setup Smoother (total none 9.458
MueLu: Ifpack2Smoother: Setup Smoother (total 256 9.595
MueLu: Ifpack2Smoother: Setup Smoother (total 512 10.37
---------------------------------------------------------------------- -------- ----------
---------------------------------------------------------------------- -------- ----------
MueLu: NullspaceFactory: Nullspace factory (total 128 0.003001
MueLu: NullspaceFactory: Nullspace factory (total none 0.003043
MueLu: NullspaceFactory: Nullspace factory (total 256 0.00305
MueLu: NullspaceFactory: Nullspace factory (total 4 0.00306
MueLu: NullspaceFactory: Nullspace factory (total 64 0.003092
MueLu: NullspaceFactory: Nullspace factory (total 32 0.003115
MueLu: NullspaceFactory: Nullspace factory (total 512 0.003119
MueLu: NullspaceFactory: Nullspace factory (total 16 0.003122
MueLu: NullspaceFactory: Nullspace factory (total 8 0.003136
MueLu: NullspaceFactory: Nullspace factory (total 2 0.003422
---------------------------------------------------------------------- -------- ----------
---------------------------------------------------------------------- -------- ----------
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total 512 0.0137
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total 64 0.01372
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total 128 0.01374
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total 8 0.01376
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total 2 0.0138
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total 32 0.01387
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total 4 0.01399
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total 16 0.01402
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total 256 0.01407
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total none 0.01415
---------------------------------------------------------------------- -------- ----------
---------------------------------------------------------------------- -------- ----------
MueLu: RAPFactory: Computing Ac (total 4 45.46
MueLu: RAPFactory: Computing Ac (total 32 45.48
MueLu: RAPFactory: Computing Ac (total 16 45.49
MueLu: RAPFactory: Computing Ac (total 8 45.56
MueLu: RAPFactory: Computing Ac (total 128 45.89
MueLu: RAPFactory: Computing Ac (total 64 45.97
MueLu: RAPFactory: Computing Ac (total 2 45.98
MueLu: RAPFactory: Computing Ac (total 256 47.47
MueLu: RAPFactory: Computing Ac (total none 48.79
MueLu: RAPFactory: Computing Ac (total 512 49.3
---------------------------------------------------------------------- -------- ----------
---------------------------------------------------------------------- -------- ----------
MueLu: RebalanceAcFactory: Computing Ac (total 2 0.354
MueLu: RebalanceAcFactory: Computing Ac (total 8 0.3547
MueLu: RebalanceAcFactory: Computing Ac (total 128 0.3561
MueLu: RebalanceAcFactory: Computing Ac (total 32 0.3564
MueLu: RebalanceAcFactory: Computing Ac (total 4 0.3565
MueLu: RebalanceAcFactory: Computing Ac (total 64 0.358
MueLu: RebalanceAcFactory: Computing Ac (total 16 0.3591
MueLu: RebalanceAcFactory: Computing Ac (total none 0.3609
MueLu: RebalanceAcFactory: Computing Ac (total 256 0.3668
MueLu: RebalanceAcFactory: Computing Ac (total 512 0.3676
---------------------------------------------------------------------- -------- ----------
---------------------------------------------------------------------- -------- ----------
MueLu: RebalanceTransferFactory: Build (total 4 46.02
MueLu: RebalanceTransferFactory: Build (total 16 46.05
MueLu: RebalanceTransferFactory: Build (total 32 46.05
MueLu: RebalanceTransferFactory: Build (total 8 46.13
MueLu: RebalanceTransferFactory: Build (total 128 46.46
MueLu: RebalanceTransferFactory: Build (total 64 46.54
MueLu: RebalanceTransferFactory: Build (total 2 46.55
MueLu: RebalanceTransferFactory: Build (total 256 48.06
MueLu: RebalanceTransferFactory: Build (total none 49.37
MueLu: RebalanceTransferFactory: Build (total 512 49.93
---------------------------------------------------------------------- -------- ----------
---------------------------------------------------------------------- -------- ----------
MueLu: RepartitionFactory: Build (total 4 45.97
MueLu: RepartitionFactory: Build (total 32 45.99
MueLu: RepartitionFactory: Build (total 16 46
MueLu: RepartitionFactory: Build (total 8 46.07
MueLu: RepartitionFactory: Build (total 128 46.41
MueLu: RepartitionFactory: Build (total 64 46.48
MueLu: RepartitionFactory: Build (total 2 46.49
MueLu: RepartitionFactory: Build (total 256 48.01
MueLu: RepartitionFactory: Build (total none 49.31
MueLu: RepartitionFactory: Build (total 512 49.87
---------------------------------------------------------------------- -------- ----------
---------------------------------------------------------------------- -------- ----------
MueLu: RepartitionHeuristicFactory: Build (total 128 0.01811
MueLu: RepartitionHeuristicFactory: Build (total 2 0.01827
MueLu: RepartitionHeuristicFactory: Build (total 4 0.01828
MueLu: RepartitionHeuristicFactory: Build (total none 0.01835
MueLu: RepartitionHeuristicFactory: Build (total 512 0.01839
MueLu: RepartitionHeuristicFactory: Build (total 64 0.01858
MueLu: RepartitionHeuristicFactory: Build (total 256 0.01868
MueLu: RepartitionHeuristicFactory: Build (total 16 0.01888
MueLu: RepartitionHeuristicFactory: Build (total 8 0.01907
MueLu: RepartitionHeuristicFactory: Build (total 32 0.0193
---------------------------------------------------------------------- -------- ----------
---------------------------------------------------------------------- -------- ----------
MueLu: SaPFactory: Prolongator smoothing (total 32 18.68
MueLu: SaPFactory: Prolongator smoothing (total 16 18.7
MueLu: SaPFactory: Prolongator smoothing (total 4 18.72
MueLu: SaPFactory: Prolongator smoothing (total 8 18.75
MueLu: SaPFactory: Prolongator smoothing (total 128 18.94
MueLu: SaPFactory: Prolongator smoothing (total 2 19
MueLu: SaPFactory: Prolongator smoothing (total 64 19.09
MueLu: SaPFactory: Prolongator smoothing (total none 19.58
MueLu: SaPFactory: Prolongator smoothing (total 256 19.86
MueLu: SaPFactory: Prolongator smoothing (total 512 20.53
---------------------------------------------------------------------- -------- ----------
---------------------------------------------------------------------- -------- ----------
MueLu: TentativePFactory: Build (total 32 3.894
MueLu: TentativePFactory: Build (total 64 3.898
MueLu: TentativePFactory: Build (total 2 3.9
MueLu: TentativePFactory: Build (total 4 3.9
MueLu: TentativePFactory: Build (total 128 3.901
MueLu: TentativePFactory: Build (total 16 3.901
MueLu: TentativePFactory: Build (total 8 3.901
MueLu: TentativePFactory: Build (total none 4.001
MueLu: TentativePFactory: Build (total 256 4.003
MueLu: TentativePFactory: Build (total 512 4.218
---------------------------------------------------------------------- -------- ----------
---------------------------------------------------------------------- -------- ----------
MueLu: UncoupledAggregationFactory: Build (total 16 2.5
MueLu: UncoupledAggregationFactory: Build (total 2 2.5
MueLu: UncoupledAggregationFactory: Build (total 32 2.5
MueLu: UncoupledAggregationFactory: Build (total 4 2.5
MueLu: UncoupledAggregationFactory: Build (total 8 2.501
MueLu: UncoupledAggregationFactory: Build (total 128 2.502
MueLu: UncoupledAggregationFactory: Build (total 64 2.502
MueLu: UncoupledAggregationFactory: Build (total 256 2.508
MueLu: UncoupledAggregationFactory: Build (total none 2.527
MueLu: UncoupledAggregationFactory: Build (total 512 2.619
---------------------------------------------------------------------- -------- ----------
---------------------------------------------------------------------- -------- ----------
MueLu: Zoltan2Interface: Build (total 4 0.1807
MueLu: Zoltan2Interface: Build (total 8 0.1815
MueLu: Zoltan2Interface: Build (total 16 0.1818
MueLu: Zoltan2Interface: Build (total 64 0.1818
MueLu: Zoltan2Interface: Build (total none 0.1832
MueLu: Zoltan2Interface: Build (total 2 0.1833
MueLu: Zoltan2Interface: Build (total 128 0.1835
MueLu: Zoltan2Interface: Build (total 32 0.1837
MueLu: Zoltan2Interface: Build (total 256 0.1968
MueLu: Zoltan2Interface: Build (total 512 0.2022
---------------------------------------------------------------------- ------- ----------
---------------------------------------------------------------------- ------- ----------

Edited to include the data when hugepages = none

jjellio commented 7 years ago

As a brief followup, you may notice that a page size of 512MB is typically the worst, and that its timing can be significantly longer than the best. E.g., repartition, rebalance, RAP, Hierarchy...

What happens is that with a page size that large, the system is not able to allocate 512MB pages all the time. When a hugepage allocation fails, the allocation falls back to the system's page size, which is 4KB. In prior studies (where we did not use hugepages), we observed extreme performance variations. It is very likely that those variations were due to poor page allocations, which result in poor cache utilization. (I use HBM in cache mode.)

sayerhs commented 7 years ago

@jjellio Wow! Thank you for the comprehensive response. Answered a lot of my questions, and a few that I didn't know to ask. I'll use these tips to setup the Nalu runs on Cori.

jjellio commented 7 years ago

@sayerhs My experience is with Quad/Cache mode. I suspect that not using HBM as a cache (quad/flat) may eliminate some issues, but things like hugepages will still show a benefit, because they improve TLB performance.

From my experience, no apps I've used can do real science and live entirely in 15GB of HBM memory. Therefore I tested with HBM in cache mode (which is Cori's default).

sayerhs commented 7 years ago

@jjellio : Have you investigated the use of autohbw for array allocation, and if so, what is your recommendation there? My understanding is that it is only used when running the node in flat mode; is that correct? Thanks in advance.

jjellio commented 7 years ago

I have not used it. You can restrict all allocations to HBM in flat mode with numactl -m 1, which will strictly bind the process to the HBM NUMA domain. Once 15GB is reached, mallocs will fail and the code will crash.

You can also use preferred binding: srun numactl -p 1 ./myapp

With preferred, the app will try to allocate in HBM first, but if that fails it will fall back to another NUMA domain (DDR4).

How would autohbw provide utility beyond numactl memory binding? Apparently, you can restrict autohbw so that it only uses HBM for allocations of a certain size. That seems to be its selling point (from what I have read).

I have avoided flat mode + preferred binding because I do not want some processes to perform slightly faster than others. Since you do not know what is in HBM and what is not, this seems like it would only lead to confusing performance analysis.

jhux2 commented 7 years ago

@jjellio @srajama1 Question from @sayerhs:

"I guess I don't understand how to turn off hyperthreading... is it just sufficient to set OMP_PLACES = cores for that and set -c flag with srun?"

ibaned commented 7 years ago

@dsunder

jjellio commented 7 years ago

There are a few ways to avoid having multiple threads per core.

Assume cores_per_process is the number of cores you are giving to each process:

  1. Set OMP_NUM_THREADS to match the number of cores you are giving to each task.
  2. Set OMP_PROC_BIND=spread.
  3. Set OMP_PLACES="cores($cores_per_process)" or OMP_PLACES=threads.
  4. Invoke srun (I am assuming this is SLURM): srun -c $(( 4 * $cores_per_process )) --cpu_bind=cores

What this will achieve: srun will create a CPU binding mask that restricts your process to a fixed number of cores and all 4 HTs of each core.

OMP_PLACES will then tell OpenMP to bind each OpenMP thread so that it can use any resource inside a core (but none elsewhere).

OMP_PROC_BIND=spread tells OpenMP to distribute the threads evenly across the process mask. Since you only have as many threads as there are cores, this places 1 thread per core.


What all of the above effectively does is bind one thread to each core. The thread can choose to run on any of the 4 HTs in that core.

If you choose OMP_PLACES=threads, then it binds one thread to a single HT in each core, and that thread cannot change HTs within the core.

OMP_PLACES=threads should be the best choice, as it prevents unneeded movement of the thread within a core.


If you are using one thread per core, and your code spends most of its time in OpenMP regions, you might see a gain from setting OMP_WAIT_POLICY=active (never set this if you use HTs). On KNL, the default is usually 'passive'


You can also instruct SLURM to completely remove the HTs from the task's process mask with --hint=nomultithread. A typical KNL process mask is 68 bits repeated 4 times: a one means the process can run on that hardware thread, a zero means it cannot. What nomultithread does is put zeros in the last 3x68 bits and ones only in the first 68 bits of the mask, effectively hiding the hardware threads from your process. I haven't looked at how OpenMP behaves with this option plus OMP_PLACES=cores. Ideally, OpenMP should see 68 cores with only 1 thread possible in each.
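As a concrete illustration of that "68 bits repeated 4 times" layout: under the common Linux enumeration on a 68-core KNL, hardware thread k of physical core c appears as CPU id c + k*68 (verify against /proc/cpuinfo on your own nodes). A small hypothetical helper makes the pattern visible:

```shell
#!/bin/sh
# Hypothetical helper: list the 4 hardware-thread CPU ids belonging to
# one physical core on a 68-core KNL, assuming the enumeration where
# HT k of core c is CPU id c + k*68.
CORES=68

cpus_for_core() {
  core=$1
  echo "$core $(( core + CORES )) $(( core + 2*CORES )) $(( core + 3*CORES ))"
}

cpus_for_core 0    # -> 0 68 136 204
cpus_for_core 67   # -> 67 135 203 271
```

With --hint=nomultithread, only the first id of each such group (0..67) remains settable in the mask.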

The above issue was pretty annoying to track down.


TLDR:

export OMP_PLACES=threads
export OMP_PROC_BIND=spread
export OMP_NUM_THREADS=${cores_per_proc}

srun -c $(( 4*${cores_per_proc} )) --cpu_bind=cores

Add --cpu_bind=cores,verbose and you can see the 68x4-bit masks for each process.

github-actions[bot] commented 3 years ago

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. If you would like to keep this issue open please add a comment and remove the MARKED_FOR_CLOSURE label. If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.

github-actions[bot] commented 3 years ago

This issue was closed due to inactivity for 395 days.