Closed: jhux2 closed this issue 3 years ago.
Per @jjellio :
My experience with Cray machines has been:
1) Restrict jobs so that they span as few switches as possible. For slurm, this is via --switches=&lt;count&gt;[@&lt;max-wait-time&gt;].
To figure out the number of switches (N), you need to understand the machine's layout. On Cori, two cabinets share a switch, so you compute the number of nodes per cabinet, times 2, and the minimum number of switches is then the ceiling of your desired node count divided by that magic number. E.g., on Cori this is 1 switch per 392 nodes.
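The ceiling computation can be done directly in a shell. The sketch below assumes Cori's 392 nodes per switch and a hypothetical 1000-node job; the sbatch line is illustrative and left commented out:

```shell
nodes=1000
nodes_per_switch=392          # Cori: two cabinets share a switch

# Ceiling division: minimum number of switches that can hold the job
min_switches=$(( (nodes + nodes_per_switch - 1) / nodes_per_switch ))
echo "$min_switches"          # 3

# Then request a compact allocation, waiting up to 10 minutes for it:
# sbatch --switches=${min_switches}@10:00 job.sh
```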
2) Compile with hugepage support. You can use whichever page size you like, and unless your code depends in some way on the page size (hopefully it doesn't), you can then use whichever page size you want at runtime. (But you must compile with hugepages; this modifies the link line, as well as setting important system variables that define the page size.)
2.b) When you run a code compiled with hugepage support, and you wish to use hugepages, you need to load the hugepage module before running. (This is also when you choose a page size.) On Cray systems, the module system sets up environment variables that control the hugepage library that was linked. This is odd, since you now need to load a build-time module at runtime.
3) Make sure the KNL system admins have installed Intel's zonesort kernel module. If they have not (or will not), you should reboot the node before use. KNL performance in Cache mode can degrade over time due to ordering (and availability) of free memory pages. Zonesort compacts and sorts the list of free pages so that future memory allocations will not cause unnecessary HBM cache conflicts.
Rebooting may be the preferred ticket, though. If hugepages are used, available memory can become fragmented to the point that the OS is unable to obtain a hugepage; in that case the OS falls back to providing 4KB pages. When this happens, you observe performance variation between nodes that use all hugepages and nodes that use only a few hugepages plus 4KB pages. This is easy to observe with the largest hugepage sizes (e.g., 512MB).
4) Check the uptime of the system. If the node has been up for days, I would reboot it. It is infuriating that KNL is so fragile w.r.t. the system state, but my best runtimes are usually right after the system was down for maintenance (I believe that is due to all nodes in my allocation being freshly booted)
That said, rebooting is expensive. Most systems charge your allocation for reboot time, and it can take around 20 minutes for a node to fully power cycle. This is also risky, because the node may not come back. (Power cycling seems to be hard on these systems.)
5) Pin your threads to cores or to hardware threads. I use OMP_PLACES=threads
I am actively looking into ways to control MPI progress threads. You really do not want any thread active waiting on a KNL core that has other threads bound to it. You are pretty much guaranteed to observe around a 50% performance hit, because the active wait will force the core to share those resources with all threads contending.
6) Don't run on all 68 cores (if on the 68-core variant); leave a few cores free for the OS. (Slurm provides this via --core-spec=N, where N is the number of cores to reserve.) There is at least one paper that explored this; the recommendation was to leave 2 or 4 free. (Leaving 4 free makes the node much easier to use, since you will have 64 cores.)
7) (Trilinos specific) I've seen very nice results with 1 task on a KNL node. This results in a SerialComm, and that seems to be a very good thing. E.g., try to use repartitioning that will ultimately pack a level onto a single node; MueLu's level logic should then use SerialComms if only one task is around. My data to support this is from Tpetra's SpMV. Regardless, this seems like a good design decision anyway.
8) You need to do all of the above, all the time. Zonesort, hugepages, and node uptime all work together to make the system more stable.
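Tips 1, 2.b, 5, and 6 can be combined in a single batch script. The sketch below is illustrative only: the node count, wait time, task geometry, and binary name (myapp) are assumptions, and the Slurm option spellings should be checked against your site's installed version:

```shell
#!/bin/bash
#SBATCH --nodes=784              # hypothetical job size
#SBATCH --switches=2@10:00       # tip 1: 784/392 = 2 switches, wait up to 10 min
#SBATCH --core-spec=4            # tip 6: reserve 4 cores for the OS, leaving 64

module load craype-hugepages2M   # tip 2.b: same hugepage module used at build time

export OMP_PLACES=threads        # tip 5: pin each thread to a hardware thread
export OMP_PROC_BIND=spread
export OMP_NUM_THREADS=16        # 4 tasks/node x 16 cores/task = 64 cores

# -c counts hardware threads: 4 HTs per core x 16 cores per task
srun -n $(( 784 * 4 )) -c $(( 4 * 16 )) --cpu_bind=cores ./myapp
```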
@jjellio Thanks for the comprehensive tips regarding compilation and running on KNL nodes. I had a few follow-up questions:
-lhugepage
during link time? Or is that something we should be manually adding to the CMake linker flags?

No.
Since this thread is Cori specific, the way to use hugepages is:
1) module load craype-hugepages2M This will add the correct include and linker arguments
2) compile like normal. Cray uses static linking + compiler wrappers, so the mojo from 'module load' is hidden. You can inspect your shell environment to see it though, e.g., env | grep HUGE
3) Run the code**: module load craype-hugepages2M, then ./run
** I am unsure if you need to set specific MPICH variables that control hugepage use. You can set export MPICH_ENV_DISPLAY=verbose, which will then dump the MPI environment when your code runs (rank 0 dumps). There are a few HUGEPAGE settings; some are deprecated. It is not clear to me whether MPICH will default to using hugepages if these are not enabled.
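The three steps above, as one shell session (a sketch: the Cray cc wrapper, the source file, and myapp are placeholder assumptions):

```shell
# Build with hugepage support; the module injects the flags, the wrapper hides them
module load craype-hugepages2M
cc -o myapp myapp.c
env | grep HUGE                    # inspect what the module actually set

# At run time the hugepage module must be loaded again (page size chosen here)
module load craype-hugepages2M
export MPICH_ENV_DISPLAY=verbose   # rank 0 dumps the MPI environment
srun -n 64 ./myapp
```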
The documentation for hugepages (man intro_hugepages) states that by default the hugepage modules force alignment to 512MB boundaries. By doing so, they effectively make it so you can use any hugepage size, since 2MB pages will always align if 512MB alignment is used. This means you should be able to compile with one hugepage module and run with a different one.
I performed a test that looked at hugepage size; the goal was to choose the smallest page size that offers the best performance. The reasoning: if the page size is too large and your app has many procs per node, you can reach a point where there is no contiguous chunk of memory large enough to allocate.
This data is from Cori runs using flat MPI with 64 MPI tasks per KNL node, and one node. I wrote a complicated scoring system that tried to weight times and choose a winner. But I ended up looking at the data and choosing 2MB. It's rarely the best, but the relative difference between the best and 2MB is usually really small. I'd choose 2, 8 or 16 and not worry about it.
Each subgroup is sorted by the timing.
Truncated Timer Name | Huge Page Size | MinOverProcs (seconds) |
---|---|---|
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total | 128 | 0.3194 |
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total | 256 | 0.3197 |
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total | 4 | 0.3197 |
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total | 64 | 0.3197 |
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total | 8 | 0.3197 |
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total | 16 | 0.3201 |
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total | 2 | 0.3203 |
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total | 32 | 0.3207 |
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total | 512 | 0.3213 |
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total | none | 0.3213 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total | 4 | 0.6041 |
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total | 128 | 0.6045 |
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total | 16 | 0.6049 |
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total | 256 | 0.605 |
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total | 64 | 0.6051 |
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total | 2 | 0.6052 |
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total | 8 | 0.6054 |
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total | 32 | 0.6056 |
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total | 512 | 0.6075 |
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total | none | 0.6086 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: AggregationPhase3Algorithm: BuildAggregates (total | 4 | 0.01899 |
MueLu: AggregationPhase3Algorithm: BuildAggregates (total | 64 | 0.01911 |
MueLu: AggregationPhase3Algorithm: BuildAggregates (total | 128 | 0.01914 |
MueLu: AggregationPhase3Algorithm: BuildAggregates (total | 512 | 0.01914 |
MueLu: AggregationPhase3Algorithm: BuildAggregates (total | 2 | 0.01918 |
MueLu: AggregationPhase3Algorithm: BuildAggregates (total | 256 | 0.01918 |
MueLu: AggregationPhase3Algorithm: BuildAggregates (total | 16 | 0.01919 |
MueLu: AggregationPhase3Algorithm: BuildAggregates (total | 32 | 0.01922 |
MueLu: AggregationPhase3Algorithm: BuildAggregates (total | 8 | 0.01931 |
MueLu: AggregationPhase3Algorithm: BuildAggregates (total | none | 0.01968 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: AmalgamationFactory: Build (total | 16 | 0.006072 |
MueLu: AmalgamationFactory: Build (total | 128 | 0.006167 |
MueLu: AmalgamationFactory: Build (total | 8 | 0.006172 |
MueLu: AmalgamationFactory: Build (total | 512 | 0.006173 |
MueLu: AmalgamationFactory: Build (total | 64 | 0.006196 |
MueLu: AmalgamationFactory: Build (total | 256 | 0.006197 |
MueLu: AmalgamationFactory: Build (total | none | 0.006198 |
MueLu: AmalgamationFactory: Build (total | 2 | 0.006238 |
MueLu: AmalgamationFactory: Build (total | 4 | 0.006259 |
MueLu: AmalgamationFactory: Build (total | 32 | 0.006287 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: CoalesceDropFactory: Build (total | 16 | 1.083 |
MueLu: CoalesceDropFactory: Build (total | 64 | 1.088 |
MueLu: CoalesceDropFactory: Build (total | 8 | 1.089 |
MueLu: CoalesceDropFactory: Build (total | 4 | 1.094 |
MueLu: CoalesceDropFactory: Build (total | 2 | 1.096 |
MueLu: CoalesceDropFactory: Build (total | 32 | 1.102 |
MueLu: CoalesceDropFactory: Build (total | none | 1.102 |
MueLu: CoalesceDropFactory: Build (total | 256 | 1.119 |
MueLu: CoalesceDropFactory: Build (total | 128 | 1.122 |
MueLu: CoalesceDropFactory: Build (total | 512 | 1.153 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: CoarseMapFactory: Build (total | 32 | 0.01817 |
MueLu: CoarseMapFactory: Build (total | 16 | 0.01839 |
MueLu: CoarseMapFactory: Build (total | 2 | 0.01839 |
MueLu: CoarseMapFactory: Build (total | 128 | 0.01851 |
MueLu: CoarseMapFactory: Build (total | 8 | 0.01878 |
MueLu: CoarseMapFactory: Build (total | none | 0.01883 |
MueLu: CoarseMapFactory: Build (total | 4 | 0.01886 |
MueLu: CoarseMapFactory: Build (total | 256 | 0.01921 |
MueLu: CoarseMapFactory: Build (total | 512 | 0.01925 |
MueLu: CoarseMapFactory: Build (total | 64 | 0.01925 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: CoordinatesTransferFactory: Build (total | 2 | 0.1842 |
MueLu: CoordinatesTransferFactory: Build (total | 32 | 0.1849 |
MueLu: CoordinatesTransferFactory: Build (total | 16 | 0.1851 |
MueLu: CoordinatesTransferFactory: Build (total | 4 | 0.1851 |
MueLu: CoordinatesTransferFactory: Build (total | 64 | 0.1853 |
MueLu: CoordinatesTransferFactory: Build (total | 8 | 0.186 |
MueLu: CoordinatesTransferFactory: Build (total | 256 | 0.1861 |
MueLu: CoordinatesTransferFactory: Build (total | 128 | 0.1869 |
MueLu: CoordinatesTransferFactory: Build (total | 512 | 0.1909 |
MueLu: CoordinatesTransferFactory: Build (total | none | 0.1993 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: FilteredAFactory: Matrix filtering (total | 16 | 1.098 |
MueLu: FilteredAFactory: Matrix filtering (total | 64 | 1.103 |
MueLu: FilteredAFactory: Matrix filtering (total | 8 | 1.104 |
MueLu: FilteredAFactory: Matrix filtering (total | 4 | 1.109 |
MueLu: FilteredAFactory: Matrix filtering (total | 2 | 1.111 |
MueLu: FilteredAFactory: Matrix filtering (total | none | 1.116 |
MueLu: FilteredAFactory: Matrix filtering (total | 32 | 1.117 |
MueLu: FilteredAFactory: Matrix filtering (total | 256 | 1.135 |
MueLu: FilteredAFactory: Matrix filtering (total | 128 | 1.137 |
MueLu: FilteredAFactory: Matrix filtering (total | 512 | 1.169 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: Hierarchy: Setup (total | 16 | 55.7 |
MueLu: Hierarchy: Setup (total | 32 | 55.73 |
MueLu: Hierarchy: Setup (total | 4 | 55.73 |
MueLu: Hierarchy: Setup (total | 8 | 55.81 |
MueLu: Hierarchy: Setup (total | 128 | 56.18 |
MueLu: Hierarchy: Setup (total | 2 | 56.19 |
MueLu: Hierarchy: Setup (total | 64 | 56.23 |
MueLu: Hierarchy: Setup (total | 256 | 58.17 |
MueLu: Hierarchy: Setup (total | none | 59.33 |
MueLu: Hierarchy: Setup (total | 512 | 60.83 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: Ifpack2Smoother: Setup Smoother (total | 16 | 9.142 |
MueLu: Ifpack2Smoother: Setup Smoother (total | 2 | 9.154 |
MueLu: Ifpack2Smoother: Setup Smoother (total | 32 | 9.186 |
MueLu: Ifpack2Smoother: Setup Smoother (total | 8 | 9.186 |
MueLu: Ifpack2Smoother: Setup Smoother (total | 64 | 9.199 |
MueLu: Ifpack2Smoother: Setup Smoother (total | 4 | 9.206 |
MueLu: Ifpack2Smoother: Setup Smoother (total | 128 | 9.219 |
MueLu: Ifpack2Smoother: Setup Smoother (total | none | 9.458 |
MueLu: Ifpack2Smoother: Setup Smoother (total | 256 | 9.595 |
MueLu: Ifpack2Smoother: Setup Smoother (total | 512 | 10.37 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: NullspaceFactory: Nullspace factory (total | 128 | 0.003001 |
MueLu: NullspaceFactory: Nullspace factory (total | none | 0.003043 |
MueLu: NullspaceFactory: Nullspace factory (total | 256 | 0.00305 |
MueLu: NullspaceFactory: Nullspace factory (total | 4 | 0.00306 |
MueLu: NullspaceFactory: Nullspace factory (total | 64 | 0.003092 |
MueLu: NullspaceFactory: Nullspace factory (total | 32 | 0.003115 |
MueLu: NullspaceFactory: Nullspace factory (total | 512 | 0.003119 |
MueLu: NullspaceFactory: Nullspace factory (total | 16 | 0.003122 |
MueLu: NullspaceFactory: Nullspace factory (total | 8 | 0.003136 |
MueLu: NullspaceFactory: Nullspace factory (total | 2 | 0.003422 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total | 512 | 0.0137 |
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total | 64 | 0.01372 |
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total | 128 | 0.01374 |
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total | 8 | 0.01376 |
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total | 2 | 0.0138 |
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total | 32 | 0.01387 |
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total | 4 | 0.01399 |
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total | 16 | 0.01402 |
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total | 256 | 0.01407 |
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total | none | 0.01415 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: RAPFactory: Computing Ac (total | 4 | 45.46 |
MueLu: RAPFactory: Computing Ac (total | 32 | 45.48 |
MueLu: RAPFactory: Computing Ac (total | 16 | 45.49 |
MueLu: RAPFactory: Computing Ac (total | 8 | 45.56 |
MueLu: RAPFactory: Computing Ac (total | 128 | 45.89 |
MueLu: RAPFactory: Computing Ac (total | 64 | 45.97 |
MueLu: RAPFactory: Computing Ac (total | 2 | 45.98 |
MueLu: RAPFactory: Computing Ac (total | 256 | 47.47 |
MueLu: RAPFactory: Computing Ac (total | none | 48.79 |
MueLu: RAPFactory: Computing Ac (total | 512 | 49.3 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: RebalanceAcFactory: Computing Ac (total | 2 | 0.354 |
MueLu: RebalanceAcFactory: Computing Ac (total | 8 | 0.3547 |
MueLu: RebalanceAcFactory: Computing Ac (total | 128 | 0.3561 |
MueLu: RebalanceAcFactory: Computing Ac (total | 32 | 0.3564 |
MueLu: RebalanceAcFactory: Computing Ac (total | 4 | 0.3565 |
MueLu: RebalanceAcFactory: Computing Ac (total | 64 | 0.358 |
MueLu: RebalanceAcFactory: Computing Ac (total | 16 | 0.3591 |
MueLu: RebalanceAcFactory: Computing Ac (total | none | 0.3609 |
MueLu: RebalanceAcFactory: Computing Ac (total | 256 | 0.3668 |
MueLu: RebalanceAcFactory: Computing Ac (total | 512 | 0.3676 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: RebalanceTransferFactory: Build (total | 4 | 46.02 |
MueLu: RebalanceTransferFactory: Build (total | 16 | 46.05 |
MueLu: RebalanceTransferFactory: Build (total | 32 | 46.05 |
MueLu: RebalanceTransferFactory: Build (total | 8 | 46.13 |
MueLu: RebalanceTransferFactory: Build (total | 128 | 46.46 |
MueLu: RebalanceTransferFactory: Build (total | 64 | 46.54 |
MueLu: RebalanceTransferFactory: Build (total | 2 | 46.55 |
MueLu: RebalanceTransferFactory: Build (total | 256 | 48.06 |
MueLu: RebalanceTransferFactory: Build (total | none | 49.37 |
MueLu: RebalanceTransferFactory: Build (total | 512 | 49.93 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: RepartitionFactory: Build (total | 4 | 45.97 |
MueLu: RepartitionFactory: Build (total | 32 | 45.99 |
MueLu: RepartitionFactory: Build (total | 16 | 46 |
MueLu: RepartitionFactory: Build (total | 8 | 46.07 |
MueLu: RepartitionFactory: Build (total | 128 | 46.41 |
MueLu: RepartitionFactory: Build (total | 64 | 46.48 |
MueLu: RepartitionFactory: Build (total | 2 | 46.49 |
MueLu: RepartitionFactory: Build (total | 256 | 48.01 |
MueLu: RepartitionFactory: Build (total | none | 49.31 |
MueLu: RepartitionFactory: Build (total | 512 | 49.87 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: RepartitionHeuristicFactory: Build (total | 128 | 0.01811 |
MueLu: RepartitionHeuristicFactory: Build (total | 2 | 0.01827 |
MueLu: RepartitionHeuristicFactory: Build (total | 4 | 0.01828 |
MueLu: RepartitionHeuristicFactory: Build (total | none | 0.01835 |
MueLu: RepartitionHeuristicFactory: Build (total | 512 | 0.01839 |
MueLu: RepartitionHeuristicFactory: Build (total | 64 | 0.01858 |
MueLu: RepartitionHeuristicFactory: Build (total | 256 | 0.01868 |
MueLu: RepartitionHeuristicFactory: Build (total | 16 | 0.01888 |
MueLu: RepartitionHeuristicFactory: Build (total | 8 | 0.01907 |
MueLu: RepartitionHeuristicFactory: Build (total | 32 | 0.0193 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: SaPFactory: Prolongator smoothing (total | 32 | 18.68 |
MueLu: SaPFactory: Prolongator smoothing (total | 16 | 18.7 |
MueLu: SaPFactory: Prolongator smoothing (total | 4 | 18.72 |
MueLu: SaPFactory: Prolongator smoothing (total | 8 | 18.75 |
MueLu: SaPFactory: Prolongator smoothing (total | 128 | 18.94 |
MueLu: SaPFactory: Prolongator smoothing (total | 2 | 19 |
MueLu: SaPFactory: Prolongator smoothing (total | 64 | 19.09 |
MueLu: SaPFactory: Prolongator smoothing (total | none | 19.58 |
MueLu: SaPFactory: Prolongator smoothing (total | 256 | 19.86 |
MueLu: SaPFactory: Prolongator smoothing (total | 512 | 20.53 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: TentativePFactory: Build (total | 32 | 3.894 |
MueLu: TentativePFactory: Build (total | 64 | 3.898 |
MueLu: TentativePFactory: Build (total | 2 | 3.9 |
MueLu: TentativePFactory: Build (total | 4 | 3.9 |
MueLu: TentativePFactory: Build (total | 128 | 3.901 |
MueLu: TentativePFactory: Build (total | 16 | 3.901 |
MueLu: TentativePFactory: Build (total | 8 | 3.901 |
MueLu: TentativePFactory: Build (total | none | 4.001 |
MueLu: TentativePFactory: Build (total | 256 | 4.003 |
MueLu: TentativePFactory: Build (total | 512 | 4.218 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: UncoupledAggregationFactory: Build (total | 16 | 2.5 |
MueLu: UncoupledAggregationFactory: Build (total | 2 | 2.5 |
MueLu: UncoupledAggregationFactory: Build (total | 32 | 2.5 |
MueLu: UncoupledAggregationFactory: Build (total | 4 | 2.5 |
MueLu: UncoupledAggregationFactory: Build (total | 8 | 2.501 |
MueLu: UncoupledAggregationFactory: Build (total | 128 | 2.502 |
MueLu: UncoupledAggregationFactory: Build (total | 64 | 2.502 |
MueLu: UncoupledAggregationFactory: Build (total | 256 | 2.508 |
MueLu: UncoupledAggregationFactory: Build (total | none | 2.527 |
MueLu: UncoupledAggregationFactory: Build (total | 512 | 2.619 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: Zoltan2Interface: Build (total | 4 | 0.1807 |
MueLu: Zoltan2Interface: Build (total | 8 | 0.1815 |
MueLu: Zoltan2Interface: Build (total | 16 | 0.1818 |
MueLu: Zoltan2Interface: Build (total | 64 | 0.1818 |
MueLu: Zoltan2Interface: Build (total | none | 0.1832 |
MueLu: Zoltan2Interface: Build (total | 2 | 0.1833 |
MueLu: Zoltan2Interface: Build (total | 128 | 0.1835 |
MueLu: Zoltan2Interface: Build (total | 32 | 0.1837 |
MueLu: Zoltan2Interface: Build (total | 256 | 0.1968 |
MueLu: Zoltan2Interface: Build (total | 512 | 0.2022 |
---------------------------------------------------------------------- | ------- | ---------- |
---------------------------------------------------------------------- | ------- | ---------- |
Edited to include the data when hugepages = none
As a brief followup, you may have noticed that a page size of 512MB is typically the worst, and that its timing can be significantly longer than the best. E.g., repartition, rebalance, RAP, Hierarchy...
What happens is that with a page size so large, the system is not able to allocate 512MB pages all the time. When a hugepage allocation fails, the OS falls back to the system's page size, which is 4KB. In prior studies (which did not use hugepages), we observed extreme performance variations. It is very likely that those variations were due to poor page allocations, which result in poor cache utilization. (I use HBM in cache mode.)
@jjellio Wow! Thank you for the comprehensive response. Answered a lot of my questions, and a few that I didn't know to ask. I'll use these tips to setup the Nalu runs on Cori.
@sayerhs My experience is with Quad/Cache mode. I suspect that turning off HBM (quad/flat) may eliminate some issues, but things like hugepages will still show a benefit, because they improve the performance of the TLB.
From my experience, no apps I've used can do real science and live entirely in 15GB of HBM memory. Therefore I tested with HBM in cache mode (which is Cori's default).
@jjellio : Have you investigated the use of autohbw for array allocation, and if so, what is your recommendation there? My understanding is that it is only used when running the node in flat mode; is that correct? Thanks in advance.
I have not used it. You can restrict all allocations to HBM in flat mode with numactl -m 1, which strictly binds the process to the HBM NUMA domain. Once 15GB is reached, mallocs will fail and the code will crash.
You can also use preferred memory binding with srun.
With preferred binding, the app will try to allocate in HBM first, but if that fails it will fall back to another NUMA domain (DDR4).
How would autohbw provide utility beyond numactl memory binding? Apparently, you can restrict autohbw so that it only uses HBM for allocations above a certain size. That seems to be its selling point (from what I have read).
I have avoided flat mode + preferred binding because I do not want some processes performing slightly faster than others. Since you do not know what is in HBM and what is not, this seems like it would only lead to confusing performance analysis.
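For flat mode, the two binding styles discussed above can be sketched as follows (assumptions: NUMA node 1 is the MCDRAM domain on a flat-mode KNL, and myapp is a placeholder binary):

```shell
# Strict: every allocation lands in HBM; malloc fails once ~15GB is exhausted.
numactl -m 1 ./myapp

# Preferred: try HBM first, silently fall back to DDR4 (node 0) when HBM is full.
numactl --preferred=1 ./myapp
```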
@jjellio @srajama1 Question from @sayerhs:
"I guess I don't understand how to turn off hyperthreading... is it just sufficient to set OMP_PLACES = cores for that and set -c flag with srun?"
@dsunder
There are a few ways you can avoid having multiple threads per core.
Assume cores_per_process is the number of cores you are giving to each process:
srun -c $(( 4 * $cores_per_process )) --cpu_bind=cores
What this will achieve: srun will create a CPU binding mask that restricts your process to a fixed number of cores and all 4 HTs for each core.
OMP_PLACES will then tell OpenMP to bind each OpenMP thread so that it can use any resource inside a core (but none elsewhere).
OMP_PROC_BIND=spread tells OpenMP to evenly distribute the threads across the process mask. Since you only have as many threads as there are cores, this places 1 thread per core.
What all of the above effectively does is bind one thread to each core. The thread can choose to run on any of the 4 HTs in that core.
If you choose OMP_PLACES=threads, then it binds one thread to a single HT in each core, and that thread cannot change HTs within the core.
OMP_PLACES=threads should be the best choice, as it prevents unneeded movement of the thread within a core.
If you are using one thread per core, and your code spends most of its time in OpenMP regions, you might see a gain from setting OMP_WAIT_POLICY=active (never set this if you use HTs). On KNL, the default is usually 'passive'
You can also instruct Slurm to completely remove the HTs from the task's process mask: --hint=nomultithread. A typical KNL process mask is 68 bits repeated 4 times; a one indicates the process can run on that hardware thread, a zero means it cannot. What nomultithread does is put zeros in the upper 3x68 bits and ones only in the first 68 bits of the mask. Effectively, this 'hides' the hardware threads from your process. I haven't looked at how OpenMP behaves with this option + OMP_PLACES=cores. Ideally, OpenMP should see 68 cores with only 1 thread possible in each.
srun -c <num> --cpu_bind=cores --hint=nomultithread
The above issue was pretty annoying to track down.
TLDR:
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
export OMP_NUM_THREADS=${cores_per_proc}
srun -c $(( 4*${cores_per_proc} )) --cpu_bind=cores
Add --cpu_bind=cores,verbose, and you can see the 68x4 bit masks for each process.
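Beyond srun's verbose masks, you can sanity-check the binding a process actually received from /proc (this works on any Linux node; run it inside the job, e.g. via srun):

```shell
# The kernel reports the CPUs this process is allowed to run on:
grep Cpus_allowed_list /proc/self/status
```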
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open, please add a comment and remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits, you can add the label DO_NOT_AUTOCLOSE.
This issue was closed due to inactivity for 395 days.
I am opening this issue to record best practices for achieving good performance on the Cori KNL partition. @jjellio Could I ask you to document your recommendations in this ticket? I've had a number of questions regarding this. Thanks!
@sayerhs @aprokop @alanw0 @spdomin