Closed: jhux2 closed this issue 3 years ago.
Per @jjellio :
My experience with Cray machines has been:
1) Restrict jobs so that they span as few switches as possible. For slurm, this is via --switches=&lt;count&gt;[@&lt;max-wait-time&gt;].
To figure out the number of switches (N), you need to understand the machine's layout. On Cori, two cabinets share a switch, so you compute the number of nodes per cabinet, times 2, and the minimum number of switches is then the ceiling of your desired node count divided by that magic number. E.g., on Cori this is 1 switch per 392 nodes.
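The ceiling computation can be done directly in a shell. The sketch below assumes Cori's 392 nodes per switch and a hypothetical 1000-node job; the sbatch line is illustrative and left commented out:

```shell
nodes=1000
nodes_per_switch=392          # Cori: two cabinets share a switch

# Ceiling division: minimum number of switches that can hold the job
min_switches=$(( (nodes + nodes_per_switch - 1) / nodes_per_switch ))
echo "$min_switches"          # 3

# Then request a compact allocation, waiting up to 10 minutes for it:
# sbatch --switches=${min_switches}@10:00 job.sh
```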
2) Compile with hugepage support. You can use whichever page size you like, and unless your code depends in some way on the page size (hopefully it doesn't), you can then use whichever page size you want at runtime. (But you must compile with hugepages; this modifies the link line, as well as setting important system variables that define the page size.)
2.b) When you run a code compiled with hugepage support, and you wish to use hugepages, you need to load the hugepage module before running. (This is also when you choose a page size.) On Cray systems, the module system sets up environment variables that control the hugepage library that was linked. This is odd, since you now need to load a build-time module at runtime.
3) Make sure the KNL system admins have installed Intel's zonesort kernel module. If they have not (or will not), you should reboot the node before use. KNL performance in Cache mode can degrade over time due to ordering (and availability) of free memory pages. Zonesort compacts and sorts the list of free pages so that future memory allocations will not cause unnecessary HBM cache conflicts.
Rebooting may be the preferred ticket, though. If hugepages are used, available memory can become fragmented to the point that the OS is unable to obtain a hugepage; in that case the OS falls back to providing 4KB pages. When this happens, you observe performance variation between nodes that use all hugepages and nodes that use only a few hugepages plus 4KB pages. This is easy to observe with the largest hugepage sizes (e.g., 512MB).
4) Check the uptime of the system. If the node has been up for days, I would reboot it. It is infuriating that KNL is so fragile w.r.t. the system state, but my best runtimes are usually right after the system was down for maintenance (I believe that is due to all nodes in my allocation being freshly booted)
That said, rebooting is expensive. Most systems charge your allocation for reboot time, and it can take around 20 minutes for a node to fully power cycle. This is also risky, because the node may not come back. (Power cycling seems to be hard on these systems.)
5) Pin your threads to cores or to hardware threads. I use OMP_PLACES=threads
I am actively looking into ways to control MPI progress threads. You really do not want any thread active waiting on a KNL core that has other threads bound to it. You are pretty much guaranteed to observe around a 50% performance hit, because the active wait will force the core to share those resources with all threads contending.
6) Don't run on all 68 cores (if on the 68-core variant); leave a few cores free for the OS. (Slurm provides this via --core-spec=N, where N is the number of cores to reserve.) There is at least one paper that explored this; the recommendation was to leave 2 or 4 free. (Leaving 4 free makes the node much easier to use, since you will have 64 cores.)
7) (Trilinos specific) I've seen very nice results with 1 task on a KNL node. This results in a SerialComm, and that seems to be a very good thing. E.g., try to use repartitioning that will ultimately pack a level onto a single node; MueLu's level logic should then use SerialComms if only one task is around. My data to support this is from Tpetra's SpMV. Regardless, this seems like a good design decision anyway.
8) You need to do all of the above, all the time. Zonesort, hugepages, and node uptime all work together to make the system more stable.
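Tips 1, 2.b, 5, and 6 can be combined in a single batch script. The sketch below is illustrative only: the node count, wait time, task geometry, and binary name (myapp) are assumptions, and the Slurm option spellings should be checked against your site's installed version:

```shell
#!/bin/bash
#SBATCH --nodes=784              # hypothetical job size
#SBATCH --switches=2@10:00       # tip 1: 784/392 = 2 switches, wait up to 10 min
#SBATCH --core-spec=4            # tip 6: reserve 4 cores for the OS, leaving 64

module load craype-hugepages2M   # tip 2.b: same hugepage module used at build time

export OMP_PLACES=threads        # tip 5: pin each thread to a hardware thread
export OMP_PROC_BIND=spread
export OMP_NUM_THREADS=16        # 4 tasks/node x 16 cores/task = 64 cores

# -c counts hardware threads: 4 HTs per core x 16 cores per task
srun -n $(( 784 * 4 )) -c $(( 4 * 16 )) --cpu_bind=cores ./myapp
```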
@jjellio Thanks for the comprehensive tips regarding compilation and running on KNL nodes. I had a few follow-up questions:
-lhugepage
during link time? Or is that something we should be manually adding to the CMake linker flags?

No.
Since this thread is Cori specific, the way to use hugepages is:
1) module load craype-hugepages2M This will add the correct include and linker arguments
2) compile like normal. Cray uses static linking + compiler wrappers, so the mojo from 'module load' is hidden. You can inspect your shell environment to see it though, e.g., env | grep HUGE
3) Run the code**: module load craype-hugepages2M, then ./run
** I am unsure if you need to set specific MPICH variables that control hugepage use. You can set export MPICH_ENV_DISPLAY=verbose, which will then dump the MPI environment when your code runs (rank 0 dumps). There are a few HUGEPAGE settings; some are deprecated. It is not clear to me whether MPICH will default to using hugepages if these are not enabled.
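The three steps above, as one shell session (a sketch: the Cray cc wrapper, the source file, and myapp are placeholder assumptions):

```shell
# Build with hugepage support; the module injects the flags, the wrapper hides them
module load craype-hugepages2M
cc -o myapp myapp.c
env | grep HUGE                    # inspect what the module actually set

# At run time the hugepage module must be loaded again (page size chosen here)
module load craype-hugepages2M
export MPICH_ENV_DISPLAY=verbose   # rank 0 dumps the MPI environment
srun -n 64 ./myapp
```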
The documentation for hugepages (man intro_hugepages) states that by default the hugepage modules force alignment to 512MB boundaries. By doing so, they effectively make it so you can use any hugepage size, since 2MB pages will always align if 512MB alignment is used. This means you should be able to compile with one hugepage module and run with a different one.
I performed a test that looked at hugepage size; the goal was to choose the smallest page size that offers the best performance. The reasoning: if the page size is too large and your app has many procs per node, you can reach a point where there is no contiguous chunk of memory large enough to allocate.
This data is from Cori runs using flat MPI with 64 MPI tasks per KNL node, and one node. I wrote a complicated scoring system that tried to weight times and choose a winner. But I ended up looking at the data and choosing 2MB. It's rarely the best, but the relative difference between the best and 2MB is usually really small. I'd choose 2, 8 or 16 and not worry about it.
Each subgroup is sorted by the timing.
Truncated Timer Name | Huge Page Size | MinOverProcs (seconds) |
---|---|---|
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total | 128 | 0.3194 |
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total | 256 | 0.3197 |
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total | 4 | 0.3197 |
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total | 64 | 0.3197 |
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total | 8 | 0.3197 |
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total | 16 | 0.3201 |
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total | 2 | 0.3203 |
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total | 32 | 0.3207 |
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total | 512 | 0.3213 |
MueLu: AggregationPhase2aAlgorithm: BuildAggregates (total | none | 0.3213 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total | 4 | 0.6041 |
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total | 128 | 0.6045 |
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total | 16 | 0.6049 |
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total | 256 | 0.605 |
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total | 64 | 0.6051 |
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total | 2 | 0.6052 |
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total | 8 | 0.6054 |
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total | 32 | 0.6056 |
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total | 512 | 0.6075 |
MueLu: AggregationPhase2bAlgorithm: BuildAggregates (total | none | 0.6086 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: AggregationPhase3Algorithm: BuildAggregates (total | 4 | 0.01899 |
MueLu: AggregationPhase3Algorithm: BuildAggregates (total | 64 | 0.01911 |
MueLu: AggregationPhase3Algorithm: BuildAggregates (total | 128 | 0.01914 |
MueLu: AggregationPhase3Algorithm: BuildAggregates (total | 512 | 0.01914 |
MueLu: AggregationPhase3Algorithm: BuildAggregates (total | 2 | 0.01918 |
MueLu: AggregationPhase3Algorithm: BuildAggregates (total | 256 | 0.01918 |
MueLu: AggregationPhase3Algorithm: BuildAggregates (total | 16 | 0.01919 |
MueLu: AggregationPhase3Algorithm: BuildAggregates (total | 32 | 0.01922 |
MueLu: AggregationPhase3Algorithm: BuildAggregates (total | 8 | 0.01931 |
MueLu: AggregationPhase3Algorithm: BuildAggregates (total | none | 0.01968 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: AmalgamationFactory: Build (total | 16 | 0.006072 |
MueLu: AmalgamationFactory: Build (total | 128 | 0.006167 |
MueLu: AmalgamationFactory: Build (total | 8 | 0.006172 |
MueLu: AmalgamationFactory: Build (total | 512 | 0.006173 |
MueLu: AmalgamationFactory: Build (total | 64 | 0.006196 |
MueLu: AmalgamationFactory: Build (total | 256 | 0.006197 |
MueLu: AmalgamationFactory: Build (total | none | 0.006198 |
MueLu: AmalgamationFactory: Build (total | 2 | 0.006238 |
MueLu: AmalgamationFactory: Build (total | 4 | 0.006259 |
MueLu: AmalgamationFactory: Build (total | 32 | 0.006287 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: CoalesceDropFactory: Build (total | 16 | 1.083 |
MueLu: CoalesceDropFactory: Build (total | 64 | 1.088 |
MueLu: CoalesceDropFactory: Build (total | 8 | 1.089 |
MueLu: CoalesceDropFactory: Build (total | 4 | 1.094 |
MueLu: CoalesceDropFactory: Build (total | 2 | 1.096 |
MueLu: CoalesceDropFactory: Build (total | 32 | 1.102 |
MueLu: CoalesceDropFactory: Build (total | none | 1.102 |
MueLu: CoalesceDropFactory: Build (total | 256 | 1.119 |
MueLu: CoalesceDropFactory: Build (total | 128 | 1.122 |
MueLu: CoalesceDropFactory: Build (total | 512 | 1.153 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: CoarseMapFactory: Build (total | 32 | 0.01817 |
MueLu: CoarseMapFactory: Build (total | 16 | 0.01839 |
MueLu: CoarseMapFactory: Build (total | 2 | 0.01839 |
MueLu: CoarseMapFactory: Build (total | 128 | 0.01851 |
MueLu: CoarseMapFactory: Build (total | 8 | 0.01878 |
MueLu: CoarseMapFactory: Build (total | none | 0.01883 |
MueLu: CoarseMapFactory: Build (total | 4 | 0.01886 |
MueLu: CoarseMapFactory: Build (total | 256 | 0.01921 |
MueLu: CoarseMapFactory: Build (total | 512 | 0.01925 |
MueLu: CoarseMapFactory: Build (total | 64 | 0.01925 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: CoordinatesTransferFactory: Build (total | 2 | 0.1842 |
MueLu: CoordinatesTransferFactory: Build (total | 32 | 0.1849 |
MueLu: CoordinatesTransferFactory: Build (total | 16 | 0.1851 |
MueLu: CoordinatesTransferFactory: Build (total | 4 | 0.1851 |
MueLu: CoordinatesTransferFactory: Build (total | 64 | 0.1853 |
MueLu: CoordinatesTransferFactory: Build (total | 8 | 0.186 |
MueLu: CoordinatesTransferFactory: Build (total | 256 | 0.1861 |
MueLu: CoordinatesTransferFactory: Build (total | 128 | 0.1869 |
MueLu: CoordinatesTransferFactory: Build (total | 512 | 0.1909 |
MueLu: CoordinatesTransferFactory: Build (total | none | 0.1993 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: FilteredAFactory: Matrix filtering (total | 16 | 1.098 |
MueLu: FilteredAFactory: Matrix filtering (total | 64 | 1.103 |
MueLu: FilteredAFactory: Matrix filtering (total | 8 | 1.104 |
MueLu: FilteredAFactory: Matrix filtering (total | 4 | 1.109 |
MueLu: FilteredAFactory: Matrix filtering (total | 2 | 1.111 |
MueLu: FilteredAFactory: Matrix filtering (total | none | 1.116 |
MueLu: FilteredAFactory: Matrix filtering (total | 32 | 1.117 |
MueLu: FilteredAFactory: Matrix filtering (total | 256 | 1.135 |
MueLu: FilteredAFactory: Matrix filtering (total | 128 | 1.137 |
MueLu: FilteredAFactory: Matrix filtering (total | 512 | 1.169 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: Hierarchy: Setup (total | 16 | 55.7 |
MueLu: Hierarchy: Setup (total | 32 | 55.73 |
MueLu: Hierarchy: Setup (total | 4 | 55.73 |
MueLu: Hierarchy: Setup (total | 8 | 55.81 |
MueLu: Hierarchy: Setup (total | 128 | 56.18 |
MueLu: Hierarchy: Setup (total | 2 | 56.19 |
MueLu: Hierarchy: Setup (total | 64 | 56.23 |
MueLu: Hierarchy: Setup (total | 256 | 58.17 |
MueLu: Hierarchy: Setup (total | none | 59.33 |
MueLu: Hierarchy: Setup (total | 512 | 60.83 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: Ifpack2Smoother: Setup Smoother (total | 16 | 9.142 |
MueLu: Ifpack2Smoother: Setup Smoother (total | 2 | 9.154 |
MueLu: Ifpack2Smoother: Setup Smoother (total | 32 | 9.186 |
MueLu: Ifpack2Smoother: Setup Smoother (total | 8 | 9.186 |
MueLu: Ifpack2Smoother: Setup Smoother (total | 64 | 9.199 |
MueLu: Ifpack2Smoother: Setup Smoother (total | 4 | 9.206 |
MueLu: Ifpack2Smoother: Setup Smoother (total | 128 | 9.219 |
MueLu: Ifpack2Smoother: Setup Smoother (total | none | 9.458 |
MueLu: Ifpack2Smoother: Setup Smoother (total | 256 | 9.595 |
MueLu: Ifpack2Smoother: Setup Smoother (total | 512 | 10.37 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: NullspaceFactory: Nullspace factory (total | 128 | 0.003001 |
MueLu: NullspaceFactory: Nullspace factory (total | none | 0.003043 |
MueLu: NullspaceFactory: Nullspace factory (total | 256 | 0.00305 |
MueLu: NullspaceFactory: Nullspace factory (total | 4 | 0.00306 |
MueLu: NullspaceFactory: Nullspace factory (total | 64 | 0.003092 |
MueLu: NullspaceFactory: Nullspace factory (total | 32 | 0.003115 |
MueLu: NullspaceFactory: Nullspace factory (total | 512 | 0.003119 |
MueLu: NullspaceFactory: Nullspace factory (total | 16 | 0.003122 |
MueLu: NullspaceFactory: Nullspace factory (total | 8 | 0.003136 |
MueLu: NullspaceFactory: Nullspace factory (total | 2 | 0.003422 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total | 512 | 0.0137 |
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total | 64 | 0.01372 |
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total | 128 | 0.01374 |
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total | 8 | 0.01376 |
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total | 2 | 0.0138 |
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total | 32 | 0.01387 |
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total | 4 | 0.01399 |
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total | 16 | 0.01402 |
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total | 256 | 0.01407 |
MueLu: PreserveDirichletAggregationAlgorithm: BuildAggregates (total | none | 0.01415 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: RAPFactory: Computing Ac (total | 4 | 45.46 |
MueLu: RAPFactory: Computing Ac (total | 32 | 45.48 |
MueLu: RAPFactory: Computing Ac (total | 16 | 45.49 |
MueLu: RAPFactory: Computing Ac (total | 8 | 45.56 |
MueLu: RAPFactory: Computing Ac (total | 128 | 45.89 |
MueLu: RAPFactory: Computing Ac (total | 64 | 45.97 |
MueLu: RAPFactory: Computing Ac (total | 2 | 45.98 |
MueLu: RAPFactory: Computing Ac (total | 256 | 47.47 |
MueLu: RAPFactory: Computing Ac (total | none | 48.79 |
MueLu: RAPFactory: Computing Ac (total | 512 | 49.3 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: RebalanceAcFactory: Computing Ac (total | 2 | 0.354 |
MueLu: RebalanceAcFactory: Computing Ac (total | 8 | 0.3547 |
MueLu: RebalanceAcFactory: Computing Ac (total | 128 | 0.3561 |
MueLu: RebalanceAcFactory: Computing Ac (total | 32 | 0.3564 |
MueLu: RebalanceAcFactory: Computing Ac (total | 4 | 0.3565 |
MueLu: RebalanceAcFactory: Computing Ac (total | 64 | 0.358 |
MueLu: RebalanceAcFactory: Computing Ac (total | 16 | 0.3591 |
MueLu: RebalanceAcFactory: Computing Ac (total | none | 0.3609 |
MueLu: RebalanceAcFactory: Computing Ac (total | 256 | 0.3668 |
MueLu: RebalanceAcFactory: Computing Ac (total | 512 | 0.3676 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: RebalanceTransferFactory: Build (total | 4 | 46.02 |
MueLu: RebalanceTransferFactory: Build (total | 16 | 46.05 |
MueLu: RebalanceTransferFactory: Build (total | 32 | 46.05 |
MueLu: RebalanceTransferFactory: Build (total | 8 | 46.13 |
MueLu: RebalanceTransferFactory: Build (total | 128 | 46.46 |
MueLu: RebalanceTransferFactory: Build (total | 64 | 46.54 |
MueLu: RebalanceTransferFactory: Build (total | 2 | 46.55 |
MueLu: RebalanceTransferFactory: Build (total | 256 | 48.06 |
MueLu: RebalanceTransferFactory: Build (total | none | 49.37 |
MueLu: RebalanceTransferFactory: Build (total | 512 | 49.93 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: RepartitionFactory: Build (total | 4 | 45.97 |
MueLu: RepartitionFactory: Build (total | 32 | 45.99 |
MueLu: RepartitionFactory: Build (total | 16 | 46 |
MueLu: RepartitionFactory: Build (total | 8 | 46.07 |
MueLu: RepartitionFactory: Build (total | 128 | 46.41 |
MueLu: RepartitionFactory: Build (total | 64 | 46.48 |
MueLu: RepartitionFactory: Build (total | 2 | 46.49 |
MueLu: RepartitionFactory: Build (total | 256 | 48.01 |
MueLu: RepartitionFactory: Build (total | none | 49.31 |
MueLu: RepartitionFactory: Build (total | 512 | 49.87 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: RepartitionHeuristicFactory: Build (total | 128 | 0.01811 |
MueLu: RepartitionHeuristicFactory: Build (total | 2 | 0.01827 |
MueLu: RepartitionHeuristicFactory: Build (total | 4 | 0.01828 |
MueLu: RepartitionHeuristicFactory: Build (total | none | 0.01835 |
MueLu: RepartitionHeuristicFactory: Build (total | 512 | 0.01839 |
MueLu: RepartitionHeuristicFactory: Build (total | 64 | 0.01858 |
MueLu: RepartitionHeuristicFactory: Build (total | 256 | 0.01868 |
MueLu: RepartitionHeuristicFactory: Build (total | 16 | 0.01888 |
MueLu: RepartitionHeuristicFactory: Build (total | 8 | 0.01907 |
MueLu: RepartitionHeuristicFactory: Build (total | 32 | 0.0193 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: SaPFactory: Prolongator smoothing (total | 32 | 18.68 |
MueLu: SaPFactory: Prolongator smoothing (total | 16 | 18.7 |
MueLu: SaPFactory: Prolongator smoothing (total | 4 | 18.72 |
MueLu: SaPFactory: Prolongator smoothing (total | 8 | 18.75 |
MueLu: SaPFactory: Prolongator smoothing (total | 128 | 18.94 |
MueLu: SaPFactory: Prolongator smoothing (total | 2 | 19 |
MueLu: SaPFactory: Prolongator smoothing (total | 64 | 19.09 |
MueLu: SaPFactory: Prolongator smoothing (total | none | 19.58 |
MueLu: SaPFactory: Prolongator smoothing (total | 256 | 19.86 |
MueLu: SaPFactory: Prolongator smoothing (total | 512 | 20.53 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: TentativePFactory: Build (total | 32 | 3.894 |
MueLu: TentativePFactory: Build (total | 64 | 3.898 |
MueLu: TentativePFactory: Build (total | 2 | 3.9 |
MueLu: TentativePFactory: Build (total | 4 | 3.9 |
MueLu: TentativePFactory: Build (total | 128 | 3.901 |
MueLu: TentativePFactory: Build (total | 16 | 3.901 |
MueLu: TentativePFactory: Build (total | 8 | 3.901 |
MueLu: TentativePFactory: Build (total | none | 4.001 |
MueLu: TentativePFactory: Build (total | 256 | 4.003 |
MueLu: TentativePFactory: Build (total | 512 | 4.218 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: UncoupledAggregationFactory: Build (total | 16 | 2.5 |
MueLu: UncoupledAggregationFactory: Build (total | 2 | 2.5 |
MueLu: UncoupledAggregationFactory: Build (total | 32 | 2.5 |
MueLu: UncoupledAggregationFactory: Build (total | 4 | 2.5 |
MueLu: UncoupledAggregationFactory: Build (total | 8 | 2.501 |
MueLu: UncoupledAggregationFactory: Build (total | 128 | 2.502 |
MueLu: UncoupledAggregationFactory: Build (total | 64 | 2.502 |
MueLu: UncoupledAggregationFactory: Build (total | 256 | 2.508 |
MueLu: UncoupledAggregationFactory: Build (total | none | 2.527 |
MueLu: UncoupledAggregationFactory: Build (total | 512 | 2.619 |
---------------------------------------------------------------------- | -------- | ---------- |
---------------------------------------------------------------------- | -------- | ---------- |
MueLu: Zoltan2Interface: Build (total | 4 | 0.1807 |
MueLu: Zoltan2Interface: Build (total | 8 | 0.1815 |
MueLu: Zoltan2Interface: Build (total | 16 | 0.1818 |
MueLu: Zoltan2Interface: Build (total | 64 | 0.1818 |
MueLu: Zoltan2Interface: Build (total | none | 0.1832 |
MueLu: Zoltan2Interface: Build (total | 2 | 0.1833 |
MueLu: Zoltan2Interface: Build (total | 128 | 0.1835 |
MueLu: Zoltan2Interface: Build (total | 32 | 0.1837 |
MueLu: Zoltan2Interface: Build (total | 256 | 0.1968 |
MueLu: Zoltan2Interface: Build (total | 512 | 0.2022 |
---------------------------------------------------------------------- | ------- | ---------- |
---------------------------------------------------------------------- | ------- | ---------- |
Edited to include the data when hugepages = none
As a brief followup, you may have noticed that a page size of 512MB is typically the worst, and that its timing can be significantly longer than the best. E.g., repartition, rebalance, RAP, Hierarchy...
What happens is that with a page size so large, the system is not able to allocate 512MB pages all the time. When a hugepage allocation fails, the OS falls back to the system's page size, which is 4KB. In prior studies (which did not use hugepages), we observed extreme performance variations. It is very likely that those variations were due to poor page allocations, which result in poor cache utilization. (I use HBM in cache mode.)
@jjellio Wow! Thank you for the comprehensive response. Answered a lot of my questions, and a few that I didn't know to ask. I'll use these tips to setup the Nalu runs on Cori.
@sayerhs My experience is with Quad/Cache mode. I suspect that turning off HBM (quad/flat) may eliminate some issues, but things like hugepages will still show a benefit, because they improve the performance of the TLB.
From my experience, no apps I've used can do real science and live entirely in 15GB of HBM memory. Therefore I tested with HBM in cache mode (which is Cori's default).
@jjellio : Have you investigated the use of autohbw for array allocation, and if so, what is your recommendation there? My understanding is that it is only used when running the node in flat mode; is that correct? Thanks in advance.
I have not used it. You can restrict all allocations to HBM in flat mode with numactl -m 1, which strictly binds the process to the HBM NUMA domain. Once 15GB is reached, mallocs will fail and the code will crash.
You can also use preferred memory binding with srun.
With preferred binding, the app will try to allocate in HBM first, but if that fails it will fall back to another NUMA domain (DDR4).
How would autohbw provide utility beyond numactl memory binding? Apparently, you can restrict autohbw so that it only uses HBM for allocations above a certain size. That seems to be its selling point (from what I have read).
I have avoided flat mode + preferred binding because I do not want some processes performing slightly faster than others. Since you do not know what is in HBM and what is not, this seems like it would only lead to confusing performance analysis.
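For flat mode, the two binding styles discussed above can be sketched as follows (assumptions: NUMA node 1 is the MCDRAM domain on a flat-mode KNL, and myapp is a placeholder binary):

```shell
# Strict: every allocation lands in HBM; malloc fails once ~15GB is exhausted.
numactl -m 1 ./myapp

# Preferred: try HBM first, silently fall back to DDR4 (node 0) when HBM is full.
numactl --preferred=1 ./myapp
```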
@jjellio @srajama1 Question from @sayerhs:
"I guess I don't understand how to turn off hyperthreading... is it just sufficient to set OMP_PLACES = cores for that and set -c flag with srun?"
@dsunder
There are a few ways you can avoid having multiple threads per core.
Assume cores_per_process is the number of cores you are giving to each process:
srun -c $(( 4 * $cores_per_process )) --cpu_bind=cores
What this will achieve: srun will create a CPU binding mask that restricts your process to a fixed number of cores and all 4 HTs for each core.
OMP_PLACES will then tell OpenMP to bind each OpenMP thread so that it can use any resource inside a core (but none elsewhere).
OMP_PROC_BIND=spread tells OpenMP to evenly distribute the threads across the process mask. Since you only have as many threads as there are cores, this places 1 thread per core.
What all of the above effectively does is bind one thread to each core. The thread can choose to run on any of the 4 HTs in that core.
If you choose OMP_PLACES=threads, then it binds one thread to a single HT in each core, and that thread cannot change HTs within the core.
OMP_PLACES=threads should be the best choice, as it prevents unneeded movement of the thread within a core.
If you are using one thread per core, and your code spends most of its time in OpenMP regions, you might see a gain from setting OMP_WAIT_POLICY=active (never set this if you use HTs). On KNL, the default is usually 'passive'
You can also instruct Slurm to completely remove the HTs from the task's process mask: --hint=nomultithread. A typical KNL process mask is 68 bits repeated 4 times; a one indicates the process can run on that hardware thread, a zero means it cannot. What nomultithread does is put zeros in the upper 3x68 bits and ones only in the first 68 bits of the mask. Effectively, this 'hides' the hardware threads from your process. I haven't looked at how OpenMP behaves with this option + OMP_PLACES=cores. Ideally, OpenMP should see 68 cores with only 1 thread possible in each.
srun -c <num> --cpu_bind=cores --hint=nomultithread
The above issue was pretty annoying to track down.
TLDR:
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
export OMP_NUM_THREADS=${cores_per_proc}
srun -c $(( 4*${cores_per_proc} )) --cpu_bind=cores
Add --cpu_bind=cores,verbose, and you can see the 68x4 bit masks for each process.
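Beyond srun's verbose masks, you can sanity-check the binding a process actually received from /proc (this works on any Linux node; run it inside the job, e.g. via srun):

```shell
# The kernel reports the CPUs this process is allowed to run on:
grep Cpus_allowed_list /proc/self/status
```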
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open, please add a comment and remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits, you can add the label DO_NOT_AUTOCLOSE.
This issue was closed due to inactivity for 395 days.
I am opening this issue to record best practices for achieving good performance on the Cori KNL partition. @jjellio Could I ask you to document your recommendations in this ticket? I've had a number of questions regarding this. Thanks!
@sayerhs @aprokop @alanw0 @spdomin