tobigithub / quantum-xtb

Benchmarks and examples for Grimme's semiempirical GFNn-xTB

check omp export OMP_MAX_ACTIVE_LEVELS #17

Open tobigithub opened 4 years ago

tobigithub commented 4 years ago

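export OMP_MAX_ACTIVE_LEVELS=0 gives bad performance: the run below is effectively serial, with a cpu/wall ratio of about 1.0. This may be due to bound processors (OMP_PROC_BIND=true); it needs to be checked in a clean environment.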

OMP_STACKSIZE=1G
OMP_PROC_BIND=true
MKL_NUM_THREADS=10
OMP_MAX_ACTIVE_LEVELS=20
OMP_NUM_THREADS=44

time xtb mol-269.xyz --opt extreme
------------------------------------------------------------------------
 * finished run on 2020/03/06 at 07:14:13.300
------------------------------------------------------------------------
 total:
 * wall-time:     0 d,  0 h,  0 min, 41.888 sec
 *  cpu-time:     0 d,  0 h,  0 min, 41.854 sec
 * ratio c/w:     0.999 speedup
 SCF:
 * wall-time:     0 d,  0 h,  0 min,  0.230 sec
 *  cpu-time:     0 d,  0 h,  0 min,  0.230 sec
 * ratio c/w:     1.000 speedup
 ANC optimizer:
 * wall-time:     0 d,  0 h,  0 min, 41.432 sec
 *  cpu-time:     0 d,  0 h,  0 min, 41.431 sec
 * ratio c/w:     1.000 speedup

normal termination of xtb

real    0m41.892s
user    0m41.377s
sys     0m0.481s
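
One way to do the clean-environment check is to start xtb through `env -i`, so that only the variables listed explicitly reach the process. The helper below is a minimal sketch, not a tested recipe: `run_clean` is a hypothetical name, OMP_PROC_BIND and MKL_NUM_THREADS are left out on purpose to test the binding hypothesis, and it assumes xtb is on PATH and needs no other variables (add any library or xtb path variables your install requires).

```shell
# Run a command with an otherwise empty environment plus explicit OpenMP settings.
# $1 = OMP_MAX_ACTIVE_LEVELS, $2 = OMP_NUM_THREADS, remaining arguments = command.
# run_clean is a hypothetical helper name, not part of xtb.
run_clean() {
    levels=$1
    threads=$2
    shift 2
    # env -i drops every inherited variable; only the ones listed here are set.
    env -i PATH="$PATH" \
        OMP_STACKSIZE=1G \
        OMP_MAX_ACTIVE_LEVELS="$levels" \
        OMP_NUM_THREADS="$threads" \
        "$@"
}

# Usage (assumes mol-269.xyz is in the current directory):
#   run_clean 0 44 xtb mol-269.xyz --opt extreme
#   run_clean 1 44 xtb mol-269.xyz --opt extreme
```

If the 0 versus 1 difference survives in this stripped environment, processor binding is unlikely to be the cause.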

export OMP_MAX_ACTIVE_LEVELS=1 roughly doubles the performance (19.2 s wall time instead of 41.9 s, cpu/wall ratio of about 44):

------------------------------------------------------------------------
 * finished run on 2020/03/06 at 07:16:25.940
------------------------------------------------------------------------
 total:
 * wall-time:     0 d,  0 h,  0 min, 19.201 sec
 *  cpu-time:     0 d,  0 h, 14 min,  3.485 sec
 * ratio c/w:    43.930 speedup
 SCF:
 * wall-time:     0 d,  0 h,  0 min,  0.123 sec
 *  cpu-time:     0 d,  0 h,  0 min,  5.354 sec
 * ratio c/w:    43.596 speedup
 ANC optimizer:
 * wall-time:     0 d,  0 h,  0 min, 18.958 sec
 *  cpu-time:     0 d,  0 h, 13 min, 53.968 sec
 * ratio c/w:    43.991 speedup

normal termination of xtb

real    0m19.211s
user    13m34.691s
sys     0m28.855s
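
To compare runs without doing the arithmetic by hand, the timing lines can be converted to seconds and divided. This is a small sketch that assumes the "d, h, min, sec" line layout shown in the logs above; `xtb_seconds` and `ratio` are hypothetical helper names.

```shell
# Convert one xtb timing line read from stdin, for example
#   " * wall-time:     0 d,  0 h,  0 min, 41.888 sec"
# into seconds with three decimals.
xtb_seconds() {
    awk '{
        gsub(/,/, "")          # drop the commas; awk re-splits the fields
        s = 0
        for (i = 2; i <= NF; i++) {
            if ($i == "d")   s += $(i - 1) * 86400
            if ($i == "h")   s += $(i - 1) * 3600
            if ($i == "min") s += $(i - 1) * 60
            if ($i == "sec") s += $(i - 1)
        }
        printf "%.3f\n", s
    }'
}

# Print the ratio of two durations given in seconds, with two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Usage:
#   old=$(echo " * wall-time:     0 d,  0 h,  0 min, 41.888 sec" | xtb_seconds)
#   new=$(echo " * wall-time:     0 d,  0 h,  0 min, 19.201 sec" | xtb_seconds)
#   ratio "$old" "$new"
```

For the two total wall times above this gives 2.18, so "double" is slightly conservative.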

export OMP_NUM_THREADS=22,44
export OMP_NUM_THREADS=22,1

A lower thread count is faster (17.4 s wall time with 22 threads versus 19.2 s with 44), probably due to less thermal throttling and a higher clock frequency (MHz), even though 96 threads are available on the CPU. That means in this case a $12,000 CPU is roughly as fast as a $1,000 CPU.

------------------------------------------------------------------------
 * finished run on 2020/03/06 at 07:28:57.110
------------------------------------------------------------------------
 total:
 * wall-time:     0 d,  0 h,  0 min, 17.418 sec
 *  cpu-time:     0 d,  0 h,  6 min, 22.387 sec
 * ratio c/w:    21.953 speedup
 SCF:
 * wall-time:     0 d,  0 h,  0 min,  0.107 sec
 *  cpu-time:     0 d,  0 h,  0 min,  2.338 sec
 * ratio c/w:    21.787 speedup
 ANC optimizer:
 * wall-time:     0 d,  0 h,  0 min, 17.214 sec
 *  cpu-time:     0 d,  0 h,  6 min, 18.502 sec
 * ratio c/w:    21.988 speedup

normal termination of xtb

real    0m17.427s
user    6m10.136s
sys     0m12.384s
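
Since the best thread count is not obvious, a quick scan over a few values helps. The sketch below is only an illustration: `scan_threads` is a hypothetical helper, it sets OMP_NUM_THREADS to "n,1" (n threads at the outer level, 1 for nested regions), and it assumes the first "wall-time" line in the xtb output is the total, as in the logs above.

```shell
# Run a command once per thread count and print the first wall-time line.
# $1 = space-separated list of thread counts, remaining arguments = command.
# scan_threads is a hypothetical helper name, not part of xtb.
scan_threads() {
    counts=$1
    shift
    for n in $counts; do
        # "$n,1": n threads for the outer parallel level, 1 for nested levels.
        wall=$(env OMP_NUM_THREADS="$n,1" "$@" 2>&1 | grep 'wall-time' | head -n 1)
        echo "$n threads:$wall"
    done
}

# Usage (assumes xtb is on PATH and mol-269.xyz is in the current directory):
#   scan_threads "8 16 22 44" xtb mol-269.xyz --opt extreme
```

Single runs are noisy, so repeating each setting a few times before drawing conclusions is advisable.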

LOL: Recommended Customer Price $13012.0 via Intel ARK

~/intel$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping:            7
CPU MHz:             3149.476
BogoMIPS:            6000.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-23,48-71
NUMA node1 CPU(s):   24-47,72-95
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni

Recommended Customer Price: $1089.00 via Intel ARK (Core i7-6900K, lscpu below)

xtb mol-269.xyz --opt extreme
------------------------------------------------------------------------
 * finished run on 2020/03/05 at 23:16:54.321
------------------------------------------------------------------------
 total:
 * wall-time:     0 d,  0 h,  0 min, 21.252 sec
 *  cpu-time:     0 d,  0 h,  2 min, 46.937 sec
 * ratio c/w:     7.855 speedup
 SCF:
 * wall-time:     0 d,  0 h,  0 min,  0.137 sec
 *  cpu-time:     0 d,  0 h,  0 min,  1.080 sec
 * ratio c/w:     7.863 speedup
 ANC optimizer:
 * wall-time:     0 d,  0 h,  0 min, 21.016 sec
 *  cpu-time:     0 d,  0 h,  2 min, 45.117 sec
 * ratio c/w:     7.857 speedup

normal termination of xtb

lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz
Stepping:              1
CPU MHz:               1198.860
CPU max MHz:           4000.0000
CPU min MHz:           1200.0000
BogoMIPS:              6385.72
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-15
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d
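
When choosing OMP_NUM_THREADS it helps to know the physical core count rather than the hardware thread count. The sketch below multiplies "Core(s) per socket" by "Socket(s)" from lscpu output; `physical_cores` is a hypothetical helper name, and on a virtual machine (the Xeon box reports a KVM hypervisor) the result is whatever topology the hypervisor exposes.

```shell
# Read lscpu output on stdin and print sockets x cores per socket.
# physical_cores is a hypothetical helper name.
physical_cores() {
    awk -F: '
        /^Core\(s\) per socket/ { cores = $2 + 0 }     # "+ 0" strips the padding
        /^Socket\(s\)/          { sockets = $2 + 0 }
        END { print cores * sockets }
    '
}

# Usage:
#   lscpu | physical_cores
```

For the two listings above this gives 48 for the Xeon machine (24 per socket) and 8 for the i7-6900K. The 22-thread run therefore fits within the cores of a single socket, and the i7 cpu/wall ratio of 7.855 is close to its 8 physical cores.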