Documentation on performance tuning exercise

Just doing a mind dump of the past few days' work here for posterity. I've done careful profiling and tuning for com08 with the following config:

(venv3) > $ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
Stepping:            4
CPU MHz:             1000.815
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            33792K
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d

Within each NUMA node's CPU list, the first 24 and the last 24 logical CPUs are hyperthreads that pair up elementwise on the same physical cores. I've assigned affinities as such:

taskset -c 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65...

This makes a large difference (30%) in run times. The Dask thread pool is thus pinned to CPUs from this list, however large the thread pool becomes (the x-axis of the plots below).
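
A minimal sketch of how such an affinity list could be generated from the NUMA lists reported by lscpu, so that one hyperthread per physical core is used before any sibling. The function names (`parse_cpu_list`, `physical_cores_first`, `pin_current_process`) are hypothetical and not part of tricolour. It assumes the first half of each node's list holds one thread per core and the second half holds the siblings. The order of the sibling half is a guess, since the list above is truncated and its visible part appears to continue with node 1's siblings.

```python
import os


def parse_cpu_list(text):
    """Parse an lscpu-style CPU list such as '0,2,4-6' into a list of ints."""
    cpus = []
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        elif part:
            cpus.append(int(part))
    return cpus


def physical_cores_first(numa_nodes):
    """Order CPUs so that one hyperthread per physical core comes first.

    Assumption: within each node's list, entry i of the first half and
    entry i of the second half are siblings on the same physical core.
    """
    first_threads, siblings = [], []
    for cpus in numa_nodes:
        half = len(cpus) // 2
        first_threads.extend(cpus[:half])
        siblings.extend(cpus[half:])
    return first_threads + siblings


def pin_current_process(cpus):
    """Pin the calling process to the given CPUs (Linux only)."""
    os.sched_setaffinity(0, set(cpus))


# NUMA lists from the lscpu output above, written as ranges for brevity.
node0 = list(range(0, 96, 2))  # 0,2,...,94
node1 = list(range(1, 96, 2))  # 1,3,...,95
order = physical_cores_first([node0, node1])

# For a thread pool of size n, take the first n entries of the ordering.
nthreads = 48
print(",".join(str(c) for c in order[:nthreads]))  # usable with `taskset -c`
```

Note that taskset itself treats the list as a set; the ordering only matters for deciding which CPUs to include at a given thread pool size.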

The memory layout is as follows. I didn't profile the memory footprint, but it stayed at about 1/5 of this size for the most part.

MemTotal:       528052508 kB
MemFree:        257133992 kB
MemAvailable:   520132728 kB
Buffers:         3864760 kB
Cached:         255178768 kB
SwapCached:       540016 kB
Active:         133666732 kB
Inactive:       126986284 kB
Active(anon):     755376 kB
Inactive(anon):   856168 kB
Active(file):   132911356 kB
Inactive(file): 126130116 kB
Unevictable:       30600 kB
Mlocked:           30600 kB
SwapTotal:      545259512 kB
SwapFree:       468834360 kB
Dirty:                92 kB
Writeback:             0 kB
AnonPages:       1577392 kB
Mapped:           126464 kB
Shmem:              2064 kB
Slab:            8896076 kB
SReclaimable:    7375300 kB
SUnreclaim:      1520776 kB
KernelStack:       26832 kB
PageTables:        41160 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    809285764 kB
Committed_AS:   83770652 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:    35453760 kB
DirectMap2M:    494800896 kB
DirectMap1G:     8388608 kB

iTLB miss ratios are high, but relative to the number of data TLB accesses the misses are essentially negligible. What matters more is tuning the number of baselines per block to lower the L3 cache misses, as discussed with @bmerry.
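
As a rough back-of-envelope aid for the baselines-per-block tuning, the sketch below estimates how many baselines of one block fit in L3 at once. Everything here is an assumption for illustration: the function name, the block shape (100 timesteps, 4096 channels, 4 correlations), and the 9 bytes per sample (an 8-byte complex64 visibility plus a 1-byte flag). The flagger's real working set will differ, for example because of intermediate arrays, so treat this as an order-of-magnitude guide only.

```python
def baselines_fitting_in_cache(cache_bytes, ntime, nchan, ncorr,
                               bytes_per_sample=9):
    """Crude count of baselines whose samples fit in a cache of the given size.

    bytes_per_sample=9 assumes complex64 data (8 bytes) plus a uint8 flag.
    """
    bytes_per_baseline = ntime * nchan * ncorr * bytes_per_sample
    return cache_bytes // bytes_per_baseline


# L3 size per socket as reported by lscpu above (33792K).
l3_bytes = 33792 * 1024

# Hypothetical block shape; not taken from the actual runs.
n_fit = baselines_fitting_in_cache(l3_bytes, ntime=100, nchan=4096, ncorr=4)
print(f"baselines fitting in L3 under these assumptions: {n_fit}")
```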

I used 112.61 GiB of data (an 856 MHz band channelized to 208 kHz resolution and dumped at 1 s resolution) to profile the flagger. Using any less starts to break the strong scaling. I suspect we start running into compiler / MAD flagger / Dask overheads in that regime. For small MSv2 datasets (<< 100 GiB, incl. metadata) the scaling falls off a cliff.

Python profiling with pprofile is inconclusive. I suspect the profiler does not correctly account for calls into external non-Python libraries. For instance, I'm really suspicious of the very low figure (roughly 0.02%) reported for casacore getcol and putcol calls for sizeable reads of tens of GiB! So I don't think we can trust the callgraph profile output. cProfile does not take threads into account, so it is of limited use, although I know from DDF profiling that it accounts for C calls correctly.

See below for a much smaller dataset (~60 GB, 1k channels, 8 s dump time). Here we run into the weak scaling mentioned above.
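
To put numbers on the scaling behaviour from a set of wall-clock times, speedup and parallel efficiency relative to the single-thread run can be computed as in the sketch below. The function names are hypothetical and the timings are placeholders for illustration, not measurements from these runs.

```python
def speedup(t_single, t_parallel):
    """Speedup of a run relative to the single-thread run time."""
    return t_single / t_parallel


def parallel_efficiency(t_single, t_parallel, nthreads):
    """Fraction of ideal linear speedup achieved with nthreads threads."""
    return speedup(t_single, t_parallel) / nthreads


# Placeholder timings in seconds, keyed by thread pool size (illustrative only).
timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 25.0}

t1 = timings[1]
for n in sorted(timings):
    s = speedup(t1, timings[n])
    e = parallel_efficiency(t1, timings[n], n)
    print(f"{n:>3} threads: speedup {s:5.2f}, efficiency {e:5.2f}")
```

An efficiency that stays close to 1 as threads are added indicates good strong scaling; a sharp drop marks the point where fixed overheads start to dominate.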

The following target-type strategy was applied:


# List of strategies to apply in order
strategies:
    # only enable me if you really want to start from scratch
    # -
    #   name: reset_flags
    #   task: unflag
    -
        name: nan_dropouts_flag
        task: flag_nans_zeros
    -
        name: background_static_mask
        task: apply_static_mask
        kwargs:
            accumulation_mode: "or"
            uvrange: ""
    -
        name: background_flags
        task: sum_threshold
        kwargs:
            outlier_nsigma: 15
            windows_time: [1, 2, 4, 8]
            windows_freq: [1, 2, 4, 8]
            background_reject: 2.0
            background_iterations: 5
            spike_width_time: 12.5
            spike_width_freq: 10.0
            time_extend: 3
            freq_extend: 3
            freq_chunks: 10
            average_freq: 1
            flag_all_time_frac: 0.6
            flag_all_freq_frac: 0.8
            rho: 1.3
            num_major_iterations: 3
    -
        name: residual_flag_initial
        task: uvcontsub_flagger
        kwargs:
            major_cycles: 3
            or_original_from_cycle: 1
            taylor_degrees: 20
            sigma: 15.0
    # flags are discarded at this point since we or from cycle 1
    # reflag nans and zeros
    -
        name: nan_dropouts_reflag
        task: flag_nans_zeros
    -
        name: uvrange_static_mask
        task: apply_static_mask
        kwargs:
            accumulation_mode: "or"
            uvrange: "0~1000"
    -
        name: final_st_very_broad
        task: sum_threshold
        kwargs:
            outlier_nsigma: 15
            windows_time: [1, 2, 4, 8]
            windows_freq: [32, 48, 64, 128]
            background_reject: 2.0
            background_iterations: 5
            spike_width_time: 6.5
            spike_width_freq: 64.0
            time_extend: 3
            freq_extend: 3
            freq_chunks: 10
            average_freq: 1
            flag_all_time_frac: 0.6
            flag_all_freq_frac: 0.8
            rho: 1.3
            num_major_iterations: 1
    -
        name: final_st_broad
        task: sum_threshold
        kwargs:
            outlier_nsigma: 15
            windows_time: [1, 2, 4, 8]
            windows_freq: [1, 2, 4, 8]
            background_reject: 2.0
            background_iterations: 5
            spike_width_time: 6.5
            spike_width_freq: 10.0
            time_extend: 3
            freq_extend: 3
            freq_chunks: 10
            average_freq: 1
            flag_all_time_frac: 0.6
            flag_all_freq_frac: 0.8
            rho: 1.3
            num_major_iterations: 1
    -
        name: final_st_narrow
        task: sum_threshold
        kwargs:
            outlier_nsigma: 15
            windows_time: [1, 2, 4, 8]
            windows_freq: [1, 2, 4, 8]
            background_reject: 2.0
            background_iterations: 5
            spike_width_time: 2
            spike_width_freq: 10.0
            time_extend: 3
            freq_extend: 3
            freq_chunks: 10
            average_freq: 1
            flag_all_time_frac: 0.6
            flag_all_freq_frac: 0.8
            rho: 1.3
            num_major_iterations: 1
    -
        name: residual_flag_final
        task: uvcontsub_flagger
        kwargs:
            major_cycles: 3
            or_original_from_cycle: 0
            taylor_degrees: 25
            sigma: 15.0
    -
        name: flag_autos
        task: flag_autos
    -
        name: combine_with_input_flags
        task: combine_with_input_flags
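
For readers unfamiliar with the `outlier_nsigma`, `rho` and window parameters of the sum_threshold tasks, the sketch below shows how the per-window threshold falls off in the standard SumThreshold formulation from the literature, where the threshold for a window of M samples is the single-sample threshold divided by rho^log2(M). Whether tricolour applies exactly this scaling is an assumption here, and the function name is hypothetical.

```python
import math


def sum_threshold_levels(outlier_nsigma, rho, windows):
    """Per-window thresholds (in units of sigma) for SumThreshold.

    Assumes the standard scaling: threshold_M = threshold_1 / rho**log2(M).
    """
    return [outlier_nsigma / rho ** math.log2(w) for w in windows]


# Values taken from the background_flags step of the strategy above.
windows = [1, 2, 4, 8]
levels = sum_threshold_levels(outlier_nsigma=15, rho=1.3, windows=windows)

for w, level in zip(windows, levels):
    print(f"window {w:>3}: {level:6.2f} sigma")
```

Under this scaling, wider windows use a lower per-sample threshold, which is what lets the flagger pick up broad but faint interference that no single sample would trigger.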
