bennahugo opened this issue 4 years ago
The following target-type strategy was applied:
```yaml
# List of strategies to apply in order
strategies:
  # only enable me if you really want to start from scratch
  # -
  #   name: reset_flags
  #   task: unflag
  -
    name: nan_dropouts_flag
    task: flag_nans_zeros
  -
    name: background_static_mask
    task: apply_static_mask
    kwargs:
      accumulation_mode: "or"
      uvrange: ""
  -
    name: background_flags
    task: sum_threshold
    kwargs:
      outlier_nsigma: 15
      windows_time: [1, 2, 4, 8]
      windows_freq: [1, 2, 4, 8]
      background_reject: 2.0
      background_iterations: 5
      spike_width_time: 12.5
      spike_width_freq: 10.0
      time_extend: 3
      freq_extend: 3
      freq_chunks: 10
      average_freq: 1
      flag_all_time_frac: 0.6
      flag_all_freq_frac: 0.8
      rho: 1.3
      num_major_iterations: 3
  -
    name: residual_flag_initial
    task: uvcontsub_flagger
    kwargs:
      major_cycles: 3
      or_original_from_cycle: 1
      taylor_degrees: 20
      sigma: 15.0
  # flags are discarded at this point since we or from cycle 1
  # reflag nans and zeros
  -
    name: nan_dropouts_reflag
    task: flag_nans_zeros
  -
    name: uvrange_static_mask
    task: apply_static_mask
    kwargs:
      accumulation_mode: "or"
      uvrange: "0~1000"
  -
    name: final_st_very_broad
    task: sum_threshold
    kwargs:
      outlier_nsigma: 15
      windows_time: [1, 2, 4, 8]
      windows_freq: [32, 48, 64, 128]
      background_reject: 2.0
      background_iterations: 5
      spike_width_time: 6.5
      spike_width_freq: 64.0
      time_extend: 3
      freq_extend: 3
      freq_chunks: 10
      average_freq: 1
      flag_all_time_frac: 0.6
      flag_all_freq_frac: 0.8
      rho: 1.3
      num_major_iterations: 1
  -
    name: final_st_broad
    task: sum_threshold
    kwargs:
      outlier_nsigma: 15
      windows_time: [1, 2, 4, 8]
      windows_freq: [1, 2, 4, 8]
      background_reject: 2.0
      background_iterations: 5
      spike_width_time: 6.5
      spike_width_freq: 10.0
      time_extend: 3
      freq_extend: 3
      freq_chunks: 10
      average_freq: 1
      flag_all_time_frac: 0.6
      flag_all_freq_frac: 0.8
      rho: 1.3
      num_major_iterations: 1
  -
    name: final_st_narrow
    task: sum_threshold
    kwargs:
      outlier_nsigma: 15
      windows_time: [1, 2, 4, 8]
      windows_freq: [1, 2, 4, 8]
      background_reject: 2.0
      background_iterations: 5
      spike_width_time: 2
      spike_width_freq: 10.0
      time_extend: 3
      freq_extend: 3
      freq_chunks: 10
      average_freq: 1
      flag_all_time_frac: 0.6
      flag_all_freq_frac: 0.8
      rho: 1.3
      num_major_iterations: 1
  -
    name: residual_flag_final
    task: uvcontsub_flagger
    kwargs:
      major_cycles: 3
      or_original_from_cycle: 0
      taylor_degrees: 25
      sigma: 15.0
  -
    name: flag_autos
    task: flag_autos
  -
    name: combine_with_input_flags
    task: combine_with_input_flags
```
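For quick sanity-checking, a strategy file like the one above can be loaded and summarised with PyYAML before handing it to tricolour. A minimal sketch; `target_strategy.yaml` is a placeholder filename, not part of the config above:

```python
import yaml

# "target_strategy.yaml" is a placeholder for the strategy listed above.
with open("target_strategy.yaml") as f:
    config = yaml.safe_load(f)

# Print the task order and any kwargs so the applied strategy is obvious.
for step in config["strategies"]:
    print(f"{step['name']:>30s} -> {step['task']}")
    for key, value in step.get("kwargs", {}).items():
        print(f"{'':>34s}{key} = {value}")
```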
Just doing a mind dump of the past few days' work here for posterity. I've done careful profiling and tuning for com08 with the following config:
In this layout, the first 24 and the last 24 hardware threads per NUMA node are co-located pairwise on the same physical cores as hyperthread siblings (i.e. thread i and thread i + 24 share a core). I've assigned affinities accordingly:
This makes a large difference (~30%) to run times. The dask threadpool is therefore restricted to these core numbers, however large the threadpool becomes (the x-axis of the plots below). A minimal sketch of this kind of pinning follows.
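For reference, a Linux-only sketch of pinning the dask threaded scheduler to one logical CPU per physical core. The CPU numbers and array sizes are illustrative assumptions, not the actual com08 layout:

```python
import os
from multiprocessing.pool import ThreadPool

import dask
import dask.array as da

# Illustrative only: pick one logical CPU per physical core, i.e. avoid
# hyperthread siblings. Assumes CPUs 0-23 and 24-47 are sibling pairs
# on one NUMA node, as described above; adjust for the real topology.
pinned_cpus = set(range(0, 24))

# Restrict the whole process (and hence the dask threadpool it spawns)
# to those logical CPUs.
os.sched_setaffinity(0, pinned_cpus)

# Size the threaded scheduler's pool to match the pinned core count.
with dask.config.set(pool=ThreadPool(len(pinned_cpus))):
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
    print(x.mean().compute())
```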
The memory layout is as follows. I didn't profile the memory footprint in detail, but it mostly stayed at about 1/5th of this size.
iTLB miss ratios are high, but relative to the total number of data TLB accesses the misses are essentially negligible. What matters more is tuning the number of baselines per block to lower the L3 cache misses, as discussed with @bmerry. A back-of-envelope sketch of that trade-off is given below.
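To illustrate the tuning, here is a rough working-set estimate per baseline block compared against the L3 size. Every number, the dtypes and the L3 size are assumptions for illustration, not com08 measurements:

```python
import numpy as np

# Back-of-envelope working-set estimate per baseline block.
# All values below are illustrative assumptions, not com08 measurements.
baselines_per_block = 6      # the tuning knob discussed above
n_times = 128                # dump rows held per block
n_chans = 4096               # channels per block
n_corr = 4                   # correlation products
vis_dtype = np.complex64     # 8 bytes per visibility
flag_dtype = np.uint8        # 1 byte per flag

cells = baselines_per_block * n_times * n_chans * n_corr
vis_bytes = cells * np.dtype(vis_dtype).itemsize
flag_bytes = cells * np.dtype(flag_dtype).itemsize
working_set_mib = (vis_bytes + flag_bytes) / 2**20

l3_mib = 32  # hypothetical L3 cache size per socket
print(f"working set ~ {working_set_mib:.1f} MiB vs L3 ~ {l3_mib} MiB")
# If the working set greatly exceeds L3, shrink baselines_per_block.
```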
I used 112.61 GiB of data (an 856 MHz band channelized to 208 kHz resolution and dumped at 1 s resolution) to profile the flagger. Using anything less actually starts breaking the strong scaling here; I suspect we start running into compiler / MAD flagger / dask overheads in that regime. For small MSv2 datasets (<< 100 GiB including metadata) the scaling falls off a cliff dramatically.

Python profiling with pprofile is inconclusive. I suspect the profiler does not correctly account for calls into external non-Python libraries; for instance, I'm really suspicious of the very low (~0.02%) share attributed to the casacore getcol and putcol calls given reads of tens of GiB (see the timing sketch at the end of this comment for one way to cross-check this). So I don't think we can trust the call-graph profile output. cProfile does not take threads into account, so it is of limited use, although I know from DDF profiling that it does account for C calls correctly.

See below for a much smaller dataset (~60 GB, 1k channels, 8 s dump time). Here we run into weak scaling, as mentioned above.
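One way to cross-check the suspiciously low getcol/putcol percentages independently of any profiler is to wrap the table I/O with a plain wall-clock timer. A minimal sketch using python-casacore; the MS name and row count are placeholders:

```python
import time
from contextlib import contextmanager

from casacore.tables import table  # python-casacore

@contextmanager
def timed(label, totals):
    """Accumulate wall-clock time per label, independent of any profiler."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[label] = totals.get(label, 0.0) + time.perf_counter() - start

totals = {}
tab = table("test.ms", readonly=True)  # placeholder MS name
try:
    nrow = min(10_000, tab.nrows())    # placeholder chunk size
    with timed("getcol(DATA)", totals):
        data = tab.getcol("DATA", startrow=0, nrow=nrow)
finally:
    tab.close()

print(totals)
```

Comparing these wall-clock totals against the profiler's attribution for the same run would show whether the ~0.02% figure is believable.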