ratt-ru / tricolour

Holds an offline, MS direct version of the SDP online flagger.

52 GB scan arrays create NUMA issues #25

Closed bennahugo closed 5 years ago

bennahugo commented 5 years ago

@sjperkins master now goes nowhere fast.

tricolour - 2019-06-21 16:27:18,449 INFO - Searching /home/hugo/.tricolour
tricolour - 2019-06-21 16:27:18,449 INFO - Found static mask file /home/hugo/tricolour/tricolour/data/4k_lband_meerkat.staticmask
tricolour - 2019-06-21 16:27:18,451 INFO - Loaded mask /home/hugo/tricolour/tricolour/data/4k_lband_meerkat.staticmask (non-dilated) with 41.50% flagged bandwidth between 0.856 and 1.712 GHz
tricolour - 2019-06-21 16:27:22,960 INFO - Adding field 'J0825-5010' scan 1 to compute graph for processing
/home/hugo/tricolour/tricolour/packing.py:212: PerformanceWarning: Increasing number of chunks by factor of 87
  dtype=np.bool)
/home/hugo/tricolour/tricolour/packing.py:220: PerformanceWarning: Increasing number of chunks by factor of 87
  dtype=data.dtype)
/home/hugo/tricolour/tricolour/packing.py:227: PerformanceWarning: Increasing number of chunks by factor of 87
  dtype=flags.dtype)
/home/hugo/tricolour/tricolour/packing.py:271: PerformanceWarning: Increasing number of chunks by factor of 87
  dtype=flag_windows.dtype)
tricolour - 2019-06-21 16:27:24,200 INFO - Adding field 'J0825-5010' scan 2 to compute graph for processing
tricolour - 2019-06-21 16:27:25,457 INFO - Adding field 'J0825-5010' scan 3 to compute graph for processing
tricolour - 2019-06-21 16:27:26,689 INFO - Adding field 'J0825-5010' scan 4 to compute graph for processing
tricolour - 2019-06-21 16:27:27,939 INFO - Adding field 'J0825-5010' scan 5 to compute graph for processing
tricolour - 2019-06-21 16:27:29,123 INFO - Adding field 'J0825-5010' scan 6 to compute graph for processing
tricolour - 2019-06-21 16:27:30,307 INFO - Adding field 'J0825-5010' scan 7 to compute graph for processing
tricolour - 2019-06-21 16:27:31,527 INFO - Adding field 'J0825-5010' scan 8 to compute graph for processing
tricolour - 2019-06-21 16:27:32,761 INFO - Adding field 'J0825-5010' scan 9 to compute graph for processing
tricolour - 2019-06-21 16:27:34,009 INFO - Adding field 'J0825-5010' scan 10 to compute graph for processing
/home/hugo/J0825/venv/local/lib/python2.7/site-packages/dask/blockwise.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  contains = index in indices
[##############                          ] | 37% Completed | 65hr 43min 58.6s
sjperkins commented 5 years ago

On a single small 32k scan (32k1934.ms.2min.ms) the difference is not large (this is on stevie, whose disk may be contended).

xarray-ms master:

tricolour - 2019-06-24 11:13:47,654 INFO - Data flagged successfully in 00h07m37s

xarray-ms 0.1.6:

tricolour - 2019-06-24 11:25:57,847 INFO - Data flagged successfully in 00h05m51s

I'll see what happens on a larger dataset.

sjperkins commented 5 years ago

On my laptop, with a 3-scan 4K dataset (1527016443_sdp_l0.full_1284.hh_vv.ms), the difference is negligible:

xarray-ms master:

tricolour - 2019-06-24 10:51:58,933 INFO - Data flagged successfully in 00h10m33s

xarray-ms 0.1.6:

tricolour - 2019-06-24 11:51:31,495 INFO - Data flagged successfully in 00h10m31s
sjperkins commented 5 years ago

Working hypothesis: 52 GB scan arrays are being created that are too big to fit on a single DIMM, leading to unnecessary inter-CPU communication.

Solution: Allocate per-baseline arrays.
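
For illustration, a minimal dask sketch of the two allocation strategies, with hypothetical dimension names and sizes (this is not the actual packing code in packing.py):

import numpy as np
import dask.array as da

# Hypothetical dims giving ~52 GB at complex64
ntime, nchan, nbl, ncorr = 800, 4096, 496, 4

# Monolithic: one 52 GB block, allocated and filled by a single thread,
# necessarily straddling NUMA nodes.
vis = da.zeros((ntime, nchan, nbl, ncorr),
               chunks=(ntime, nchan, nbl, ncorr),
               dtype=np.complex64)

# Per-baseline: chunk along the baseline axis, so each ~105 MB block is
# allocated by whichever worker thread first touches it, and its pages
# land on that thread's local node.
vis = da.zeros((ntime, nchan, nbl, ncorr),
               chunks=(ntime, nchan, 1, ncorr),
               dtype=np.complex64)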

sjperkins commented 5 years ago

Spot the non-linearity:

# 7.45 GB
In [11]: %timeit -n 1 -r 1 np.ones(int(1e9), dtype=np.complex64)
1 loop, best of 1: 2.1 s per loop

# 14.9 GB
In [12]: %timeit -n 1 -r 1 np.ones(int(2e9), dtype=np.complex64)
1 loop, best of 1: 4.22 s per loop

# 29.8 GB
In [13]: %timeit -n 1 -r 1 np.ones(int(4e9), dtype=np.complex64)
1 loop, best of 1: 2min 42s per loop

Things seem to get much slower once the size crosses the 16 GB power-of-two boundary.

# 16.3GB
In [23]: %timeit -n 1 -r 1 np.ones(int(2.2e9), dtype=np.complex64)
1 loop, best of 1: 49.5 s per loop
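
To separate the allocation cost from the first-touch cost, one could time the bare malloc, the zeroed allocation and the fill independently. A rough sketch (assuming enough free RAM for two such arrays; np.empty is a bare malloc and np.zeros typically gets lazily zero-mapped pages, so only the fill should fault every page in):

import time
import numpy as np

n = int(2.2e9)  # ~16.3 GB at complex64

t0 = time.time()
a = np.empty(n, dtype=np.complex64)   # bare malloc: pages not yet touched
t1 = time.time()
b = np.zeros(n, dtype=np.complex64)   # calloc: usually still lazy on Linux
t2 = time.time()
a.fill(1 + 0j)                        # first touch: faults every page in
t3 = time.time()

print("empty: %.2fs  zeros: %.2fs  fill: %.2fs" % (t1 - t0, t2 - t1, t3 - t2))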
bennahugo commented 5 years ago

Can't reproduce. See my email. Are you sure it's not just the page caches playing havoc with you?

sjperkins commented 5 years ago

> Can't reproduce.

Where can't you reproduce this? The above timings are on stevie, where the problem was reported.

bennahugo commented 5 years ago

[image: benchmark output on stevie]

bennahugo commented 5 years ago

This is the OS default without NUMA pinning.
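
For what it's worth, one way to test the pinning hypothesis is to confine the benchmark to a single node, either with numactl --cpunodebind=0 --membind=0 from the shell, or (on Python 3.3+/Linux) directly from Python. A sketch; the CPU list for node 0 is machine-specific, so check numactl --hardware or /sys/devices/system/node/node0/cpulist first:

import os

# Hypothetical layout: node 0 owns CPUs 0-15.
node0_cpus = set(range(0, 16))
os.sched_setaffinity(0, node0_cpus)   # 0 = the current process

# Under the default first-touch policy, memory allocated and filled from
# here on should land on node 0's local DIMMs.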

sjperkins commented 5 years ago

Are you just doing a bare malloc (np.empty), a zeroed malloc (np.zeros), or np.ones?

mallocs and zeroed mallocs are fast; it's filling the array where we see the slowdown.

Anyhow, now that the dead Cubical processes have been cleared out on stevie, things have improved. I suspect that the Cubical shared memory model was holding onto pages and might have provoked swapping.

# 7.45 GB
In [4]: %timeit -r 1 -n 1 np.ones(int(1e9), dtype=np.complex64).fill(1+0j)
1 loop, best of 1: 3.07 s per loop

# 14.9GB
In [5]: %timeit -r 1 -n 1 np.ones(int(2e9), dtype=np.complex64).fill(1+0j)
1 loop, best of 1: 6.24 s per loop

# 29.9 GB
In [6]: %timeit -r 1 -n 1 np.ones(int(4e9), dtype=np.complex64).fill(1+0j)
1 loop, best of 1: 12.2 s per loop

# 60 GB
In [7]: %timeit -r 1 -n 1 np.ones(int(8e9), dtype=np.complex64).fill(1+0j)
1 loop, best of 1: 25.3 s per loop

# 120 GB  non-linear jump here
In [8]: %timeit -r 1 -n 1 np.ones(int(16e9), dtype=np.complex64).fill(1+0j)
1 loop, best of 1: 4min per loop
sjperkins commented 5 years ago

I made a mistake: I allocated with np.ones and then filled with ones again, so the work was duplicated. Anyway, the allocation and fill is faster now.
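
For the record, a corrected version would malloc with np.empty so that the fill is the only pass over the pages, e.g. (sketch):

# np.empty is a bare malloc, so .fill() is the only write that
# touches (and faults in) the pages.
%timeit -r 1 -n 1 np.empty(int(8e9), dtype=np.complex64).fill(1+0j)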

bennahugo commented 5 years ago

It's a malloc and memset with np.zeros.

sjperkins commented 5 years ago

Currently, only a single thread allocates all memory for a window. It's pretty slow.
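
A sketch of how the fill could be spread over threads (parallel_fill is a hypothetical helper, not tricolour code; numpy releases the GIL for large block assignments, so each thread can fault in the pages of its own slice, ideally on its local node under first-touch):

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def parallel_fill(arr, value, nthreads=8):
    # Each thread fills (and therefore first-touches) its own contiguous
    # slice, distributing pages across the nodes the threads run on.
    step = -(-arr.shape[0] // nthreads)  # ceiling division
    def fill_slice(i):
        arr[i * step:(i + 1) * step] = value
    with ThreadPoolExecutor(nthreads) as pool:
        list(pool.map(fill_slice, range(nthreads)))

vis = np.empty(int(8e9), dtype=np.complex64)  # ~60 GB, malloc only
parallel_fill(vis, 1 + 0j)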

sjperkins commented 5 years ago

See further description of related issues in https://github.com/ska-sa/tricolour/issues/36