ratt-ru / QuartiCal

CubiCal, but with greater power.
MIT License

BLAS : Program is Terminated. Because you tried to allocate too many memory regions. #97

Closed o-smirnov closed 2 years ago

o-smirnov commented 3 years ago
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Bad memory unallocation! :  128  0x7f252fe6d000
BLAS : Bad memory unallocation! :  128  0x7f24a4dd1000
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Bad memory unallocation! :  128  0x7f32fa3e6000
BLAS : Bad memory unallocation! :  128  0x7f2ae0750000
Segmentation fault (core dumped)

BLAS is maths library not English library, clearly. Because you shouldn't start a sentence with a preposition. Admittedly, is lesser sin than gluttony! Because trying to allocate too many memory regions is.

This was on a 32-antenna, 1k-channel MeerKAT MS with 128 dask threads. I reduced the threads to 64, and now it runs (in a steady and modest ~80G of memory, so I find it odd that doubling the threads caused this gluttony).

input_ms:
  path: ../msdir/1627405250_sdp_l0-J2009_2026-corr.ms
  data_column: DATA
  weight_column: WEIGHT_SPECTRUM
  time_chunk: '128s'
  freq_chunk: '1GHz'
  select_fields: []
  select_ddids: []
input_model:
  recipe: MODEL_DATA:DIR1_DATA
  apply_p_jones: true
solver:
  terms: [G,dE]
  iter_recipe: [25,25,25,25,25]
output:
  gain_dir: gains.qc
  products: [corrected_data, corrected_residual]
  columns: [CORR_DATA, RES_DATA]
  net_gain: true
mad_flags:
  enable: false
  threshold_bl: 10
  threshold_global: 12
dask:
  threads: 64
  scheduler: distributed
G:
  type: delay
  direction_dependent: false
  time_interval: '8s'
  freq_interval: '1GHz'
  # load_from:
  # interp_mode: reim
  # interp_method: 2dlinear
dE:
  type: complex
  direction_dependent: true
  time_interval: '128s'
  freq_interval: '50MHz'
bennahugo commented 3 years ago

ATLAS instead?

JSKenyon commented 3 years ago

This is a tricky one, and it is definitely related to the number of threads in use. These are probably relevant: https://github.com/xianyi/OpenBLAS/issues/1882 and https://stackoverflow.com/questions/45086246/too-many-memory-regions-error-with-dask.

@o-smirnov Can you please try doing export MKL_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 OMP_NUM_THREADS=1 and rerunning with 128 threads? If that runs through then the culprit is nested parallelism in some of the numpy/numba routines. I will get round to disabling this from the get-go at some point as it is detrimental to performance.
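
For anyone who prefers to run the same experiment from a Python launcher script instead of the shell, here is a minimal sketch. It assumes the variables are set before numpy (or anything else that links against BLAS) is first imported, since these libraries typically read them once at load time. `pin_blas_threads` is a hypothetical helper, not part of QuartiCal, and `OPENBLAS_NUM_THREADS` is an addition to the three variables above, included on the assumption that the numpy build in question uses OpenBLAS (which the error message suggests).

```python
import os

# Variables read by the common threaded numeric backends.
# OPENBLAS_NUM_THREADS is an assumption added here; the other three
# come from the suggestion in this thread.
THREAD_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def pin_blas_threads(n=1, env=os.environ):
    """Set each thread-count variable to n and return what was set."""
    for var in THREAD_VARS:
        env[var] = str(n)
    return {var: env[var] for var in THREAD_VARS}


# Call this first; import numpy, numba, dask etc. only afterwards.
pinned = pin_blas_threads(1)
print(pinned)
```

With the distributed scheduler the workers may be separate processes, so the variables would need to be in place before the cluster is started in order for the workers to inherit them.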

JSKenyon commented 3 years ago

Note that the nested parallelism that QuartiCal itself uses shouldn't have this problem. This is specifically every dask thread trying to use 128 threads when invoking parallel numpy-like functions (I suspect that in your case the culprit is the np.linalg.solve in the delay solver).
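
As a rough illustration of why the thread count matters so much, the worst case multiplies rather than adds. The sketch below assumes the BLAS library defaults to 128 threads per call (as described above for this machine) and that every dask thread enters a threaded BLAS routine at the same moment. `worst_case_threads` is a hypothetical helper for the arithmetic only; it does not model the actual limit at which the BLAS library gives up.

```python
def worst_case_threads(dask_threads, blas_threads_per_call):
    """Upper bound on simultaneous BLAS threads if every dask thread
    calls into a threaded BLAS routine at the same time."""
    return dask_threads * blas_threads_per_call


BLAS_DEFAULT = 128  # assumption: BLAS defaults to one thread per core

for n_dask in (64, 128):
    unpinned = worst_case_threads(n_dask, BLAS_DEFAULT)
    pinned = worst_case_threads(n_dask, 1)
    print(f"{n_dask} dask threads: up to {unpinned} BLAS threads unpinned, {pinned} pinned")

# 64 dask threads: up to 8192 BLAS threads unpinned, 64 pinned
# 128 dask threads: up to 16384 BLAS threads unpinned, 128 pinned
```

Under these assumptions, going from 64 to 128 dask threads doubles the number of threads that can be inside BLAS at once, which may be enough to cross a fixed internal limit even when overall memory use looks modest.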

JSKenyon commented 2 years ago

Closing for now - there is no longer an np.linalg.solve call. Please reopen if you encounter this issue again.