ratt-ru / CubiCal

A fast radio interferometric calibration suite.
GNU General Public License v2.0

Dealing with OOM errors - setting appropriate parameters. #466

Closed Kincaidr closed 1 year ago

Kincaidr commented 1 year ago

Command: gocubical cc-parsets/solve-kde.parset --data-ms Abell3376_timechannel_t11_c4.ms --data-time-chunk 128 --model-list MODEL_DATA+-DIR1_DATA+-DIR2_DATA+-DIR3_DATA:DIR1_DATA:DIR2_DATA:DIR3_DATA --sol-jones K,de --sol-term-iters 100,50 --k-type complex-2x2 --k-time-int 4 --k-freq-int 0 --out-column CORRECTED_RESIDUAL --out-mode sr --out-dir obs1lb/solve1-kde --out-subtract-dirs 1,2,3 --dist-ncpu 32 --debug-pdb False --dist-safe 0 --dist-nworker 2

Error:

INFO      11:40:36 - main               [io] [3.2/3.5 6.5/10.0 34.6Gb] I/O job(s) complete
INFO      11:40:36 - main               [0.3/3.5 3.6/10.0 34.6Gb] submitting solver jobs for tile 0/1
INFO      11:45:07 - solver             [x02] [153.4/315.4 156.8/346.1 34.6Gb] K kernels are <module 'cubical.kernels.diagdiag_complex' from '/home/kincaid/Software/CubiCal/cubical/kernels/diagdiag_complex.py'> <module 'cubical.kernels.diagdiag_complex' from '/home/kincaid/Software/CubiCal/cubical/kernels/diagdiag_complex.py'>
INFO      11:45:07 - solver             [x02] [153.4/315.4 156.8/346.1 34.6Gb] dE kernels are <module 'cubical.kernels.diagdiag_complex' from '/home/kincaid/Software/CubiCal/cubical/kernels/diagdiag_complex.py'> <module 'cubical.kernels.diagdiag_complex' from '/home/kincaid/Software/CubiCal/cubical/kernels/diagdiag_complex.py'>
INFO      11:45:59 - solver             [x01] [154.7/371.1 158.1/422.6 34.6Gb] K kernels are <module 'cubical.kernels.diagdiag_complex' from '/home/kincaid/Software/CubiCal/cubical/kernels/diagdiag_complex.py'> <module 'cubical.kernels.diagdiag_complex' from '/home/kincaid/Software/CubiCal/cubical/kernels/diagdiag_complex.py'>
INFO      11:45:59 - solver             [x01] [154.7/371.1 158.1/422.6 34.6Gb] dE kernels are <module 'cubical.kernels.diagdiag_complex' from '/home/kincaid/Software/CubiCal/cubical/kernels/diagdiag_complex.py'> <module 'cubical.kernels.diagdiag_complex' from '/home/kincaid/Software/CubiCal/cubical/kernels/diagdiag_complex.py'>
INFO      11:58:59 - solver             [x02] [163.3/344.0 279.5/577.9 34.6Gb] D0T1F0 chi^2_0 1.199; K: 1680/1680 ints (229368-477076 EPA) 60/60 ants, MGE 0.00102; dE: 11328/11520 ints (76-143840 EPA) 4 dirs 60/60 ants, MGE 0.0 0.523 0.826 0.847; noise 0.179, flags: PRIOR:230457000(56.32%) MISSING:6819840(1.67%) MAD:1578526(0.39%) SKIPSOL:16524882(4.04%)
INFO      12:00:13 - solver             [x01] [174.3/411.5 281.9/669.1 34.6Gb] D0T0F0 chi^2_0 1.143; K: 1680/1680 ints (285876-474984 EPA) 60/60 ants, MGE 0.000937; dE: 11327/11520 ints (152-143720 EPA) 4 dirs 60/60 ants, MGE 0.0 0.382 0.622 0.465; noise 0.189, flags: PRIOR:233961382(56.67%) MISSING:6881280(1.67%) MAD:1815606(0.44%) SKIPSOL:19050520(4.61%)
INFO      12:36:13 - main               [0.0/0.0 3.7/10.2 0.0Gb] child process 74862 exited with status 9. This is a bug, or an out-of-memory condition.
INFO      12:36:16 - main               [0.0/0.0 3.7/10.2 0.0Gb] This error is not recoverable: the main process will now commit ritual harakiri.
(END)

I was told it's a memory error. However, lowering --dist-nworker, --dist-ncpu and --data-time-chunk gives the same error.

Full log:

ddcal_0.log

bennahugo commented 1 year ago

The data time and frequency chunk sizes should each be an integer multiple of the solution intervals. I think you may need to forgo a single chunk in frequency (and a single frequency gain). Try splitting your G bandwidth into 3 or 4. Your max number of chunks is sensible.
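To make the "integer multiple" rule concrete, here is a minimal sketch in plain Python. It is not part of CubiCal, and the helper name `chunk_fits_intervals` is hypothetical. It checks the time chunk from the command above (128) against the K time interval (4) and the dE time interval of 10 mentioned later in this thread. Intervals of 0 are skipped, on the assumption that 0 means "use the whole chunk".

```python
def chunk_fits_intervals(chunk_size, intervals):
    """Return True if chunk_size is an integer multiple of every positive interval.

    Hypothetical helper for illustration only; not a CubiCal function.
    Intervals of 0 are skipped (assumed to mean "whole chunk").
    """
    return all(chunk_size % n == 0 for n in intervals if n > 0)


# --data-time-chunk 128 against --k-time-int 4 alone
print(chunk_fits_intervals(128, [4]))

# The same chunk against both the K interval (4) and the dE interval (10)
print(chunk_fits_intervals(128, [4, 10]))
```

With these numbers, 128 divides evenly by 4 but not by 10, so the second check fails.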

JSKenyon commented 1 year ago

As long as you can be patient, simply set --dist-ncpu 0, --dist-max-chunks 1. The memory estimate was already larger than what is available on the box. That said, you can easily improve on this by tuning the chunk size, the number of workers and the number of chunks. Your solution intervals are not that long in time (10 on the dE term). I would recommend doing something like --data-time-chunk 10, --dist-max-chunks 4, --dist-ncpu 5, --dist-nworker 0, --dist-nthread 0. This should (in theory) only load 40 timeslots (4 chunks of 10) into memory simultaneously. If that doesn't work, shoot me the log and we can try again.
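The arithmetic behind that estimate can be sketched as follows. This is an illustration only, assuming the number of timeslots held in memory is simply the time chunk size multiplied by the maximum number of chunks processed at once. The function name is hypothetical and not part of CubiCal.

```python
def timeslots_in_memory(time_chunk, max_chunks):
    """Rough upper bound on timeslots resident in memory at once.

    Assumption: each of the max_chunks concurrent chunks holds
    time_chunk timeslots. Hypothetical helper, not a CubiCal API.
    """
    return time_chunk * max_chunks


# Suggested settings: --data-time-chunk 10, --dist-max-chunks 4
print(timeslots_in_memory(10, 4))
```

Under this assumption, halving either the chunk size or the number of concurrent chunks roughly halves the visibility data held in memory.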

Kincaidr commented 1 year ago

> I would recommend doing something like --data-time-chunk 10, --dist-max-chunks 4, --dist-ncpu 5, --dist-nworker 0, --dist-nthread 0.

This has done it.