hpc4cmb / toast

Time Ordered Astrophysics Scalable Tools

Expected memory usage in TOAST satellite sim #199

Closed · zonca closed this issue 6 years ago

zonca commented 6 years ago

In the PICO noise simulations, I am running with just 1 pixel and 1 MPI process per node, and I can fit 3 days' worth of data per node. I am using a configuration very similar to the tiny satellite example for everything else. PICO is 1 kHz, so I expect 1 day of 1 pixel to occupy 1000*3600*24*8*2/1e9 = 1.4 GB,

so 3 days would be 4.2 GB
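
For reference, the same arithmetic as a plain Python back-of-the-envelope check (no TOAST calls; the factor of 2 doubles per detector sample is the assumption behind the numbers above):

sample_rate = 1000             # Hz, PICO band in question
seconds_per_day = 3600 * 24
bytes_per_sample = 8           # double precision
doubles_per_sample = 2         # assumed two doubles per detector sample
days = 3

gb = sample_rate * seconds_per_day * bytes_per_sample * doubles_per_sample * days / 1e9
print(gb)                      # ~4.15 GB for 3 days of one detector pixel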

I see Madam allocating 6.1 GB for the TOD plus 14 GB for pointing.

TiMemory reports 34 GB current and 53 GB peak.

I am just checking whether you think all these numbers look reasonable, so I can use the rule of thumb of fitting between 4 GB and 5 GB of data per node to guess the initial size of jobs.

zonca commented 6 years ago

@tskisner @keskitalo does this sound reasonable?

keskitalo commented 6 years ago

@zonca while Madam reports TOD and pointing memory being allocated, the Toast OpMadam operator is set up so that the toast buffers containing the TOD, pixel numbers, and pointing weights are deallocated when the data are moved into the Madam buffers (and then regenerated when Madam finishes).

Your calculation above seems to indicate only two doubles per detector sample. Instead, there should be

Since you are running a single-detector simulation, the boresight pointing and the timestamps can be a significant part of your memory overhead. You should check (possibly by calling cache.report()) that the quaternions are purged from the cache once the pixel numbers and pointing weights have been generated.
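
A minimal sketch of that check, assuming the toast 2.x layout in which each observation dictionary carries a TOD object with a .cache attribute (data stands for the pipeline's Data instance):

for obs in data.obs:        # data: the toast Data object built by the pipeline (assumed)
    tod = obs["tod"]
    tod.cache.report()      # list the buffers this process still holds, e.g. leftover quaternions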

zonca commented 6 years ago

ok, thanks, I'll experiment more with the next PICO runs to make sure I understand how RAM is being used.

zonca commented 6 years ago

I am running a full focal plane sim for PICO band 9 with 684 detectors. For my first test I am simulating 1 observation of 6 hours on 138 nodes, just noise. It fails with a memory error.

That is about 5 detectors per node, so roughly 126*3600*6*8*5/1e9 = 0.1 GB of signal TOD. Even with all the overhead, I don't see how this could fail to fit.

@tskisner @keskitalo what do you recommend I do?

zonca commented 6 years ago

See log in /global/cscratch1/sd/zonca/pico/full_focal_plane_noise/log

keskitalo commented 6 years ago

I suggest you reduce the problem until it runs and examine the memory profile.

jdborrill commented 6 years ago

The representative run still works, no? So we know it can be done.

J


zonca commented 6 years ago

@jdborrill good idea, I'm running the representative run now

@keskitalo ok, reduced to 2.4 hours observations and resubmitted

zonca commented 6 years ago

@jdborrill still working on getting the representative example to work on Edison. It is failing with a memory error due to the dipole simulation. I resubmitted with no dipole and no PySM.

jdborrill commented 6 years ago

Hi Andrea,

Weren't the PICO noise-only runs failing?

J


zonca commented 6 years ago

Yes, I was testing the representative example in the default configuration. Trying noise only now.

zonca commented 6 years ago

The representative example works on Edison:

Memory occupied by signal only:

now back to PICO

zonca commented 6 years ago

PICO band 9 is a factor of 5 larger per observation:

10 nodes per observation to keep 6 GB per node.

Testing 30 observations on 300 nodes.

zonca commented 6 years ago

PICO went out of memory, resubmitted with only 15 observations.

@keskitalo I don't understand how I should choose groupnodes. I set it to 2, the same as the representative example. Is that a good choice? Could it have a direct impact on memory usage?

keskitalo commented 6 years ago

@zonca I'm not an expert on the satellite simulation pipeline. It was written by @tskisner.

zonca commented 6 years ago

out of memory even with 15 observations, during pointing, inside qarray:

Read parameters, compute data distribution:  0.03 seconds                                                     
Construct boresight pointing:  4.85 seconds                                                                   
terminate called after throwing an instance of 'toast::exception'                                             
terminate called after throwing an instance of 'toast::exception'                                             
terminate called after throwing an instance of 'toast::exception'                                             
  what():  Exeption at line 20 of file toast_memory.cpp:  cannot allocate 87091208 bytes of memory with alignment 64
  what():  Exeption at line 20 of file toast_memory.cpp:  cannot allocate 87091208 bytes of memory with alignment 64
  what():  Exeption at line 20 of file toast_memory.cpp:  cannot allocate 87091208 bytes of memory with alignment 64
[32/44] : /scratch3/scratchdirs/zonca/toast-prefix/lib/libtoast.so(_ZN5toast6qarray9amplitudeEmmmPKdPd+0x3a)  [0x2aaaacaff8ca]
[31/44] : /scratch3/scratchdirs/zonca/toast-prefix/lib/libtoast.so(_ZN5toast6qarray9normalizeEmmmPKdPd+0x8e)
zonca commented 6 years ago

@tskisner do you have any suggestion on how to debug this out-of-memory issue? PICO goes out of memory even with just 15 observations on 300 nodes. The representative example doesn't go out of memory with twice that amount of data.

keskitalo commented 6 years ago

Have you looked at the memory profile of the representative run? Where is the memory used?

tskisner commented 6 years ago

Hi @zonca, sorry, this thread did not show up on my notifications, even with the mention! I have not tried to run any big pico sims. Will try to at least reproduce the behavior you are seeing.

zonca commented 6 years ago

@keskitalo on the representative run, if I look at the peak memory of the process:

if I look at the peak memory of the function:

output folder on Edison is /scratch3/scratchdirs/zonca/launch_toast_representative/out_representative_satellite_edison

keskitalo commented 6 years ago

@zonca This is really useful. Can you tell if those numbers match your expectations about the memory requirements? If you have the memory profile for all processes, is there a lot of variability?

zonca commented 6 years ago

@keskitalo yes, this is reasonable: I expect 0.75 GB just for the signal per rank, and 4.6 GB is about 6x that, which is reasonable once pointing is included. There is little variability in the peak total RAM used, from 4.4 GB to 4.65 GB.

I don't see anything suspicious in the representative run. The issue is PICO, I guess due to the number of detectors.

jdborrill commented 6 years ago

PICO has a significantly higher number of detectors, a higher sampling rate, and a longer duration.

This is failing before you reach MADAM, correct? Reijo did fix a memory issue there yesterday.

Julian


zonca commented 6 years ago

Yes, I'm scaling the number of observations accordingly; please cross-check above, maybe I messed up my calculations.

Correct, it fails at pointing.

tskisner commented 6 years ago

@zonca, for the satellite sim example, the --groupsize option controls how many processes are in a group. Each group is assigned some of the observations. So if you are using 24 processes per node, and set groupsize to 96, then a single observation would be processed by 4 nodes. If you don't specify this, then you get one big group the size of MPI_COMM_WORLD.

Let's say you are using 300 nodes on edison with 24 processes per node (7200 processes). Using the default for this option would only make sense if you had more than 7200 detectors, otherwise some processes would sit idle.

Usually what I do is to see how many nodes I need to have enough memory for one observation. Then I set that number of processes to the groupsize and make the total job size some multiple of this (so there are a whole number of groups). Then this number of groups will work on all the observations.

Example: You have 365 daily observations and 10000 detectors.

  1. Running a single day of data (one observation) using a single group (the default), you find that you need 16 nodes with 8 processes each on edison to have enough memory. In this case, the 10000 detectors are distributed among the 128 processes (~79 detectors per process) and everything runs in 10 minutes.

  2. Based on that test, we set our --groupsize to 128. Let's say we have 10 groups (160 nodes using the same 8 processes per node). The 365 observations will be distributed to those 10 groups, with each one getting 36 or 37 observations. The run time should be about ~37 x 10 minutes = 370 minutes = just over 6 hours. If we ran with 20 groups (320 nodes), we might hope to get that down to ~3 hours.

So the idea is to tune the group size based on the largest observation, and then make sure that each group has several observations assigned to it so that they can be load balanced.
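
As a small sketch of that sizing arithmetic using the numbers from the example above (the variable names are ours, not pipeline options):

import math

def plan_job(n_obs, procs_per_group, procs_per_node, n_groups, minutes_per_obs):
    """Estimate total nodes and wall time once the group size is fixed."""
    nodes_per_group = procs_per_group // procs_per_node    # 128 procs / 8 per node = 16 nodes
    total_nodes = nodes_per_group * n_groups
    obs_per_group = math.ceil(n_obs / n_groups)            # observations are spread over the groups
    wall_minutes = obs_per_group * minutes_per_obs
    return total_nodes, wall_minutes

# 365 daily observations, groupsize 128 (16 nodes x 8 procs), 10 groups, ~10 minutes per observation
print(plan_job(365, 128, 8, 10, 10))    # -> (160, 370): 160 nodes, just over 6 hours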

zonca commented 6 years ago

Thanks, is there a place where we can copy your explanation into the docs? I'll make some tests and report back here.

zonca commented 6 years ago

OK, I am testing with 1 group only and 1 observation: it works with 10 nodes (using 55 GB of memory on each node) and fails with 8 nodes. I am now testing 30 observations on 300 nodes with each group spanning 10 nodes.

zonca commented 6 years ago

Yes! 30 observations on 300 nodes work (I tested 39 observations and that goes out of memory). We want to make a 4-year simulation; I will test on Cori2 to check how many Phi nodes we would need for that. I guess we will need a reservation, it will be a huge job. We can run production after #224 is fixed.

tskisner commented 6 years ago

Ok, thanks for the testing! Does the scan strategy repeat every year? Just wondering if we can get away with doing a one-year sim and adjusting the detector NET accordingly...

jdborrill commented 6 years ago

We should certainly think about it, though we might need 2 years to be able to do year/year splits.

Something is horribly wrong with the memory use though - 30 days is about 10GB of data, and 300 nodes have about 30TB of memory. Do you have the profile?

J


zonca commented 6 years ago

Yes, see the timing file in /global/cscratch1/sd/zonca/pico/full_focal_plane_noise/band_09: it uses 8 GB per process and there are 8 processes per node; Edison has 64 GB of RAM per node.

jdborrill commented 6 years ago

My bad - I miscounted the zeros in my spreadsheet and turned TB into GB! If only it were that easy ...

So 30 days is 10TB of data and 300 nodes have 30TB of memory, and there's a high water mark somewhere in the pointing that can potentially add up to 8 more doubles per sample, which would increase the requirement to 25TB.

Sounds like we're exercising that "potentially"; do we need to be?

J


tskisner commented 6 years ago

Obviously the detector quaternions (4 doubles per sample) are generated as a temporary object:

https://github.com/hpc4cmb/toast/blob/master/src/python/tod/sim_tod.py#L339

... and passed back to the code computing the pixel indices and weights:

https://github.com/hpc4cmb/toast/blob/master/src/python/tod/pointing.py#L263

... and then the quaternions are del'ed:

https://github.com/hpc4cmb/toast/blob/master/src/python/tod/pointing.py#L291

zonca commented 6 years ago

As discussed yesterday, I tested the same run of 30 days on 300 nodes (which was working fine with 24-hour observations), now with 30*24 one-hour observations. It goes out of memory.

I was using a group size of 10 nodes before; I reduced it to 1 node since my observations are now 1/24th as long. @tskisner is this OK?

group_num_nodes=1
groupsize=$(( node_proc * group_num_nodes ))

Here is the log:

srun --cpu_bind=cores -n 2400 -N 300 -c 6 /scratch3/scratchdirs/zonca/toast-prefix/bin/toast_satellite_sim.py @pico_scanning_noise.par --toast-output-dir out/onehourobs_band_09 --samplerate 126.0 --toast-timing-fname timing_report_main --groupsize 8 --madam --madampar pico_madam.par --fp focalplane/pico_09.pkl --numobs 720 --outdir out/onehourobs_band_09/out
OpenMP number of threads: 3
Running with 2400 processes
Create focalplane:  0.07 seconds
Read parameters, compute data distribution:  0.03 seconds
Construct boresight pointing:  0.85 seconds
Pointing generation took 26.194 s
Scanning local pixels
Local submaps identified in 1.133 s
Noise simulation took 24.141 s
terminate called after throwing an instance of 'toast::exception'
  what():  Exeption at line 20 of file toast_memory.cpp:  cannot allocate 925346040 bytes of memory with alignment 64
> rank 45
|0045> [pyc] main@'toast_satellite_sim.py'                :  81.673 wall, 159.610 user +  16.960 system = 176.570 CPU [sec] (216.2%) : RSS {tot,self}_{curr,peak} : (5655.5|5522.9) | (5556.6|5426.5) [MB]
|0045> [pyc] |_slew_precession_axis                       :   0.390 wall,   0.300 user +   0.080 system =   0.380 CPU [sec] ( 97.5%) : RSS {tot,self}_{curr,peak} : ( 296.9| 290.1) | (  48.0|  47.0) [MB] (total # of laps: 3)
|0045> [pyc] |_satellite_scanning                         :   0.409 wall,   0.990 user +   0.160 system =   1.150 CPU [sec] (281.3%) : RSS {tot,self}_{curr,peak} : ( 296.9| 290.1) | ( 129.0| 115.6) [MB] (total # of laps: 3)
|0045> [pyc]   |_rotation                                 :   0.004 wall,   0.020 user +   0.000 system =   0.020 CPU [sec] (529.2%) : RSS {tot,self}_{curr,peak} : ( 296.9| 290.1) | (   0.2|   0.0) [MB] (total # of laps: 6)
|0045> [pyc]   |_mult                                     :   0.136 wall,   0.330 user +   0.110 system =   0.440 CPU [sec] (322.5%) : RSS {tot,self}_{curr,peak} : ( 297.1| 290.1) | (  28.4|  27.3) [MB] (total # of laps: 12)
|0045> [cxx]     |_ctoast_qarray_mult                     :   0.040 wall,   0.090 user +   0.010 system =   0.100 CPU [sec] (252.2%) : RSS {tot,self}_{curr,peak} : ( 297.1| 290.1) | (  14.2|  13.8) [MB] (total # of laps: 12)
|0045> [pyc] |_exec@OpPointingHpix                        :  24.728 wall,  63.950 user +  10.150 system =  74.100 CPU [sec] (299.7%) : RSS {tot,self}_{curr,peak} : (3817.5|3793.5) | (3520.6|3503.4) [MB]
|0045> [pyc]   |_read_times@TODSatellite                  :   0.007 wall,   0.010 user +   0.010 system =   0.020 CPU [sec] (279.2%) : RSS {tot,self}_{curr,peak} : (2612.0|2609.7) | (   6.6|   0.0) [MB] (total # of laps: 3)
|0045> [pyc]   |_read_pntg@TODSatellite                   :   3.844 wall,   8.900 user +   2.590 system =  11.490 CPU [sec] (298.9%) : RSS {tot,self}_{curr,peak} : (3831.5|3779.6) | (  28.5|   0.0) [MB] (total # of laps: 255)
|0045> [pyc]     |_mult                                   :   3.790 wall,   8.720 user +   2.550 system =  11.270 CPU [sec] (297.3%) : RSS {tot,self}_{curr,peak} : (3831.5|3779.6) | (  28.4|   0.0) [MB] (total # of laps: 255)
|0045> [cxx]       |_ctoast_qarray_mult                   :   0.832 wall,   2.290 user +   0.070 system =   2.360 CPU [sec] (283.6%) : RSS {tot,self}_{curr,peak} : (3831.5|3779.6) | (   3.5|   0.0) [MB] (total # of laps: 255)
|0045> [cxx]   |_ctoast_pointing_healpix_matrix           :  16.190 wall,  42.800 user +   5.620 system =  48.420 CPU [sec] (299.1%) : RSS {tot,self}_{curr,peak} : (3856.5|3793.5) | (  17.8|  24.2) [MB] (total # of laps: 255)
|0045> [pyc] |_get_submaps                                :   1.133 wall,   1.130 user +   0.000 system =   1.130 CPU [sec] ( 99.7%) : RSS {tot,self}_{curr,peak} : (3817.7|3793.5) | (   0.2|   0.0) [MB]
|0045> [pyc]   |_exec@OpLocalPixels                       :   1.086 wall,   1.090 user +   0.000 system =   1.090 CPU [sec] (100.3%) : RSS {tot,self}_{curr,peak} : (3818.6|3793.5) | (   0.9|   0.0) [MB]
|0045> [pyc] |_exec@OpSimNoise                            :  21.888 wall,  59.350 user +   6.100 system =  65.450 CPU [sec] (299.0%) : RSS {tot,self}_{curr,peak} : (4733.9|4651.9) | ( 916.2| 858.4) [MB]
|0045> [pyc]   |_simulate_chunk@OpSimNoise                :  21.887 wall,  59.350 user +   6.100 system =  65.450 CPU [sec] (299.0%) : RSS {tot,self}_{curr,peak} : (4733.9|4651.9) | ( 313.5| 294.4) [MB] (total # of laps: 3)
|0045> [cxx]     |_ctoast_sim_noise_sim_noise_timestream  :  20.626 wall,  55.900 user +   5.640 system =  61.540 CPU [sec] (298.4%) : RSS {tot,self}_{curr,peak} : (4731.0|4651.9) | (  24.4|   3.9) [MB] (total # of laps: 255)
|0045> [cxx]       |_sim_noise_timestream                 :  20.588 wall,  55.790 user +   5.630 system =  61.420 CPU [sec] (298.3%) : RSS {tot,self}_{curr,peak} : (4731.0|4651.9) | (  24.3|   3.9) [MB] (total # of laps: 255)
|0045> [pyc] |_exec@OpMadam                               :  30.354 wall,  30.470 user +   0.440 system =  30.910 CPU [sec] (101.8%) : RSS {tot,self}_{curr,peak} : (5655.5|5522.9) | ( 921.6| 871.1) [MB]
|0045> [pyc]   |__prepare@OpMadam                         :   0.034 wall,   0.020 user +   0.000 system =   0.020 CPU [sec] ( 58.3%) : RSS {tot,self}_{curr,peak} : (4734.1|4651.9) | (   0.2|   0.0) [MB]
|0045> [pyc]   |__stage_data@OpMadam                      :  30.319 wall,  30.440 user +   0.440 system =  30.880 CPU [sec] (101.9%) : RSS {tot,self}_{curr,peak} : (5655.5|5522.9) | ( 921.4| 871.1) [MB]
|0045> [pyc]     |__stage_time@OpMadam                    :   0.022 wall,   0.060 user +   0.000 system =   0.060 CPU [sec] (269.9%) : RSS {tot,self}_{curr,peak} : (4751.6|4651.9) | (  17.4|   0.0) [MB]                                                                                                                                  
|0045> [pyc]     |__stage_signal@OpMadam                  :   0.957 wall,   1.050 user +   0.440 system =   1.490 CPU [sec] (155.8%) : RSS {tot,self}_{curr,peak} : (5655.3|5522.7) | ( 903.4| 870.9) [MB]

### ERROR ###  [ MPI rank : 45 ] Error code : 6 @ 0xb6ce00001e24 :  Signal: SIGABRT (error code: 6) abort program (formerly SIGIOT)

Call Stack:
[39/40] : /scratch3/scratchdirs/zonca/toast-prefix/lib/libtoast.so(_ZN5toast14StackBackTraceERSo+0x28)                                   [0x2aaaac9b76e8]
[38/40] : /scratch3/scratchdirs/zonca/toast-prefix/lib/libtoast.so(_ZN5toast24TerminationSignalMessageEiP9siginfo_tRSo+0x15c)            [0x2aaaac9afb2c]
[37/40] : /scratch3/scratchdirs/zonca/toast-prefix/lib/libtoast.so(_ZN5toast24TerminationSignalHandlerEiP9siginfo_tPv+0x26d)             [0x2aaaac9ae7dd]
[36/40] : /lib64/libpthread.so.0(+0x10b20)                                                                                               [0x2aaaab21ab20]
[35/40] : /lib64/libc.so.6(gsignal+0x37)                                                                                                 [0x2aaaabd678c7]
[34/40] : /lib64/libc.so.6(abort+0x13a)                                                                                                  [0x2aaaabd68c9a]
[33/40] : /opt/gcc/6.3.0/snos/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d)                                   [0x2aaab932393d]
[32/40] : /opt/gcc/6.3.0/snos/lib64/libstdc++.so.6(+0x96916)                                                                             [0x2aaab9321916]
[31/40] : /opt/gcc/6.3.0/snos/lib64/libstdc++.so.6(+0x96961)                                                                             [0x2aaab9321961]
[30/40] : /opt/gcc/6.3.0/snos/lib64/libstdc++.so.6(+0x96b78)                                                                             [0x2aaab9321b78]
[29/40] : /scratch3/scratchdirs/zonca/toast-prefix/lib/libtoast.so(_ZN5toast3mem13aligned_allocEmm+0x544)                                [0x2aaaacafb094]
[28/40] : /scratch3/scratchdirs/zonca/toast-prefix/lib/python3.6/site-packages/toast/cbuffer.so(+0x1729)                                 [0x2aaac83ab729]
[27/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(+0xe46b3)                         [0x2aaaaadb26b3]
[26/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(_PyObject_FastCallDict+0x8b)      [0x2aaaaad40e8b]
[25/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(+0x166408)                        [0x2aaaaae34408]
[24/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x4186)  [0x2aaaaae39046]
[23/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(+0x165930)                        [0x2aaaaae33930]
[22/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(+0x166854)                        [0x2aaaaae34854]
[21/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x4186)  [0x2aaaaae39046]
[20/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(+0x165930)                        [0x2aaaaae33930]
[19/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(+0x166854)                        [0x2aaaaae34854]
[18/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x4186)  [0x2aaaaae39046]
[17/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(+0x165930)                        [0x2aaaaae33930]
[16/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(+0x166854)                        [0x2aaaaae34854]
[15/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x4186)  [0x2aaaaae39046]
[14/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(+0x16629e)                        [0x2aaaaae3429e]
[13/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(+0x1665b2)                        [0x2aaaaae345b2]
[12/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x4186)  [0x2aaaaae39046]
[11/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(+0x165930)                        [0x2aaaaae33930]
[10/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(+0x166854)                        [0x2aaaaae34854]
[ 9/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x4186)  [0x2aaaaae39046]
[ 8/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(+0x16629e)                        [0x2aaaaae3429e]
[ 7/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(PyEval_EvalCodeEx+0x6d)           [0x2aaaaae348cd]
[ 6/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(PyEval_EvalCode+0x3b)             [0x2aaaaae3491b]
[ 5/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(PyRun_FileExFlags+0xb2)           [0x2aaaaae6f472]
[ 4/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(PyRun_SimpleFileExFlags+0xe7)     [0x2aaaaae6f5d7]
[ 3/40] : /global/common/software/cmb/edison/toast-deps/20171220_conda/bin/../lib/libpython3.6m.so.1.0(Py_Main+0xf2c)                    [0x2aaaaae8a6dc]
[ 2/40] : python3(main+0x16e)                                                                                                            [0x400bce]
[ 1/40] : /lib64/libc.so.6(__libc_start_main+0xf5)                                                                                       [0x2aaaabd536d5]
[ 0/40] : python3()                                                                                                                      [0x400c95]
tskisner commented 6 years ago

@zonca, I think this is a summary of your old and new data distributions:

Original:

  • 30 observations (30 days)
  • 300 nodes = 30 groups of 10 nodes (80 procs) each
  • Observations assigned to each group = 30 obs / 30 groups = 1 x 24 hours
  • Approximate data per process = 24 hours / 80 procs = 0.3 hours

New:

  • 720 observations (30 days * 24 hours)
  • 300 nodes = 300 groups of 1 node (8 procs) each
  • Observations assigned to each group = 720 / 300 = (2 or 3) x 1 hour
  • Approximate data per process = (2 or 3) hours / 8 procs = 0.25 or 0.375 hours

So some processes actually have more data in the new case... If you just kept the groupsize at 10, then you would have an identical amount of data per process:

Proposed:

  • 720 observations (30 days * 24 hours)
  • 300 nodes = 30 groups of 10 nodes (80 procs) each
  • Observations assigned to each group = 720 obs / 30 groups = 24 x 1 hour
  • Approximate data per process = 24 hours / 80 procs = 0.3 hours

One hypothesis was that Python was not releasing its memory pool back to the OS after temporary arrays are garbage collected. I made a small test script:

https://gist.github.com/tskisner/82bf69625825523cf025611dbcd3c96b

And ran this on cori for a variety of buffer sizes (in case the behavior was different depending on array size). In all cases, each process would take turns allocating and freeing 40GB of numpy arrays. Everything worked as expected.

So this seems to indicate that the memory management for numpy arrays is reasonable. It also means that the issue is that somehow the processes are allocating enough collective temporary arrays at the same time that we really do hit the limit (or there are still live references to the temporaries preventing garbage collection). Buffering the calculation would reduce the size of temporaries and might help.

My next step is to extract the boresight calculation function into a similar small test script and verify that I can use up all the memory using full-length temporaries and then add the buffering to see if it helps.
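
For illustration, a minimal numpy sketch of the buffering idea (quat_mult is a self-contained stand-in for the toast.qarray multiply, not the library call, and the chunk length is arbitrary; this is not the TOAST implementation):

import numpy as np

def quat_mult(q, p):
    """Elementwise quaternion product q * p in (x, y, z, w) order; p is a single quaternion."""
    x1, y1, z1, w1 = q[:, 0], q[:, 1], q[:, 2], q[:, 3]
    x2, y2, z2, w2 = p
    return np.stack([
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
    ], axis=1)

def expand_pointing_buffered(boresight, det_quat, buf_len=100000):
    """Rotate the boresight by a fixed detector quaternion chunk by chunk, so only
    buf_len x 4 doubles of temporaries are alive at a time instead of full-length arrays."""
    nsamp = boresight.shape[0]
    out = np.empty((nsamp, 4), dtype=np.float64)
    for start in range(0, nsamp, buf_len):
        stop = min(start + buf_len, nsamp)
        out[start:stop] = quat_mult(boresight[start:stop], det_quat)
    return out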

jdborrill commented 6 years ago

The memory reporting from Andrea's failed job also showed no problem in the boresight pointing generation (where the previous runs had shown the huge excess) but failed instead in MADAM. Furthermore, immediately before the OOM failure the reported memory seemed entirely reasonable and well within the node's capabilities.

Julian


tskisner commented 6 years ago

@zonca, it looks like I spoke too soon :-) I had spot-checked some ranges of buffer sizes, but then also let the script above run from the beginning for several hours. On cori-haswell, running 8 processes per node, when allocating 2048 buffers of 19.5MB each, it was still behaving the same as the 1 x 40GB case. When moving to 4096 buffers of ~9.7MB each, each python process retained its large memory footprint even after del'ing the arrays. So it appears the python memory management has a phase transition in the way it allocates and frees buffers (I'm assuming that large objects use malloc/free and smaller objects use the pool).

I'll investigate more and see if this is a python compile-time setting. However, it does seem like buffering would help in this situation, since temporary objects would be small.

zonca commented 6 years ago

@tskisner I'd like to test https://gist.github.com/tskisner/82bf69625825523cf025611dbcd3c96b myself, what is a specific configuration that fails to free memory?

do you also have a slurm script to launch this?

tskisner commented 6 years ago

To save time, edit the "mindiv" value to something like 1024. Then get an interactive node on cori haswell using the anaconda install provided by toast-deps:

$>  module use /global/common/software/cmb/cori/modulefiles
$>  module load toast-deps
(just to check...)
$>  which python
$>  salloc -N 1 -C haswell --qos=interactive -t 04:00:00

Now launch a copy of top that you can use to watch the memory use:

$>  xterm -e top -o %MEM &

Then run the script on 8 processes:

$>  srun -n 8 -N 1 python pypool.py

For the 1024 and 2048 case, you can watch the memory for each process go up and then down. For the 4096 case, the memory goes up and stays high until the fourth process eventually uses up the remaining memory on the node.

tskisner commented 6 years ago

TOAST Memory Use

zonca commented 6 years ago

I've dug a bit more into this memory issue. What I see is that after 8192 tiny arrays, only about half of the memory is freed; the rest remains allocated to the process. However, at the next iteration inside that process we allocate another 40GB of tiny arrays, and the memory usage never goes up to 40+20=60GB, it just goes back up to 40GB. So my guess is that this should not cause a node to go out of memory if you have 8 processes and each of them uses at most 1/8th of the memory. Instead, if you are doing anything in turns where 1 process temporarily uses, let's say, half of the node memory, un-freed memory could be a problem. It looks like this is the expected (and ugly) behavior, mostly due to how the OS manages memory: Python itself is calling free on all allocated memory, but the OS does not bother to release this memory right away back to the pool. However, the next time the process requests memory, it reuses what was not freed. See:

It looks like for large arrays instead, i.e. larger than tens of MB, memory is freed immediately, so using large arrays and keeping each of the, say, 8 processes below 1/8th of the node memory should mitigate this issue.
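
A stripped-down single-process version of that observation (sizes are illustrative; whether the small-array memory is returned to the OS depends on the malloc implementation, so watch RSS in top rather than trusting the comments):

import gc
import numpy as np

def churn(total_bytes, n_arrays):
    """Allocate ~total_bytes as n_arrays float64 arrays, then free them all."""
    arrays = [np.zeros(total_bytes // n_arrays // 8) for _ in range(n_arrays)]
    del arrays
    gc.collect()    # for many small arrays RSS may stay high; for a few large ones it drops

churn(2 * 1024**3, 8192)    # ~256 KB arrays: the allocator may keep the pages for reuse
churn(2 * 1024**3, 2)       # ~1 GB arrays: typically mmap'ed and released back on free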

jdborrill commented 6 years ago

Hi Andrea,

I thought the issue was that while that memory may still be available in the Python pool for reuse by Python allocations, it is unavailable for any non-Python allocations (e.g. in libMADAM).

Julian


zonca commented 6 years ago

@jdborrill I haven't read anything specific about this case. With the information I have now, I would guess that if it is the OS handling this, then as long as it is the same process it shouldn't matter whether the allocation comes from Python or from an extension. Also, numpy is a compiled extension that makes allocations directly in C rather than through the Python allocator. Anyway, I will make a test.

I tested this and my guess was wrong: if I do a malloc in Cython, it does not reuse the 20 GB that were not released before. So if I malloc a 40 GB array, the memory consumption goes up to 60 GB.
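
A rough re-creation of that test using ctypes instead of Cython (the sizes mirror the description above and should be scaled down on a smaller machine; the point is only that the raw malloc does not draw from the memory numpy retained):

import ctypes
import ctypes.util
import numpy as np

libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.malloc.restype = ctypes.c_void_p
libc.free.argtypes = [ctypes.c_void_p]

small = [np.zeros(32768) for _ in range(8192)]   # ~2 GB in ~256 KB arrays
del small                                        # part of this may stay attached to the process

nbytes = 40 * 1024**3                            # the 40 GB allocation from the test above
buf = libc.malloc(ctypes.c_size_t(nbytes))       # a separate C allocation: RSS climbs further
if buf:
    ctypes.memset(buf, 0, nbytes)                # touch the pages so they become resident
    libc.free(buf)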

tskisner commented 6 years ago

I've conducted a small test on edison using 2 observations of 24 hours length and 74 detectors at 40Hz sampling. First, I note that the default in the example satellite pipeline is to cache detector quaternions, which I believe we should never do, since they are fast to regenerate from the boresight pointing.

I further did some work in a branch to buffer several parts of the code that work with pointing data. Here is a memory use comparison running on one Edison node (8 processes). This was collected manually with htop.

Step Master Buffered Expected Notes
start 3GB 3GB 3GB This was the memory in use at the start.
boresight 5.75GB 4.75GB 3.3GB Velocity not yet read, so 2 obs x 3456000 x (1 flag byte + 4 x 8 byte quaternion) = ~ 0.3GB increase.
pointing 22.8GB 21.3GB 19.7GB 2 obs x 74 dets x 4 nums x 8 bytes x 3456000 samp = ~16.4GB increase
PySM 30.6GB 29.5GB 24.3GB 2 obs x 74 dets x 8 bytes x 3456000 samp = ~4.1GB increase, plus flags of 0.5GB. Extra 4GB must be replicated map data? Note that high water mark during this step was ~36GB
dipole 30.7GB 29.5GB 24.3GB should cache velocity, and accumulate to existing signal
noise 35GB 33.5GB 28.4GB should add another TOD = ~4.1GB increase

Even for this simple case with a low sample rate, the buffered boresight calculation uses less memory. Also, I think these numbers are close to the expected ones (with the exception of a few GB of aggregate memory across 8 processes). I'll open a PR for review.
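
For the record, the arithmetic behind the "Expected" column (plain Python, mirroring the byte counts quoted in the Notes):

nobs, ndet, nsamp = 2, 74, 3456000                 # 24 h at 40 Hz = 3456000 samples per observation

boresight_gb = nobs * nsamp * (1 + 4 * 8) / 1e9    # flag byte + 4-double boresight quaternion, ~0.23 GB
pointing_gb = nobs * ndet * 4 * 8 * nsamp / 1e9    # 4 numbers x 8 bytes per detector sample, ~16.4 GB
signal_gb = nobs * ndet * 8 * nsamp / 1e9          # one detector timestream, ~4.1 GB

print(boresight_gb, pointing_gb, signal_gb)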

jdborrill commented 6 years ago

Hmm ... seems like a pretty marginal gain, much smaller than the difference between expected and actual in either mode.

Julian


tskisner commented 6 years ago

This is only a marginal gain in this tiny example. In the case of a high sample rate, long observations, and few detectors per process (so the boresight is replicated more times), I think the gain will be larger. I will make more tests on this today.


tskisner commented 6 years ago

I reconfirmed these results (similar memory use between the branches, with dipole keep_quats=False in the master branch) with a PICO band at 126 Hz. I'm moving up to a band with more detectors and a higher sample rate. It may be that this entire issue was just accidental caching of the detector quaternions, which we should not do (it is fast to recompute).

tskisner commented 6 years ago

Using one 24-hour observation of band 19 (254 detectors, 850 Hz sampling), the memory use is largely consistent between master and the buffer branch. Conclusions:

The speed-up is nice; are there any objections to using this caching technique in #236 (after I fix the unit tests)? I think that, modulo map-domain objects and other small miscellaneous memory use, we understand what is going on. We should be aware that when running PySM there will be some extra GB of memory needed beyond the timestreams.

tskisner commented 6 years ago

I have now successfully run this set of detectors (PICO band 19, 254 detectors at 850 Hz) with 24-hour observations, simulating the sky with PySM, adding dipole and noise, and making a map with Madam. I ran 30 days on 30 nodes on both edison and cori-knl.

This was run using the branch in #236. Attached here are the slurm script and parameter files for the cori run: pico_cori.slurm.txt pico_madam.par.txt pico_scanning.par.txt

As far as I know, merging that PR will close this issue.

zonca commented 6 years ago

Thanks, I am running my PICO full focal plane noise run with @tskisner's buffered branch.