igmk / pamtra

Passive and Active Microwave TRAnsfer model
GNU General Public License v3.0

Computation speed and memory conservation #48

Open ephraims28 opened 1 year ago

ephraims28 commented 1 year ago

Hi, I am at the point where I am trying to scale up the size of my Pamtra runs and am running into memory and computation speed issues. I have a few questions.

My input WRF file is 130m resolution and has dimensions of 8400x3780x135 (lat x lon x height). So far, I've been doing runs on a 100x100x135 slice of the model data. When running this slice at a single frequency (when there are hydrometeors for which scattering is important, like snow), it takes about 10 minutes to run Pamtra on a Cheyenne batch node with 36 CPUs using:

pam.runParallelPamtra(freqs,pp_deltaX=10,pp_deltaY=10,pp_deltaF=1,pp_local_workers='auto')

Extrapolating that out to running Pamtra over my full domain and 9 frequencies would result in a runtime of around 200 days. I also have run into issues with using too much memory on my node.

To reduce my memory usage, I modified core.py to prevent the construction of the large radar output arrays for when I am only in passive mode. I used the same methodology you used for disabling saving the psd. This seems to have helped.

Another idea I had is to reduce the number of output angles for passive remote sensing. I only care about zenith measurements and don't need 32 output angles. I couldn't find a namelist parameter for this, but I saw there was a variable called self._nangles in core.py. I changed it to 1, but that only changed the dimensions of tb for the final output, not for the individual jobs that are joined together, so I got a dimension mismatch error. Is there an easier way to change the number of output angles? Would this not only conserve memory, but also speed up computations?

When running parallel pamtra, is it only distributing tasks over the CPUs on a single node? Or if I added nodes to my batch job, would pamtra run across the CPUs on all my nodes?

Any other suggestions to speed up my computation time?

Thanks, Sam

DaveOri commented 1 year ago

Hello again Sam, pamtra is designed for flexibility, to facilitate adaptation to various inputs and consistency with the input microphysical assumptions, not for speed. Scalability can be an issue, as with any other scientific software; however, given the massively parallel nature of the independent column approximation, it is always possible to split the input into manageable chunks.

It seems to me that you are not technically facing an issue with the code, but rather one with your specific problem/architecture. In these cases, developers can only provide advice based on experience, not a real fix.

The only thing I might point you toward is issue #24. This is caused by a bad interaction between the python subprocess module and the thread-safety of the popular BLAS numerical library. If pamtra links dynamically to a high-performance version of BLAS, too many threads are spawned simultaneously, causing severe traffic jams in the system scheduler and, sadly, very bad performance. If you are using openblas, you likely have this issue. The solution is to add to your batch script, after loading the openblas module, the following: export OPENBLAS_NUM_THREADS=1 (this works for a bash shell; you need to adapt it for other shells). Alternatively, you can try executing os.environ['OPENBLAS_NUM_THREADS'] = '1' in your python script before importing pyPamtra. All of this works if you use openblas; other high-performance BLAS implementations (like intel mkl) might present the same issue, I just do not have enough experience to tell by heart how to set the multithreading behavior of all of those libraries.
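
To be explicit, here is a minimal sketch of the python variant; the key details are that the value must be a string and that the assignment must happen before the import:

import os
os.environ['OPENBLAS_NUM_THREADS'] = '1'   # must be set before pyPamtra (and numpy) is imported

import pyPamtra   # openblas calls inside each worker process now stay single-threaded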

Regarding the memory problem: it also depends on how much memory you have available and how the OS behaves regarding python memory allocation, so again I can't really help here. I can try to imagine what might be happening based on your input, but it is very likely I will be wrong because... well, I need to guess heavily. I will also have to guess a lot for the computing time problem. If they are not too heavy, you could try posting your scripts next time.

My input WRF file is 130m resolution and has dimensions of 8400x3780x135 (lat x lon x height). So far, I've been doing runs on a 100x100x135 slice of the model data.

Are you doing the slicing within the same python script, perhaps with xarray or some other lazy-evaluating library? This might give you the wrong impression that most of the memory and computing time is spent in pamtra, while the data loading and slicing already takes its share. I would suggest splitting your domain into smaller chunks beforehand with cdo or some other efficient data-manipulation tool, for example as sketched below.
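
Just to illustrate the idea, a rough sketch of pre-splitting the domain into small files before pamtra is even imported; I use xarray here only because you mentioned it, cdo or nco would do the same job, and the file and dimension names are of course guesses:

import xarray as xr

ds = xr.open_dataset('wrfout.nc')                      # hypothetical file name, opened lazily
for i in range(0, ds.sizes['south_north'], 100):       # dimension names are assumptions
    for j in range(0, ds.sizes['west_east'], 100):
        chunk = ds.isel(south_north=slice(i, i + 100),
                        west_east=slice(j, j + 100))
        chunk.to_netcdf(f'wrf_chunk_{i:05d}_{j:05d}.nc')   # small files each job can read independently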

100x100x135 slice of the model data. When running this slice at a single frequency (when there are hydrometeors for which scattering is important, like snow), it takes about 10 minutes to run Pamtra on a Cheyenne batch node with 36 CPUs using: pam.runParallelPamtra(freqs,pp_deltaX=10,pp_deltaY=10,pp_deltaF=1,pp_local_workers='auto')

With this configuration you have 10k columns, which you divide into chunks of 100 columns each, meaning 100 parallel jobs for 36 workers. There might be an inefficient allocation of resources here. You can try pp_deltaY=5 to double the number of jobs and halve the job size. I am not sure about the result, it might even increase the computing time (there is always some overhead when you submit a chunk of jobs to a free worker), but it is worth trying, for example as in the sketch below. Also, try setting pp_local_workers=36 explicitly, or, if you have an SMT cpu with 18 cores and 36 threads, you can also test pp_local_workers=18; in some cases multithreading is detrimental to performance.
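
Something along these lines, to be benchmarked rather than taken as a guaranteed improvement:

# smaller chunks (10x5 = 50 columns, i.e. 200 jobs) and an explicit number of workers
pam.runParallelPamtra(freqs,
                      pp_deltaX=10, pp_deltaY=5, pp_deltaF=1,
                      pp_local_workers=36)   # or 18 if SMT/hyper-threading turns out to be detrimental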

Extrapolating that out to running Pamtra over my full domain and 9 frequencies would result in a runtime of around 200 days. I also have run into issues with using too much memory on my node

Is 10 minutes the time for running pamtra or the time for running the entire script? Again, loading the whole dataset takes time. Try also running all 9 frequencies at once (it should not increase memory much), because you do not need to load the same dataset 9 times for 9 different frequencies. If I/O is your limitation, this can help cut down your estimates.
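
For instance (the frequency values below are placeholders, not your actual channels):

# pass the full list instead of running the whole script once per frequency,
# so the WRF slice is read and the pamtra profile object is built only once
freqs = [23.8, 31.4, 50.3, 89.0, 118.75, 157.0, 183.31, 243.0, 325.15]   # hypothetical 9 channels
pam.runParallelPamtra(freqs, pp_deltaX=10, pp_deltaY=5, pp_deltaF=1,
                      pp_local_workers=36)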

To reduce my memory usage, I modified core.py to prevent the construction of the large radar output arrays for when I am only in passive mode. I used the same methodology you used for disabling saving the psd. This seems to have helped.

I doubt it reduced the memory allocation significantly, but I might be wrong; without knowing exactly what has been modified it is rather difficult to say. I have to say that you seem happy to go into low-level mods as a first step, skipping the higher-level options you might have. That is obviously fine, but I would like to remind you of 2 issues: 1) the boring reinstallation at every mod (which you definitely know already); 2) the loss of dev support due to divergence of code versions (we can't help you at all if something goes wrong). So keep up the good work, but with these caveats.

Another idea I had is to reduce the number of output angles for passive remote sensing. I only care about zenith measurements and don't need 32 output angles.

32 angles are needed to compute the angular integral of the radiation fluxes within the adding-doubling method of RT4. You need them even if, in the end, you only want the result at 1 specific direction.

When running parallel pamtra, is it only distributing tasks over the CPUs on a single node? Or if I added nodes to my batch job, would pamtra run across the CPUs on all my nodes?

That depends 100% on your system. pamtra does not know the underlying architecture; it can only ask the OS for the number of cpus when you set the number of local workers to 'auto'. In principle it can go over multiple nodes, but usually the process isolation makes the memory management quite troublesome (again, this depends on your system and your scheduler). My best piece of advice here is again to make things simpler: split your data into manageable chunks, e.g. 100 pieces of 84x3780, and make a shell script that submits 100 jobs to your scheduler on the different smaller pieces, each running something like the per-chunk driver sketched below.
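
As a very rough sketch, each of those jobs could run a small driver script like this (names, paths and the reader part are placeholders, not actual code for your case):

import sys
import pyPamtra

ichunk = int(sys.argv[1])            # 0..99, passed by the submitting shell script or a job array
nchunks = 100
rows = 8400 // nchunks               # 84 latitude rows per chunk
row0, row1 = ichunk * rows, (ichunk + 1) * rows

pam = pyPamtra.pyPamtra()
# ... set up the descriptor file and fill the profile with the WRF fields for rows row0:row1 only ...
# pam.runParallelPamtra(freqs, pp_deltaX=10, pp_deltaY=10, pp_deltaF=1, pp_local_workers=36)
# pam.writeResultsToNetCDF(f'pamtra_chunk_{ichunk:03d}.nc')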

Sorry for all the guessing, Davide

ephraims28 commented 1 year ago

Thanks for all your ideas! My issues are not on the I/O side. The runtimes I mentioned were just for running pam.runParallelPamtra, as output by setting pyVerbose to 1. Adding the openblas command to my python script didn't change anything, nor did setting pp_local_workers manually. Making pp_deltaX and pp_deltaY smaller helped a little, but not much, decreasing the runtime from 10 min to 9 min for a single frequency. (When I disable scattering for snow, the runtime falls to 4 min, but I will have to keep that enabled for my experiment.) Running all 9 frequencies took 83 minutes.

Also, thanks for the note about the use of the various angles even if I don't need that in my final output.

Even submitting chunks of anything near the size you're mentioning would be much too large to finish in my 12 hr wallclock limit. I think I'll resort to some type of random sampling of my domain, which should be sufficient for my goal of making a retrieval.

DaveOri commented 1 year ago

Hi Sam,

  1. Even though it is unlikely in your case (I can say that because it seems you spend a lot of time on snow scattering), lazy-evaluating things might shift the I/O time into the computing time. So I am 99% convinced it is not your problem, but it might be in general.
  2. General comment: remember to switch off verbosity once you go into production ;)
  3. What are you using for snow scattering? Can you post the full descriptor file?
  4. I do not know about the limitations of your system. I just wanted to mention that if you manage to find a feasible chunk of data for 1 job submission, you can simply split your domain into N chunks and submit N jobs. If your scheduler allows arrays of jobs, you can do that even in just 1 batch script. For example, your 100x100 chunk with 9 frequencies might be ok; just submit 3000 of them and the scheduler should run them in parallel.
  5. Random sampling is a good idea for your application. Maybe you will also find it useful to perform some data selection in advance, like skipping clear-sky pixels or something; again, it depends on your application. You are the expert here.

ephraims28 commented 1 year ago

The descriptor file I'm using is similar to the ACLOUD example with a few changes:

descriptorFile = np.array([
    # ['hydro_name' 'as_ratio' 'liq_ice' 'rho_ms' 'a_ms' 'b_ms' 'alpha_as' 'beta_as' 'moment_in' 'nbin' 'dist_name' 'p_1' 'p_2' 'p_3' 'p_4' 'd_1' 'd_2' 'scat_name' 'vel_size_mod' 'canting']
    ('cwc_q', 1.0,  1, -99.0,   -99.0, -99.0,  -99.0, -99.0, 23, 100, 'mgamma', -99.0, -99.0,   2.0,    1.0,   2.0e-6,   8.0e-5, 'mie-sphere', 'corPowerLaw_24388657.6_2.0', -99.0),
    ('iwc_q', 1.0, -1, -99.0, 1.58783,  2.56,  0.684,   2.0, 23, 100, 'mgamma', -99.0, -99.0, 1.564, 0.8547, 1.744e-5, 9.369e-3, 'ssrg-rt3',   'corPowerLaw_30.606_0.5533',  -99.0),
    ('swc_q', 0.6, -1, -99.0,   0.038,   2.0, 0.3971,  1.88, 23, 100, 'mgamma', -99.0, -99.0,   1.0,    1.0,  5.13e-5, 2.294e-2, 'ssrg-rt3',   'corPowerLaw_5.511054_0.25',  -99.0),
    ('gwc_q', 1.0, -1, -99.0,  500.86,  3.18,  -99.0, -99.0, 3, 100, 'mgamma', -99.0, 1e-3,  5.37,   1.06,  2.11e-4,   1.3e-2, 'mie-sphere', 'corPowerLaw_406.67_0.85',    -99.0),
    ('rwc_q', 1.0,  1, -99.0,   -99.0, -99.0,  -99.0, -99.0, 13, 100, 'mgamma', -99.0, -99.0,   2.0,    1.0,  0.00012,   8.2e-3, 'mie-sphere', 'corPowerLaw_494.74_0.7031',  -99.0)],
    dtype=[('hydro_name', 'S15'), ('as_ratio', '<f8'), ('liq_ice', '<i8'), ('rho_ms', '<f8'), ('a_ms', '<f8'), ('b_ms', '<f8'), ('alpha_as', '<f8'), ('beta_as', '<f8'), ('moment_in', '<i8'), ('nbin', '<i8'), ('dist_name', 'S15'), ('p_1', '<f8'), ('p_2', '<f8'), ('p_3', '<f8'), ('p_4', '<f8'), ('d_1', '<f8'), ('d_2', '<f8'), ('scat_name', 'S20'), ('vel_size_mod', 'S30'), ('canting', '<f8')] 
    )

The problem with submitting thousands of jobs of 100x100 chunks is that, even though the wallclock time for all the chunks to run would be reasonable, I'd quickly go through my allowance of 400,000 core hours.

Good points on the random sampling.

DaveOri commented 1 year ago

Hey Sam, sorry it took some time, I was out of office. Your DF seems ok from the standpoint of performance. Perhaps I can just mention that, since ice and cloud droplets have a limited size range, you could reduce the number of bins there to 50 or even 10, although I do not think it will make a difference at all.

It is a little odd to me that you have such a nice variety of input moments, but I admit I lack first-hand experience with wrf and all of its microphysical modules; perhaps, since your problem is resource-sensitive, it is worth double-checking that the DF values match your microphysical assumptions.

Otherwise, good luck