Closed: ashwinvis closed this issue 1 year ago
It hadn't even occurred to me that I didn't have any MPI tutorials. Thanks for pointing that out! I'll fix that this week :-)
Took a little longer than planned, sorry, but an MPI tutorial is now included. It follows almost exactly the 'Low Resolution' spherical case for consistency, but with multiple depth levels to allow multiple MPI ranks. When run on the recommended number of processors, it finishes in a couple of minutes.
40 processors is a bit much, so I would have to get access to a supercomputer to test it. Would it be possible to reduce the requirements so that even with 8 processors you get a result in a few minutes?
I've updated the ABOUT_TUTORIAL.md to include a note / instructions on how to adjust the tutorial (changing a single number in the generate_data_sphere.py script) to allow it to run on fewer processors in a reasonable amount of time.
The default setting is 24 processors running for ~10 minutes.
Does that seem reasonable?
Reducing the MPI requirement
24 processors is a fairly heavy requirement if you are not running on a computing cluster. You can simply run on fewer processors (highest efficiency if the number of processors divides evenly into 48, the number of vertical levels), but at the cost of increased runtime.
To reduce the processor cost without increasing runtime, you can decrease the number of vertical levels proportionately. E.g. you can reduce the vertical levels to 12 in order to run on 6 processors in a similar amount of time.
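The load-balance point above can be sketched in a few lines of Python. The block distribution of levels over ranks is an assumption for illustration, not necessarily the package's exact scheme; the idea is that the runtime scales with the busiest rank, so uneven splits waste processors:

```python
def levels_per_rank(n_levels: int, n_ranks: int) -> list[int]:
    """Distribute n_levels vertical levels over n_ranks MPI ranks
    as evenly as possible (simple block distribution)."""
    base, extra = divmod(n_levels, n_ranks)
    # The first `extra` ranks each take one additional level.
    return [base + (1 if rank < extra else 0) for rank in range(n_ranks)]

# Runtime is set by the most-loaded rank; with 48 levels, rank counts
# that divide 48 evenly (e.g. 24, 16, 12) keep every rank equally busy,
# while e.g. 7 ranks leaves one rank lighter than the rest.
for ranks in (24, 16, 12, 7):
    load = levels_per_rank(48, ranks)
    print(ranks, "ranks ->", load)
```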
To adjust the number of vertical levels, edit line 13 of generate_data_sphere.py, which reads `Nlon, Nlat, Ndepth = int(360//2), int(180//2), 48`. The last number, `48`, specifies the number of vertical levels. When running the code, you can use any number of MPI ranks up to the number of vertical levels, but the most efficient use of processors occurs when the number of MPI ranks divides evenly into the number of vertical levels.
For something that runs in ~5 minutes on 8 processors, setting the number of vertical levels to 8 (the last number on line 13 of generate_data_sphere.py, changed from 48 to 8) should do the trick.
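For reference, here is what the edited line would look like; this just reproduces the quoted line with only the last number changed:

```python
# Line 13 of generate_data_sphere.py, adjusted for an 8-processor run:
Nlon, Nlat, Ndepth = int(360//2), int(180//2), 8  # was 48 vertical levels
```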
Managed to run with 8 vertical levels and 4 processors on a laptop in ~30 minutes.
While in the article you state that
I did not find an example which demonstrates this in the Tutorials. We only see OpenMP being used, and `SLURM_NTASKS` is always set to 1. Would it be possible to construct a simple example which shows MPI parallelism? This is needed to check off an item from https://github.com/openjournals/joss-reviews/issues/4277#issuecomment-1383020819:
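For concreteness, a minimal sketch of the kind of SLURM job script that would exercise MPI parallelism (so that `SLURM_NTASKS` is no longer 1). The `--ntasks` value, the time limit, and the script name `run_analysis.py` are hypothetical, not taken from the repository:

```shell
#!/bin/bash
#SBATCH --ntasks=8          # SLURM exports this count as SLURM_NTASKS
#SBATCH --time=00:10:00

# Launch one MPI rank per SLURM task.
srun -n "${SLURM_NTASKS}" python run_analysis.py
```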