sciprog-sfu / sciprog-sfu.github.io

Scientific Programming Study Group at SFU
https://sciprog-sfu.github.io

"Parallel Computing" by Alex Razoumov #103

Closed BrunoGrandePhD closed 7 years ago

BrunoGrandePhD commented 8 years ago

Description

Join us for a beginner-level introduction to MPI (Message Passing Interface), an industry-standard library for distributed-memory parallel computing on large systems. There are implementations of MPI for all major compiled and interpreted languages, including C/C++, Python, and R, and it is the default parallel computing library on academic HPC systems, including the Compute Canada clusters. Using a simple example, we will learn how to partition a calculation across multiple processors and how to use basic send/receive and broadcast commands.

If you are thinking of parallelizing a long and/or large-memory calculation, this is a session to attend. If you already have a problem in need of parallelization, please bring it along for an in-class discussion.

Time and Place

Where: Simon Fraser University, Burnaby Campus, Library Research Commons
When: Monday, August 8th, 10:30-11:30 am

Registration

REGISTER HERE

Required Preparation

Software Dependencies

SSH

You can find the details on how to obtain a WestGrid account here. If you need a sponsor (normally your supervisor) to apply for an account, please contact Alex Razoumov, and he'll send you his CCRI, which you will need when filling in the application form.

⟶ It would be best if all attendees get their accounts a few days before the workshop.

If you don't have an account, you can still attend the workshop, but you won't be able to do hands-on exercises -- you'll still be able to watch the presentation and participate in the discussion.

Notes

Basic concepts

Why do you want to parallelize your code?

Some people talk about task parallelism vs. data parallelism - the divide is not always clear, so I prefer not to use this terminology.

There are also embarrassingly parallel problems: often you can simply do serial farming (running many independent serial jobs at once), with no need to parallelize the code itself (see the example below).
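For instance (an illustration I'm adding, with hypothetical file names), on a Torque/PBS cluster serial farming can be done with an array job that launches many copies of the same serial script:

$ qsub -t 1-100 serial.pbs   # submit 100 independent serial jobs

where serial.pbs might look like

#!/bin/bash
#PBS -l nodes=1:ppn=1,walltime=1:00:00
cd $PBS_O_WORKDIR
./serialCode input.$PBS_ARRAYID   # each array element reads its own input file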

In general, whatever parallelization strategy you choose, you want to minimize the amount of communication. Remember: I/O and the network are usually the bottlenecks -- historically, processor speeds have grown much faster than disk and network bandwidth.

Amdahl's law
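A quick reminder (standard result, not specific to these notes): if a fraction $p$ of the runtime can be parallelized and the rest is serial, the speedup on $N$ processors is

$S(N) = \frac{1}{(1-p) + p/N} \le \frac{1}{1-p}$

so with $p = 0.9$, for example, the speedup can never exceed 10X, no matter how many processors you throw at the problem.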

Parallel hardware architecture determines the parallel programming model.

Distributed-memory programming paradigms: the most widely used one is MPI, which is the focus of this session.

Python and R have terrible native performance, so it might not be such a good idea to parallelize these codes in the first place! Both are interpreted scripting languages designed for ease of use and a high level of abstraction, not for performance. There are exceptions to this rule - can you name them? (One is sketched below.)
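One such exception (a sketch I'm adding here, not part of the original notes): numerical Python code that is vectorized with NumPy spends most of its time in compiled library routines. The same midpoint sum as in pi.py below, written with NumPy:

import numpy as np
from math import pi

n = 1000000
h = 1./n
x = h * (np.arange(n) + 0.5)        # all midpoints at once, no explicit Python loop
s = h * np.sum(4. / (1. + x**2))    # the whole sum runs in compiled NumPy code
print(s, abs(s - pi))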

Try to optimize your algorithm before parallelizing it! Don't use inefficient algorithms, awkward data structures, or slow languages (including Java), and don't reinvent the wheel by coding everything from scratch. Do use precompiled libraries, optimization flags, and profiling tools, and think about the bottlenecks in your workflow and about the overall code design.

Always think of the bottlenecks! For example, with data processing, running 100 I/O-intensive processes on 100 cores on a cluster will not make it 100X faster - why?

Python vs. C timing

Let's compute \pi via numerical integration.

$\pi = 4 \arctan(1)$

$\arctan(1) = \int_0^1 \frac{dx}{1+x^2}$

$\pi \approx h \sum_{i=1}^{n} \frac{4}{1+x_i^2}, \qquad x_i = h \left( i - \tfrac{1}{2} \right), \quad h = \frac{1}{n}$

First let's take a look at the serial code pi.c

#include <stdio.h>
#include <math.h>
#define pi 3.14159265358979323846

int main(int argc, char *argv[])
{
  double h, sum, x;
  int n, i;

  n = 1000000;
  h = 1./n;
  sum = 0.;

  for (i = 1; i <= n; i++) {
    x = h * ( i - 0.5 );    /* midpoint of the i-th subinterval */
    sum += 4. / ( 1. + pow(x,2));
  }

  sum *= h;
  printf("%.17g  %.17g\n", sum, fabs(sum-pi));

  return 0;
}
$ gcc pi.c -o pi
$ time ./pi
3.1415926535897643  2.886579864025407e-14
time will be around 50ms

We can ask the compiler to optimize the code.

$ gcc -O2 pi.c -o pi
$ time ./pi
3.1415926535897643  2.886579864025407e-14
time will be around 10ms

Now let's look at the same algorithm in pi.py

from math import pi

n = 1000000
h = 1./n
sum = 0.

for i in range(n):
    x = h * ( i + 0.5 )
    sum += 4. / ( 1. + x**2)

sum *= h
print(sum, abs(sum-pi))
$ time python pi.py
3.1415926535897643 2.886579864025407e-14
time will be over 500ms

This is a 50X performance drop compared to compiler-optimized C code on my laptop! On a cluster's compute node I get 80X. If you write a PDE solver in a compiled language, you'll likely see a 100X-300X drop in performance when switching to Python.

Then why use Python?

Does it make sense to parallelize a Python code?

Cluster environment

Normally on a cluster you need to submit jobs to the scheduler. Results (output and error files tagged with the jobID) usually go into the directory into which you cd inside the job's submission script.

$ qsub production.bat        # submit a job
$ qstat -u username          # list your queued and running jobs
$ showq -w user=username     # another view of the queue
$ qdel jobID                 # cancel a job
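For reference, a minimal sketch of what a submission script such as production.bat might contain (my example, not the original file), requesting 4 processors to match the mpirun -np 4 runs later on and reusing the module names shown elsewhere on this page:

#!/bin/bash
#PBS -l nodes=1:ppn=4,walltime=0:30:00,pmem=2000mb
#PBS -N parallelPi
cd $PBS_O_WORKDIR                        # run in the directory the job was submitted from
module load library/openmpi/1.6.5-gnu
module load application/python/2.7.3
mpiexec python parallelPi.py             # runs on the processors allocated to the job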

For debugging and testing you can start an interactive job specifying the resources from the command line. The job will start on "interactive" nodes with shorter runtime limits (to ensure that your job will start soon).

$ qsub -I -l nodes=1:ppn=1,walltime=0:30:00,pmem=2000mb
... wait for the shell to start ...
$ mpiexec parallelCode   # will run on the allocated (to your job) number of processors

However, this might take a while (from seconds to many minutes) if the system is busy and the scheduler is overwhelmed, even if the "interactive" nodes are idle. For the duration of this workshop only, we have reserved node cl2n230 on the Jasper cluster, where you can work interactively.

$ ssh jasper.westgrid.ca
$ ssh cl2n230
$ module load library/openmpi/1.6.5-gnu
$ module load application/python/2.7.3

Plan B: if this does not work, you can use an interactive node (b402 or b403) on bugaboo (OK for quick testing only!).

Once you are on the interactive node, cd into a temporary directory and be prepared to run a parallel code.

$ cd newUserSeminar
$ etime() { /usr/bin/time -f "elapsed: %e seconds" $@; } # formatted output works only in Linux
$ mpirun -np numProcs parallelCode

We'll now take a look at mpi4py, which provides Python bindings for MPI. There are two versions of each MPI command in mpi4py: the uppercase methods (e.g. Send, Recv, Reduce) work with buffer-like objects such as NumPy arrays and run at near-C speed, while the lowercase methods (e.g. send, recv, reduce) work with arbitrary (picklable) Python objects and are slower.
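Here is a small illustration of the difference (my addition, not from the original notes); run it with something like mpirun -np 2 python twoVersions.py, where the file name is hypothetical:

from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# lowercase: any picklable Python object, convenient but slower
if rank == 0:
    comm.send({'step': 1, 'note': 'hello'}, dest = 1, tag = 11)
elif rank == 1:
    obj = comm.recv(source = 0, tag = 11)
    print 'rank 1 received', obj

# uppercase: buffer-like objects (e.g. NumPy arrays), near-C speed
data = np.arange(5, dtype = 'd') if rank == 0 else np.empty(5, dtype = 'd')
if rank == 0:
    comm.Send(data, dest = 1, tag = 22)
elif rank == 1:
    comm.Recv(data, source = 0, tag = 22)
    print 'rank 1 received', data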

Let's try running the following code (parallelPi.py) adding the lines one-by-one:

from math import pi
from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n = 1000000
h = 1./n
sum = 0.

# if rank == 0:
#     print 'Calculating PI with', size, 'processes'

# print 'process', rank, 'of', size, 'started'

for i in range(rank, n, size):   # cyclic distribution: each rank handles every size-th interval
    # print rank, i
    x = h * ( i + 0.5 )
    sum += 4. / ( 1. + x**2)

local = np.zeros(1)
total = np.zeros(1)
local[0] = sum*h
comm.Reduce(local, total, op = MPI.SUM)   # sum the partial results onto rank 0
if rank == 0:
    print total[0], abs(total[0]-pi)
$ etime mpirun -np 4 python parallelPi.py

Compare the runtimes to the serial code

$ etime python pi.py

For n = 1,000,000 we get a slowdown from 0.8s to 1.9s! However, for n = 10,000,000 we get a speedup from 5.6s to 3.3s. And for n = 100,000,000 we get a speedup from 54s to 16s, getting closer to 4X.

Now let's compare Python MPI syntax to C MPI in parallelPi.c

#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  double total, h, sum, x;
  int n, rank, numprocs, i;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

  n = 10000;
  h = 1./n;
  sum = 0.;

  if (rank == 0)
    printf("Calculating PI with %d processes\n", numprocs);

  printf("process %d started\n", rank);

  for (i = rank+1; i <= n; i += numprocs) {
    x = h * ( i - 0.5 );    //calculate at center of interval
    sum += 4.0 / ( 1.0 + pow(x,2));
  }

  sum *= h;
  MPI_Reduce(&sum,&total,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);

  if (rank == 0) {
    printf("%f\n",total);
  }

  MPI_Finalize();
  return 0;
}
$ mpicc parallelPi.c -o pi
$ etime mpirun -np 4 ./pi

Reduce() is an example of a collective operation.

Major MPI functions

Examples of point-to-point communication functions are Comm.Send(buf, dest = 0, tag = 0) and Comm.Recv(buf, source = 0, tag = 0, Status status = None).

Examples of collective communication functions are Comm.Reduce(sendbuf, recvbuf, Op op = MPI.SUM, root = 0) and Comm.Allreduce(sendbuf, recvbuf, Op op = MPI.SUM) -- where the reduction operation can be MPI.MAX, MPI.MIN, MPI.SUM, MPI.PROD, MPI.LAND, MPI.BAND, MPI.LOR, MPI.BOR, MPI.LXOR, MPI.BXOR, MPI.MAXLOC, or MPI.MINLOC -- as well as Comm.Bcast(buf, root = 0) (send the same data to every process) and Comm.Scatter(sendbuf, recvbuf, root) (send a different part to each process).
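Since Bcast and Scatter do not appear in the examples above, here is a short sketch of both (my addition, with arbitrary array sizes); run it with mpirun -np 4 python collectives.py, where the file name is hypothetical:

from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Bcast: rank 0 sends the same array to all processes
params = np.zeros(3)
if rank == 0:
    params[:] = [1.0, 2.0, 3.0]
comm.Bcast(params, root = 0)

# Scatter: rank 0 splits an array into equal chunks, one chunk per process
chunk = np.zeros(2)
sendbuf = None
if rank == 0:
    sendbuf = np.arange(2*size, dtype = 'd')   # two elements per process
comm.Scatter(sendbuf, chunk, root = 0)

print 'rank', rank, 'params', params, 'chunk', chunk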

The C MPI library has 130+ communication functions; mpi4py exposes most of them as well.

Point-to-point example

Here is a code (point2point.py) demonstrating point-to-point communication, passing a number to the left neighbour around a ring of processes:

from math import pi
from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

left = rank - 1
if rank == 0: left = size - 1

right = rank + 1
if rank == size - 1: right = 0

# print rank, left, right

square = np.zeros(1)
result = np.zeros(1)
square[0] = float(rank)**2
comm.Send(square, dest = left)    # blocking send; safe here only because the message is tiny and is sent eagerly
comm.Recv(result, source = right)

print rank, result[0]

Exercise: write an MPI code for two processors in which each processor sends a number to the other one and receives a number. Use separate comm.Send() and comm.Recv() functions. Is their order important?
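One possible sketch for this exercise (my addition, showing one of several correct orderings): let rank 0 send first and rank 1 receive first. If both ranks call the blocking Send before Recv, the program may deadlock once the messages are too large to be buffered, which is why the order matters.

from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

outgoing = np.array([float(rank)])
incoming = np.zeros(1)

if rank == 0:
    comm.Send(outgoing, dest = 1)
    comm.Recv(incoming, source = 1)
elif rank == 1:
    comm.Recv(incoming, source = 0)
    comm.Send(outgoing, dest = 0)

print 'rank', rank, 'received', incoming[0]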

Exercise: write another code to compute \pi using a series

$\pi = 4 \sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1}$

on an arbitrary number of processors, i.e., the code should work on any number of cores. Run it on 1, 2, 4, 8 cores and measure speedup. Do you get 100% parallel efficiency?

Discussion: how would you parallelize the diffusion equation?

$\frac{\partial u}{\partial t} = \kappa \nabla^2 u$

Discussion: how would you parallelize your own problem?

Attending SciProg Organizers

BrunoGrandePhD commented 8 years ago

@razoumov, when you get a chance, comment with the details relating to your workshop so I can update the "TBA" values. Or you can edit the issue directly.

BrunoGrandePhD commented 8 years ago

@razoumov, when you get a chance after your trip, could you confirm whether SSH is all that is necessary for your workshop? That would mean it's pre-installed on OS X and Linux and something like PuTTY needs to be installed on Windows computers.

razoumov commented 8 years ago

@brunogrande Participants will need ssh (for Windows we recommend http://mobaxterm.mobatek.net but really any client will do) and a WestGrid account. You can find the details on how to obtain an account at https://www.westgrid.ca/support/accounts/getting_account . If attendees need a sponsor (who is usually their supervisor), please tell them to contact me, and I'll send them my CCRI, which they will need when filling in the application form. It would be best if all attendees get their accounts a few days before the workshop.

BrunoGrandePhD commented 8 years ago

Thanks, @razoumov!

lpix commented 8 years ago

@razoumov do you prefer attendees to contact you by e-mail or through this issue? I am sending out an e-mail Friday afternoon-ish and I don't want to be the reason you get spammed in the future :)

razoumov commented 8 years ago

@lpix They can contact me via my WestGrid email listed at https://www.westgrid.ca/about_westgrid/staff_committees/staff . Thanks!