jasperzhong commented 3 years ago

Writing MPI programs reminds me of writing CUDA. The two feel somewhat alike.

jasperzhong commented 3 years ago

Install MPICH on Ubuntu:

sudo apt-get install mpich

MPI Hello World

For the constants that MPI defines, see https://linux.die.net/man/3/mpi_comm_world

MPI_COMM_WORLD: the predefined communicator, of type MPI_Comm, that contains all of the processes.

Compile with mpicc (C) or mpic++ (C++), and run with mpirun:

mpirun -np 4 ./mpi_hello_world

jasperzhong commented 3 years ago

Blocking Send & Recv

MPI_Send https://www.mpich.org/static/docs/v3.3/www3/MPI_Send.html

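Blocks until the send buffer is safe to reuse. Depending on the implementation and the message size, the message may have been buffered internally rather than already received by the destination process.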

MPI_Send(
    void* data,
    int count,
    MPI_Datatype datatype,
    int destination,
    int tag,
    MPI_Comm communicator)

count is the number of elements to send (elements, not bytes).

MPI_Recv https://www.mpich.org/static/docs/latest/www3/MPI_Recv.html

Blocks until a matching message has been received.

MPI_Recv(
    void* data,
    int count,
    MPI_Datatype datatype,
    int source,
    int tag,
    MPI_Comm communicator,
    MPI_Status* status)

count is the maximum number of elements to receive. The actual number received can be obtained with MPI_Get_count.

The ring example is fun: it implements the send pattern 0 -> 1 -> ... -> n - 1 -> 0.
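The neighbor arithmetic behind that ring can be sketched in plain C. This is only a model of the rank computation, with no MPI calls; the helper names ring_next and ring_prev are made up for illustration and do not come from the tutorial.

```c
#include <assert.h>

/* Rank that `rank` sends to in a ring of `size` processes.
 * The last rank wraps around to rank 0. */
int ring_next(int rank, int size) {
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank + 1) % size;
}

/* Rank that `rank` receives from in the same ring.
 * Adding `size` before the modulo keeps the result non-negative for rank 0. */
int ring_prev(int rank, int size) {
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank + size - 1) % size;
}
```

In an actual MPI program these two values would be the destination argument of MPI_Send and the source argument of MPI_Recv respectively.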

Dynamic Length

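MPI_Status also makes it possible to receive messages whose length is not known in advance. MPI_Get_count returns the number of elements actually received. There are two approaches:

First: pass a status to MPI_Recv, then call MPI_Get_count on it to get the actual length.
Second: call MPI_Probe first to obtain the status of the pending message, read its length with MPI_Get_count, and use exactly that length as the receive count.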

jasperzhong commented 3 years ago

Collective Communication

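Collective communication acts as a synchronization point: every process in the communicator has to reach the call. (Strictly speaking, only MPI_Barrier guarantees that no process leaves before all have arrived; other collectives are allowed to return once the caller's own part is done.)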

One of the things to remember about collective communication is that it implies a synchronization point among processes. This means that all processes must reach a point in their code before they can all begin executing again.

MPI has a dedicated synchronization API: MPI_Barrier

int MPI_Barrier(MPI_Comm comm)

The official MPICH documentation describes it as follows:

Blocks the caller until all processes in the communicator have called it; that is, the call returns at any process only after all members of the communicator have entered the call.

[Figure: MPI_Barrier illustration]

One of the simplest ways to implement MPI_Barrier is in a ring-like fashion.

Broadcast

Function signature:

int MPI_Bcast( void *buffer, int count, MPI_Datatype datatype, int root, 
               MPI_Comm comm )

One process acts as the root and sends the data; all other processes receive it.

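The simplest implementation has the root send to the non-root processes one by one, but that is very inefficient, because a process that has already received the data could forward it! Call this naive version my_cast.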

An efficient implementation is therefore tree-based and proceeds in stages:

stage 1: 0 -> 1
stage 2: 0 -> 2, 1 -> 3
stage 3: 0 -> 4, 1 -> 5, 2 -> 6, 3 -> 7

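The number of processes spreading the data doubles at every stage! A tree-based MPI_Bcast therefore finishes in O(log N) stages, instead of the O(N) sequential sends of my_cast.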
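The stage list above follows a simple rule: at stage s, every rank r below 2^(s-1) sends to rank r + 2^(s-1). A minimal sketch of that schedule in plain C follows. It only computes who sends to whom and when, and contains no MPI calls; the function names are hypothetical, and real MPI implementations may order their tree differently.

```c
#include <assert.h>

/* Stage (1-based) at which `rank` receives the data; 0 for the root.
 * This is the position of the highest set bit of `rank`. */
int bcast_recv_stage(int rank) {
    assert(rank >= 0);
    int stage = 0;
    while (rank > 0) {
        rank >>= 1;
        stage++;
    }
    return stage;
}

/* Rank that sends the data to `rank`: `rank` with its highest set bit
 * cleared. The root (rank 0) has no sender, signalled by -1. */
int bcast_sender(int rank) {
    assert(rank >= 0);
    if (rank == 0) {
        return -1;
    }
    return rank - (1 << (bcast_recv_stage(rank) - 1));
}

/* Number of stages needed to reach `size` processes, i.e. ceil(log2(size)).
 * The last rank, size - 1, is always among the last to receive. */
int bcast_num_stages(int size) {
    assert(size > 0);
    return bcast_recv_stage(size - 1);
}
```

For 8 processes this gives 3 stages, and rank 5 receives from rank 1 at stage 3, matching the list above.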

Another implementation, used for long messages, is scatter followed by allgather.

Broadcast 100 million integers, repeat the experiment 10 times, and take the average time. Comparing my_cast with MPI_Bcast gives the following table:

Processors	my_cast	MPI_Bcast
2	0.072	0.066
4	0.256	0.187
8	0.760	0.576
16	1.780	0.815
32	3.547	1.295
64	8.098	2.864

Plotting these numbers makes it clear that MPI_Bcast scales much better.

jasperzhong commented 3 years ago

Scatter, Gather, Allgather

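MPI_Scatter

Function signature. Note that sendcount is per process, not the total. With sendcount = 1, process 0 gets the first element and process 1 gets the second. With sendcount = 2, process 0 gets the first and second elements, process 1 gets the third and fourth, and so on.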

int MPI_Scatter(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
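The per-process meaning of sendcount can be modeled in plain C: rank r receives the sendcount elements starting at index r * sendcount of the root's buffer. The sketch below assumes int elements and uses a made-up helper name, scatter_chunk; it copies one rank's chunk locally and performs no communication.

```c
#include <assert.h>
#include <string.h>

/* Copy into `recvbuf` the chunk that `rank` would receive if the root
 * scattered `sendbuf` with `sendcount` ints per process.
 * `sendbuf` must hold at least (rank + 1) * sendcount ints. */
void scatter_chunk(const int *sendbuf, int sendcount, int rank, int *recvbuf) {
    assert(sendcount >= 0 && rank >= 0);
    /* The chunk for `rank` starts at offset rank * sendcount. */
    memcpy(recvbuf, sendbuf + (size_t)rank * sendcount,
           (size_t)sendcount * sizeof(int));
}
```

With a root buffer of {10, 11, 12, 13, 14, 15}, sendcount = 2 gives rank 1 the elements {12, 13}, while sendcount = 1 gives rank 1 only {11}.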

MPI_Gather

Function signature. Note that recvcount here, like sendcount in MPI_Scatter, is the count per process, not the total.

int MPI_Gather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
               void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

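The parallel average example is a nice one: the root process first scatters the data to all processes, each process computes the average of its own block, the root gathers the local averages, and finally it computes the global average from them.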
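The arithmetic of that example can be checked with a serial model in plain C. The sketch below loops over the ranks instead of communicating, so the scatter and gather steps are only simulated; parallel_average and MAX_PROCS are names invented here, and the model assumes the data divides evenly among the processes.

```c
#include <assert.h>

#define MAX_PROCS 64 /* arbitrary upper bound for this model */

/* Plain average of n doubles. */
double average(const double *x, int n) {
    assert(n > 0);
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += x[i];
    }
    return sum / n;
}

/* Serial model of: scatter -> local average -> gather -> global average.
 * Requires n to be a multiple of nprocs so that every block has equal size. */
double parallel_average(const double *data, int n, int nprocs) {
    assert(nprocs > 0 && nprocs <= MAX_PROCS);
    assert(n > 0 && n % nprocs == 0);
    int per_proc = n / nprocs;    /* plays the role of sendcount */
    double local_avgs[MAX_PROCS]; /* what the root would gather */
    for (int rank = 0; rank < nprocs; rank++) {
        /* Each rank averages its own block of per_proc elements. */
        local_avgs[rank] = average(data + rank * per_proc, per_proc);
    }
    /* The root averages the gathered local averages. */
    return average(local_avgs, nprocs);
}
```

Averaging the local averages equals the true global average only because every block has the same number of elements; with uneven blocks the local results would need to be weighted by block size.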

The flow is scatter -> gather. Why does that sound so familiar? I feel like I have seen it somewhere before.

MPI_Allgather

The collectives covered so far (bcast, scatter, gather) are all many-to-one or one-to-many. Many-to-many communication is also very common.

Allgather = Gather + Bcast

[Figure: MPI_Allgather illustration]

Function signature. It looks much like MPI_Gather, except that there is no root process.

int MPI_Allgather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                  void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)

The parallel rank example at the end is a good exercise. The flow is: gather -> rank -> scatter.

jasperzhong commented 3 years ago

Reduce, All-Reduce

MPI_Reduce

Function signature. MPI_Op is the reduction operation; common ones include MPI_MIN, MPI_MAX, MPI_SUM, and MPI_PROD.

int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
               MPI_Op op, int root, MPI_Comm comm)

The example uses MPI_Reduce to compute a global sum.

MPI_Allreduce

Haha, finally we get to this function, an old acquaintance. Allreduce = Reduce + Bcast.

Function signature. Essentially the same as MPI_Reduce, except that there is no root.

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

The example computes a standard deviation. It needs one All-Reduce to compute the mean and one Reduce to sum the squared differences.
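Those two steps can be modeled serially in plain C. In the sketch below the loops over ranks stand in for the MPI_Allreduce and MPI_Reduce calls (both with a sum operation); nothing is actually communicated, the function name variance_model is hypothetical, and the data is assumed to divide evenly among the processes. It returns the population variance, so the standard deviation is its square root.

```c
#include <assert.h>

/* Serial model of the two-step computation:
 *   step 1 (stands in for Allreduce/SUM): every rank learns the global sum,
 *          hence the mean;
 *   step 2 (stands in for Reduce/SUM): the root sums the local squared
 *          differences from that mean.
 * Returns the population variance. */
double variance_model(const double *data, int n, int nprocs) {
    assert(nprocs > 0 && n > 0 && n % nprocs == 0);
    int per_proc = n / nprocs;

    /* Step 1: each rank contributes the sum of its block. */
    double global_sum = 0.0;
    for (int rank = 0; rank < nprocs; rank++) {
        double local_sum = 0.0;
        for (int i = 0; i < per_proc; i++) {
            local_sum += data[rank * per_proc + i];
        }
        global_sum += local_sum;
    }
    double mean = global_sum / n; /* available on every rank after step 1 */

    /* Step 2: each rank contributes its local sum of squared differences. */
    double global_sq_diff = 0.0;
    for (int rank = 0; rank < nprocs; rank++) {
        double local_sq_diff = 0.0;
        for (int i = 0; i < per_proc; i++) {
            double d = data[rank * per_proc + i] - mean;
            local_sq_diff += d * d;
        }
        global_sq_diff += local_sq_diff;
    }
    return global_sq_diff / n;
}
```

The first step has to be an All-Reduce rather than a Reduce because every process needs the mean before it can compute its own squared differences; the second result is only needed on the root.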

jasperzhong commented 3 years ago

Groups and Communicators

MPI_Comm_split

A communicator can be created with the MPI_Comm_split function:

int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)

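Processes with the same color end up in the same new communicator. color must be non-negative or MPI_UNDEFINED (meaning the process is not assigned to any new communicator and gets MPI_COMM_NULL). key determines the rank order inside the new communicator; ties are broken by the rank in the old communicator.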

[Figure: one way to use MPI_Comm_split]
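The color/key rule can be modeled in plain C without any MPI calls. The sketch below computes the rank a process would get in its new communicator, given the colors and keys chosen by all processes. SPLIT_UNDEFINED is a hypothetical stand-in for MPI_UNDEFINED (whose real value is implementation-defined), and split_new_rank is a name invented for this illustration.

```c
#include <assert.h>

#define SPLIT_UNDEFINED (-1) /* hypothetical stand-in for MPI_UNDEFINED */

/* Rank that process `me` receives in its new communicator when `size`
 * processes split with the given colors and keys (indexed by old rank).
 * Returns -1 if `me` chose SPLIT_UNDEFINED and joins no communicator. */
int split_new_rank(const int *colors, const int *keys, int size, int me) {
    assert(size > 0 && me >= 0 && me < size);
    if (colors[me] == SPLIT_UNDEFINED) {
        return -1;
    }
    int new_rank = 0;
    for (int r = 0; r < size; r++) {
        if (r == me || colors[r] != colors[me]) {
            continue; /* only members of the same color matter */
        }
        /* r is ordered before me if its key is smaller, or if the keys
         * are equal and its old rank is lower. */
        if (keys[r] < keys[me] || (keys[r] == keys[me] && r < me)) {
            new_rank++;
        }
    }
    return new_rank;
}
```

As an assumed example, with 8 processes, color = rank / 4 and key = rank, the world splits into two communicators of four processes each, and old rank 5 becomes rank 1 of the second one.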

MPI_Comm_create_group

The other approach is more direct: build the communicator from an MPI_Group. The relationship between MPI_Group and MPI_Comm is: MPI_Comm = id + MPI_Group.

Function signature. It creates an MPI_Comm directly from an MPI_Group, which is very convenient.

int MPI_Comm_create_group(MPI_Comm comm, MPI_Group group, int tag, MPI_Comm * newcomm)

One more function is needed: MPI_Group_incl. It selects processes from a group by their ranks to form a new group.

int MPI_Group_incl(MPI_Group group, int n, const int ranks[], MPI_Group * newgroup)

The example creates a group containing the processes with prime ranks.
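Building the ranks array for such a group is ordinary C. The sketch below fills an array with the prime ranks below a given world size; the array and its length are the kind of values that would be passed as ranks[] and n to MPI_Group_incl. The function names are hypothetical, and the exact list of ranks used in the original example is not reproduced here.

```c
#include <assert.h>

/* Returns 1 if n is prime, 0 otherwise (trial division). */
static int is_prime(int n) {
    if (n < 2) {
        return 0;
    }
    for (int d = 2; d * d <= n; d++) {
        if (n % d == 0) {
            return 0;
        }
    }
    return 1;
}

/* Fill `ranks` with every prime rank in [0, world_size) in ascending order
 * and return how many were written. `ranks` must have room for
 * world_size entries. */
int prime_ranks(int world_size, int *ranks) {
    assert(world_size >= 0);
    int n = 0;
    for (int r = 0; r < world_size; r++) {
        if (is_prime(r)) {
            ranks[n++] = r;
        }
    }
    return n;
}
```

For a world of 16 processes this selects ranks 2, 3, 5, 7, 11, and 13.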

jasperzhong commented 3 years ago

I left out an important topic: mpirun

mpirun

The typical usage is

mpirun -np <number of processes> <program name and arguments>

According to the documentation, both -n and -np work.

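Beyond that, environment variables can be passed as options. There are two kinds, global and local:

Global environment variable: -genv {name} {value}
Local environment variable: -env {name} {value}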

Another commonly used option is the host file:

-f {name}

The hosts can also be given directly with -hosts ...

mpirun even supports checkpointing??

I had a look at the documentation: https://www.mpich.org/static/downloads/3.3.1/mpich-3.3.1-userguide.pdf. It appears to save the state of all processes on a node into a single file.

jasperzhong / cs-notes

learn MPI #18
