jasperzhong / cs-notes

CS knowledge system

learn MPI #18

Closed jasperzhong closed 3 years ago

jasperzhong commented 3 years ago

https://mpitutorial.com/tutorials/

Writing MPI programs reminds me of writing CUDA. They feel somewhat similar.

jasperzhong commented 3 years ago

Install MPICH on Ubuntu

sudo apt-get install mpich

MPI Hello World

MPI-defined constants: see https://linux.die.net/man/3/mpi_comm_world

Compile with mpicc/mpic++ and run with mpirun:

mpirun -np 4 ./mpi_hello_world

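A minimal hello world along the lines of the tutorial (a sketch, not the tutorial's exact code); build it with mpicc and run it with the command above:

// mpi_hello_world.c: each process reports its rank, the world size, and its host.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);                       // set up the MPI environment

  int world_size, world_rank;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);   // total number of processes
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);   // this process's rank

  char name[MPI_MAX_PROCESSOR_NAME];
  int name_len;
  MPI_Get_processor_name(name, &name_len);

  printf("Hello world from %s, rank %d out of %d\n", name, world_rank, world_size);

  MPI_Finalize();                               // tear down the MPI environment
  return 0;
}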

jasperzhong commented 3 years ago

Blocking Send & Recv

MPI_Send https://www.mpich.org/static/docs/v3.3/www3/MPI_Send.html

Blocks until the send buffer can safely be reused. For large messages this effectively means waiting until the destination has started receiving; small messages may be buffered, so the call can return before the destination actually receives them.

MPI_Send(
    void* data,
    int count,
    MPI_Datatype datatype,
    int destination,
    int tag,
    MPI_Comm communicator)

count is the number of elements to send.

MPI_Recv https://www.mpich.org/static/docs/latest/www3/MPI_Recv.html

Blocks until a matching message has been received.

MPI_Recv(
    void* data,
    int count,
    MPI_Datatype datatype,
    int source,
    int tag,
    MPI_Comm communicator,
    MPI_Status* status)

count is the maximum number of elements to receive. The actual number received can be obtained with MPI_Get_count.


The ring example is fun: it implements the send pattern 0 -> 1 -> ... -> n - 1 -> 0 (sketch below).

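A sketch of that ring (not the tutorial's exact code; assumes at least 2 processes). Rank 0 sends first and receives last, so the blocking calls cannot deadlock:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size, token;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (rank != 0) {
    // Wait for the token from the previous rank before forwarding it.
    MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank %d received token %d from rank %d\n", rank, token, rank - 1);
  } else {
    token = -1;  // rank 0 kicks off the ring
  }
  MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);

  if (rank == 0) {
    // The token finally comes back around from the last rank.
    MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank 0 received token %d from rank %d\n", token, size - 1);
  }
  MPI_Finalize();
  return 0;
}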

Dynamic Length

MPI_Status also makes it possible to handle messages of dynamic length. MPI_Get_count returns the actual number of elements received. There are two approaches: receive into a buffer sized for the largest possible message and query the returned MPI_Status with MPI_Get_count afterwards, or call MPI_Probe first to learn the incoming size and allocate exactly that much before the MPI_Recv.
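A sketch of the probe-first approach (my own illustration, not the tutorial's exact code): rank 0 sends a random number of ints, and rank 1 learns the size via MPI_Probe + MPI_Get_count before allocating and receiving.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    srand(time(NULL));
    int n = rand() % 100 + 1;                      // the receiver does not know n
    int *numbers = calloc(n, sizeof(int));
    MPI_Send(numbers, n, MPI_INT, 1, 0, MPI_COMM_WORLD);
    printf("rank 0 sent %d ints\n", n);
    free(numbers);
  } else if (rank == 1) {
    MPI_Status status;
    MPI_Probe(0, 0, MPI_COMM_WORLD, &status);      // block until a message is pending
    int n;
    MPI_Get_count(&status, MPI_INT, &n);           // actual number of MPI_INTs
    int *numbers = malloc(sizeof(int) * n);
    MPI_Recv(numbers, n, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank 1 received %d ints\n", n);
    free(numbers);
  }
  MPI_Finalize();
  return 0;
}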

jasperzhong commented 3 years ago

Collective Communication

Collective communication is a synchronizing operation: every process must reach the call before any of them can continue.

One of the things to remember about collective communication is that it implies a synchronization point among processes. This means that all processes must reach a point in their code before they can all begin executing again.

MPI also has a dedicated synchronization API: MPI_Barrier.

int MPI_Barrier(MPI_Comm comm)

The MPICH documentation explains it as follows:

Blocks the caller until all processes in the communicator have called it; that is, the call returns at any process only after all members of the communicator have entered the call.

The tutorial has a diagram illustrating MPI_Barrier.

One of the simplest ways to implement MPI_Barrier is a ring-like scheme, sketched below.
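A sketch of such a ring barrier built from blocking Send/Recv (my own illustration, not MPICH's actual implementation): a token travels the ring twice, once to prove that everyone has entered and once to release everyone.

#include <mpi.h>

void ring_barrier(MPI_Comm comm) {
  int rank, size, token = 0;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  if (size == 1) return;

  int next = (rank + 1) % size;
  int prev = (rank - 1 + size) % size;
  if (rank == 0) {
    // Pass 1 (tag 0): completes only after every rank has entered the barrier.
    MPI_Send(&token, 1, MPI_INT, next, 0, comm);
    MPI_Recv(&token, 1, MPI_INT, prev, 0, comm, MPI_STATUS_IGNORE);
    // Pass 2 (tag 1): releases every rank.
    MPI_Send(&token, 1, MPI_INT, next, 1, comm);
    MPI_Recv(&token, 1, MPI_INT, prev, 1, comm, MPI_STATUS_IGNORE);
  } else {
    MPI_Recv(&token, 1, MPI_INT, prev, 0, comm, MPI_STATUS_IGNORE);
    MPI_Send(&token, 1, MPI_INT, next, 0, comm);
    MPI_Recv(&token, 1, MPI_INT, prev, 1, comm, MPI_STATUS_IGNORE);
    MPI_Send(&token, 1, MPI_INT, next, 1, comm);
  }
}

This needs O(N) sequential messages, which is why real implementations prefer tree or dissemination algorithms, but it shows that a barrier is just message passing underneath.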

Broadcast

Function definition:

int MPI_Bcast( void *buffer, int count, MPI_Datatype datatype, int root, 
               MPI_Comm comm )

There is a root process that sends the data; all other processes receive it.

The simplest implementation would have the root send to every non-root process one by one, but that is very inefficient, because a process that has already received the data could help forward it! Call this naive version my_cast (sketched further below).

An efficient implementation is therefore tree-based. With 8 processes, the propagation proceeds in stages:

  1. stage 1: 0 -> 1
  2. stage 2: 0 -> 2, 1 -> 3
  3. stage 3: 0 -> 4, 1 -> 5, 2 -> 6, 3 -> 7

The number of processes participating in the propagation doubles at every stage, so MPI_Bcast takes O(log N) steps rather than O(N)!

Another implementation strategy, used for long messages, is scatter followed by allgather.
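For reference, a naive my_cast as described above might look like this (a sketch; the author's benchmarked version may differ):

// Naive broadcast: the root sends the whole buffer to every other rank, one by one,
// so the root performs O(N) sends and nobody helps forward the data.
#include <mpi.h>

void my_cast(void *data, int count, MPI_Datatype datatype, int root, MPI_Comm comm) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  if (rank == root) {
    for (int i = 0; i < size; i++) {
      if (i != root) MPI_Send(data, count, datatype, i, 0, comm);
    }
  } else {
    MPI_Recv(data, count, datatype, root, 0, comm, MPI_STATUS_IGNORE);
  }
}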

Benchmark: broadcast 100 million integers, repeat the experiment 10 times, and average the time. Comparing my_cast against MPI_Bcast gives the following results:

Processes   my_cast   MPI_Bcast
2           0.072     0.066
4           0.256     0.187
8           0.760     0.576
16          1.780     0.815
32          3.547     1.295
64          8.098     2.864

Plotting these numbers shows that MPI_Bcast scales noticeably better.

jasperzhong commented 3 years ago

Scatter, Gather, Allgather

MPI_Scatter function definition below. Note that sendcount is per process, not the total: with sendcount = 1, process 0 gets the first element and process 1 gets the second; with sendcount = 2, process 0 gets the first two elements, process 1 gets the next two, and so on.

int MPI_Scatter(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

MPI_Gather

Function definition. Like sendcount in MPI_Scatter, recvcount here is the per-process count, not the total.

int MPI_Gather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
               void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

The parallel average example is a good one: the root scatters the data across the processes, each process computes the average of its own block, the root gathers the local averages, and finally combines them into the global average (sketch below).

The flow is scatter -> gather. Sounds strangely familiar, as if I have seen it somewhere before.

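A sketch of that flow (along the lines of the tutorial's avg example, not its exact code; assumes the data divides evenly across processes):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int elements_per_proc = 1000;
  float *all = NULL;
  if (rank == 0) {
    // Root owns the full array.
    all = malloc(sizeof(float) * elements_per_proc * size);
    for (int i = 0; i < elements_per_proc * size; i++) all[i] = rand() / (float)RAND_MAX;
  }

  // Scatter one block to each process.
  float *block = malloc(sizeof(float) * elements_per_proc);
  MPI_Scatter(all, elements_per_proc, MPI_FLOAT,
              block, elements_per_proc, MPI_FLOAT, 0, MPI_COMM_WORLD);

  // Each process averages its own block.
  float local_sum = 0;
  for (int i = 0; i < elements_per_proc; i++) local_sum += block[i];
  float local_avg = local_sum / elements_per_proc;

  // Root gathers the local averages; since the blocks are equal-sized,
  // the global average is just the mean of the local averages.
  float *local_avgs = (rank == 0) ? malloc(sizeof(float) * size) : NULL;
  MPI_Gather(&local_avg, 1, MPI_FLOAT, local_avgs, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);

  if (rank == 0) {
    float global_avg = 0;
    for (int i = 0; i < size; i++) global_avg += local_avgs[i];
    printf("global avg = %f\n", global_avg / size);
    free(all);
    free(local_avgs);
  }
  free(block);
  MPI_Finalize();
  return 0;
}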

MPI_Allgather. The collectives covered so far (bcast, scatter, gather) are all one-to-many or many-to-one. Many-to-many communication is also very common.

Allgather = Gather + Bcast


Function definition. Very similar to MPI_Gather, except there is no root process.

int MPI_Allgather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                  void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)

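A minimal MPI_Allgather sketch (my own example): every rank contributes one int (its rank), and afterwards every rank holds the full array.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int *all_ranks = malloc(sizeof(int) * size);
  MPI_Allgather(&rank, 1, MPI_INT, all_ranks, 1, MPI_INT, MPI_COMM_WORLD);

  // Unlike MPI_Gather, every rank (not just a root) ends up with the result.
  printf("rank %d sees:", rank);
  for (int i = 0; i < size; i++) printf(" %d", all_ranks[i]);
  printf("\n");

  free(all_ranks);
  MPI_Finalize();
  return 0;
}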

The parallel rank example at the end is a nice exercise. The flow is gather -> rank -> scatter.

jasperzhong commented 3 years ago

Reduce, All-Reduce

MPI_Reduce function definition. MPI_Op is the reduction operation; common ones are MPI_MIN, MPI_MAX, MPI_SUM, MPI_PROD, and so on.

int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
               MPI_Op op, int root, MPI_Comm comm)

An example uses MPI_Reduce to compute a global sum (sketch below).
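A sketch of that global-sum pattern (my own example, not the tutorial's exact code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int local_sum = rank + 1;   // stand-in for a locally computed partial sum
  int global_sum = 0;
  // Combine the partial sums with MPI_SUM; only the root gets the result.
  MPI_Reduce(&local_sum, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0) printf("global sum = %d\n", global_sum);
  MPI_Finalize();
  return 0;
}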

MPI_Allreduce. Ha, finally this function, an old friend. Allreduce = Reduce + Bcast.

Function definition. Essentially the same as MPI_Reduce, except there is no root.

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

The example computes a standard deviation: one Allreduce to get the mean and one Reduce to sum the squared differences (sketch below).
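A sketch along those lines (not the tutorial's exact code): the mean is needed by every rank, hence Allreduce; the sum of squared differences is only needed on the root, hence Reduce.

#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int n = 1000;                       // elements per rank
  float *x = malloc(sizeof(float) * n);
  srand(rank + 1);
  for (int i = 0; i < n; i++) x[i] = rand() / (float)RAND_MAX;

  // Step 1: global mean via Allreduce, so every rank has it.
  float local_sum = 0, global_sum = 0;
  for (int i = 0; i < n; i++) local_sum += x[i];
  MPI_Allreduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
  float mean = global_sum / (n * size);

  // Step 2: sum of squared differences via Reduce; only the root needs it.
  float local_sq = 0, global_sq = 0;
  for (int i = 0; i < n; i++) local_sq += (x[i] - mean) * (x[i] - mean);
  MPI_Reduce(&local_sq, &global_sq, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0) printf("stddev = %f\n", sqrtf(global_sq / (n * size)));
  free(x);
  MPI_Finalize();
  return 0;
}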

jasperzhong commented 3 years ago

Groups and Communicators

MPI_Comm_split. A new communicator can be created with the MPI_Comm_split function:

int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)

Processes with the same color are grouped into the same new communicator. color must be non-negative or MPI_UNDEFINED (meaning the process is not assigned to any new communicator). key determines the rank ordering within the new communicator.

One typical usage is sketched below.

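For example (a sketch, not the post's original code): split MPI_COMM_WORLD into row communicators of a logical grid that is 4 processes wide.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int world_rank, world_size;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // Same color -> same new communicator; key (here the world rank) orders the new ranks.
  int color = world_rank / 4;
  MPI_Comm row_comm;
  MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &row_comm);

  int row_rank, row_size;
  MPI_Comm_rank(row_comm, &row_rank);
  MPI_Comm_size(row_comm, &row_size);
  printf("world %d/%d -> row %d/%d (color %d)\n",
         world_rank, world_size, row_rank, row_size, color);

  MPI_Comm_free(&row_comm);
  MPI_Finalize();
  return 0;
}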

MPI_Comm_create_group

MPI_Comm_create_group is a more direct approach: build the communicator from an MPI_Group. The relationship between the two is roughly MPI_Comm = context id + MPI_Group.

Function definition. An MPI_Comm can be created directly from an MPI_Group, which is convenient.

int MPI_Comm_create_group(MPI_Comm comm, MPI_Group group, int tag, MPI_Comm * newcomm)

One more function is needed, MPI_Group_incl: it selects the processes with the given ranks out of an existing group to form a new group.

int MPI_Group_incl(MPI_Group group, int n, const int ranks[], MPI_Group * newgroup)

The example builds a group of the prime-numbered ranks (sketch below).
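A sketch of that prime-rank group (not the tutorial's exact code; the is_prime helper and the 64-rank cap are my own simplifications):

#include <mpi.h>
#include <stdio.h>

// Trial-division primality test (helper for this sketch only).
static int is_prime(int n) {
  if (n < 2) return 0;
  for (int d = 2; d * d <= n; d++)
    if (n % d == 0) return 0;
  return 1;
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int world_rank, world_size;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // Collect the prime ranks (assumes at most 64 ranks).
  int primes[64], nprimes = 0;
  for (int r = 0; r < world_size && nprimes < 64; r++)
    if (is_prime(r)) primes[nprimes++] = r;

  MPI_Group world_group, prime_group;
  MPI_Comm_group(MPI_COMM_WORLD, &world_group);
  MPI_Group_incl(world_group, nprimes, primes, &prime_group);

  // Ranks outside the group get MPI_COMM_NULL back.
  MPI_Comm prime_comm;
  MPI_Comm_create_group(MPI_COMM_WORLD, prime_group, 0, &prime_comm);

  if (prime_comm != MPI_COMM_NULL) {
    int prime_rank;
    MPI_Comm_rank(prime_comm, &prime_rank);
    printf("world rank %d is rank %d in the prime communicator\n", world_rank, prime_rank);
    MPI_Comm_free(&prime_comm);
  }
  MPI_Group_free(&world_group);
  MPI_Group_free(&prime_group);
  MPI_Finalize();
  return 0;
}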

jasperzhong commented 3 years ago

I missed an important topic: mpirun.

mpirun

The typical usage is:

mpirun -np <number of processes> <program name and arguments> 

According to the documentation, both -n and -np work.

Besides that, environment variables can be passed. There are two kinds: global env options (e.g. -genv), which apply to all processes, and local env options (e.g. -env), which apply only to the executable they are specified for.

Another commonly used feature is a host file (example below).

Hosts can also be listed directly with -hosts ...
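For example, with MPICH's Hydra launcher a host file can be passed with -f (the host names below are placeholders):

$ cat hosts
node1:8
node2:8

$ mpirun -f hosts -np 16 ./mpi_hello_world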

mpirun even supports checkpointing??

https://www.mpich.org/static/downloads/3.3.1/mpich-3.3.1-userguide.pdf I skimmed the user guide. It looks like the state of all processes on a node is saved into a single file.