uwhpsc-2016 / lectures

Notes, slides, and code from the in-class lectures.

OpenMP vs. MPI vs. Hadoop (MapReduce) #27

anujsgithub opened 8 years ago

anujsgithub commented 8 years ago

Would you give some insight into which technology to use for which problems: OpenMP vs. MPI vs. Hadoop? It seems that Hadoop provides an automated way to distribute data across distributed memory, saving us the work of distributing the data ourselves as we do in MPI. Does MPI also provide a way to store your files so that they are chunked and stored in different memory blocks? Is it possible to use OpenMP and MPI in the same code (and how would we compile that)? Would it be beneficial to run different processes on different servers (assuming we have a cluster of servers) and then, within each server, use OpenMP to exploit multiple cores, thus saving the MPI overhead? Or would it be better to use different communicators within a single server? Can you actually define which processes run on which machines?

cswiercz commented 8 years ago

Would you give some insight into which technology to use for which problems: OpenMP vs. MPI vs. Hadoop?

First of all, OpenMP is exclusively for shared-memory environments. MPI and Hadoop were designed with distributed-memory environments in mind. (Though both also work on a single, shared-memory machine.)

MPI stands for "Message Passing Interface" and, in a way, that's really the only thing it has to offer. At the most basic level it provides MPI_Send() and MPI_Recv(). Many of the other MPI functions can be implemented in terms of these two, and I recommend trying that as an exercise.
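For instance, here is a minimal sketch (not from the lecture code) of what a basic send/receive looks like; run it with something like `mpiexec -n 2`:

```c
/* Minimal MPI_Send / MPI_Recv sketch: rank 0 sends one integer to rank 1.
   Compile with mpicc, run with at least two processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* send one int to rank 1 with tag 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive one int from rank 0 with tag 0 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```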

Hadoop was designed specifically with data in mind, in particular the MapReduce operation and distributed file storage / management. That is a much more specific operation than the generic "send a message from one machine to another" and can therefore be optimized. One key difference from MPI is that the Hadoop developers took a lot of time to make sure their operations are fault-tolerant. MPI's approach leans more towards the "trust that the developer knows what they're doing" end of things.

(I don't have much personal experience with Hadoop so this is perhaps where I should stop.)

It seems that Hadoop provides an automated way to distribute data across distributed memory, saving us the work of distributing the data ourselves as we do in MPI.

Right: Hadoop's specialty is data, and it was designed with that in mind. MPI is a general-purpose message-passing system. You can definitely implement something like MapReduce in MPI, but since Hadoop focuses on that operation it will most likely have better performance than whatever we might cook up in an afternoon.
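For example, a rough sketch of the map-reduce idea in plain MPI might look like the following: each rank "maps" over its own slice of the work and MPI_Reduce combines the partial results. No fault tolerance whatsoever, which is exactly the part Hadoop handles for you.

```c
/* Rough map-reduce-style sketch in MPI: each rank computes a partial sum
   over its own slice of the index range, then MPI_Reduce combines them. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* "map": each rank works on every size-th index starting at its rank */
    long n = 1000000, local_sum = 0;
    for (long i = rank; i < n; i += size)
        local_sum += i;

    /* "reduce": combine the partial sums on rank 0 */
    long total = 0;
    MPI_Reduce(&local_sum, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %ld\n", total);

    MPI_Finalize();
    return 0;
}
```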

Still, after this class is over I might want to spend some time trying that out!

Does MPI also provide a way to store your files so that they are chunked and stored in different memory blocks?

I'm unaware of any "built-in" MPI file management tools, so I believe such things need to be done by the developer. Each message involves a contiguous data buffer, though multiple disjoint buffers can be sent. Each process would have to manage some sort of file buffer / pointer and periodically perform reads/writes to disk.

There's a whole world of optimization considerations here, since disk access is roughly 1000× slower than RAM access.
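A hand-rolled sketch of that idea, where each rank simply writes its own chunk of data to its own file (the `chunk_<rank>.dat` naming is made up for illustration), might look like:

```c
/* Per-process file management sketch: no built-in MPI file tools, each rank
   owns a contiguous buffer and flushes it to its own file on disk. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* each rank fills its own contiguous chunk of data */
    double chunk[1024];
    for (int i = 0; i < 1024; ++i)
        chunk[i] = rank + i * 1e-3;

    /* write this rank's chunk to its own file (illustrative name) */
    char fname[64];
    snprintf(fname, sizeof fname, "chunk_%d.dat", rank);
    FILE *fp = fopen(fname, "wb");
    fwrite(chunk, sizeof(double), 1024, fp);
    fclose(fp);

    MPI_Finalize();
    return 0;
}
```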

Is it possible to use OpenMP and MPI in the same code (and how would we compile that)? Would it be beneficial to run different processes on different servers (assuming we have a cluster of servers) and then, within each server, use OpenMP to exploit multiple cores, thus saving the MPI overhead? Or would it be better to use different communicators within a single server? Can you actually define which processes run on which machines?

Yes!

Compilation: I haven't actually tried it myself yet, but I think you just need to include the OpenMP header and compile via

$ mpicc -fopenmp [args]

mpicc passes many of its flags and compiler directives to gcc / clang, anyway.

The overhead of creating and communicating between processes is larger than that of threads. If each MPI process runs in a single shared-memory environment (e.g. each process runs on a different machine on the network and each machine has 16 cores, say), then that MPI process can spawn multiple threads via OpenMP.

Some care needs to be taken about which thread is allowed to execute MPI communication calls. Typically, one wants to separate the multi-threaded code from the communication code and have only the master thread (pre-FORK or post-JOIN) perform things like MPI_Send().
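A minimal hybrid sketch along those lines (again, I haven't run this on a real cluster) might look like the following: the OpenMP threads do the local work, and only the post-JOIN master flow of control touches MPI, which is what MPI_THREAD_FUNNELED promises.

```c
/* Hybrid MPI + OpenMP sketch. Compile with: mpicc -fopenmp hybrid.c */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, provided;

    /* request FUNNELED support: only the master thread will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* multi-threaded section (FORK ... JOIN): each MPI process spawns
       OpenMP threads to compute a local partial sum */
    long local_sum = 0;
    #pragma omp parallel for reduction(+:local_sum)
    for (long i = 0; i < 1000000; ++i)
        local_sum += i % 7;

    /* communication section: after the JOIN only the master thread remains,
       and it is the one that performs the MPI call */
    long total = 0;
    MPI_Reduce(&local_sum, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("up to %d threads per process, total = %ld\n",
               omp_get_max_threads(), total);

    MPI_Finalize();
    return 0;
}
```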

I can't think of a reason why you would want to use MPI instead of OpenMP on a single machine, since most single machines are shared-memory environments. The code we're writing in class is mostly for demonstration purposes, since (a) there is additional setup involved in communicating with multiple machines and (b) not everyone has access to a network cluster.

Here is a talk I found on the OpenMP website about the topic: http://openmp.org/sc13/HybridPP_Slides.pdf

anujsgithub commented 8 years ago

Thanks!