polisquad / mpi-project

Implementation of the K-means algorithm with OpenMP and MPI
GNU General Public License v3.0

Where to use OpenMP / MPI? #3

Open kjossul opened 5 years ago

kjossul commented 5 years ago

Proposed hybrid algorithm, inspired by this presentation:

perform pre-processing and set initial cluster points (see issue #1)
assign N/P data points to each machine
while not converged:
    broadcast current centroids to all machines
    on each machine: <-- MPI
         compute membership of each point <-- OpenMP
         broadcast membership vector
    compute new centroids based on membership vectors <-- OpenMP
sneppy commented 5 years ago

Timings using an implementation of the proposed algorithm, with numClusters = 24 and datasetSize = 512k:

| | MARK-VI (i7-4790k) | MARK-VIII (i7-8550u) | MPI merge |
|---|---|---|---|
| Debug / OpenMP off | 29.6 sec | 33.0 sec | 15.8 sec |
| Debug / OpenMP on | 5.4 sec | 7.0 sec | 3.5 sec |
| Release / OpenMP off | 2.1 sec | 8.0 sec | 1.5 sec |
| Release / OpenMP on | 0.6 sec | 1.4 sec | 0.84 sec |

With numClusters=16 and datasetSize=16M:

| | MARK-VI | MARK-VIII | MPI merge |
|---|---|---|---|
| Release / OpenMP on | 11.5 sec | 27.0 sec | 14.1 sec |

All tests ran for 100 epochs.

It would probably be best to run MPI on >4 machines with similar specs.

Results were verified with numClusters=24 and datasetSize=2k:

[plot of the resulting clusters]

lpraat commented 5 years ago

I think this is the best way to approach the problem:

(1) set initial cluster points
(2) assign data points equally to each machine
(3) while not converged:
(4)    broadcast current centroids to all machines
(5)    on each machine:
(6)        compute membership of each point <-- OpenMP
(7)        compute local centroids (local mean of each cluster)
           according to the local membership view <-- OpenMP
(8)    compute global centroids (sum the local means received
       from each machine for each cluster)

This way we speed up the second phase of the algorithm by running it on each machine; computing the global centroids for the next iteration is then just a matter of summing K*M terms, where K is the number of clusters and M is the number of machines.

[Phase (8) can also be done by each machine individually: in that case we broadcast the local means at each iteration after phase (7), and phase (4) is done just once at the start.]

sneppy commented 5 years ago

I'm writing an improved algorithm with better use of OpenMP. I also integrated MPI_Bcast in MPI::Device to broadcast messages to all other nodes (instead of MPI_Send in a loop).

On MARK-VI, with dataset = 512k points, numClusters = 24, and epochs = 100, it runs in ~0.52 seconds, slightly better than the 0.6 seconds of the previous code.

708d7f1091509fdaca52dd5303f46986ed9d1fcc