Implement different convergence conditions and compare them

polisquad / mpi-project

Implementation of the K-means algorithm with OpenMP and MPI

GNU General Public License v3.0

0 stars 1 forks source link

Implement different convergence conditions and compare them #2

Open kjossul opened 5 years ago

kjossul commented 5 years ago

Proposed:

A maximum number of iterations has been performed.
Fewer than a threshold percentage of data points are reassigned during an iteration.
All means migrate less than a threshold distance during an update cycle.

sneppy commented 5 years ago

Assigning data points at each iteration is time consuming and not necessary, thus we should implement a combination of 1 and 3.

Data points can be assigned at the end of the algorithm.

kjossul commented 5 years ago

Why do you say so? Producing assignments has same time complexity as updating the clusters. I think we are misunderstanding each other because when I speak about assignments I mean local assignments (as in the vectors that slaves return to master in MPI, as described here) and you mean global assignments.

sneppy commented 5 years ago

Tracking the membership of each data point either requires a vector to store the index of each data point or K vectors to directly store the data points assigned to each cluster. In addition, the membership updates must be broadcast to all other hosts, and when you have N hosts and 1 million data point which occupy 8 bytes each (the case of a 2d float vector) the master node must send 8 Mb of data to each node, at the end of each epoch. We calculate the membership to update the centroid of the cluster but we are not required to share it with other nodes since the algorithm works perfectly fine. Once the centroid of the clusters don't change anymore we can actually calculate the membership of each data point.

kjossul commented 5 years ago

the membership updates must be broadcast to all other hosts, and when you have N hosts and 1 million data point which occupy 8 bytes each (the case of a 2d float vector) the master node must send 8 Mb of data to each node, at the end of each epoch.

How about just an integer that says how many assignments have changed?