Closed by smaniu 6 years ago
Some benchmarks on the Cover Type dataset (54 features, sparse), for k=10 clusters, using 1000 microclusters and an initial buffer of 10,000. The generator was the socket streamer.
First experiment: vary the batch window, keeping the generator at a theoretical 1000 instances/sec, which is around 500-600/sec in practice. Results:
| window | batch time | max delay | k-means time |
|---|---|---|---|
| 1s | 4s | 1h | 2s |
| 5s | 16s | 37m | 2s |
| 10s | 43s | 1.5h | 2s |
| 30s | 2.3m | 2h | 2s |
| 60s | 3.5m | 1.8h | 2s |
Increasing the window does not help: a bigger window means more data per batch and a proportionally higher processing time, so the delay persists. This suggests the batch time is dominated by the data size.
Second experiment: vary the throughput of the socket, keeping the window at 10 seconds. Results:
| throughput | batch time | max delay | k-means time |
|---|---|---|---|
| 100/sec | 9s | 0 | 1s |
| 250/sec | 19s | 32m | 2s |
| 500/sec | 27s | 46m | 2s |
| 1000/sec | 43s | 1.5h | 2s |
Overall, it seems that the k-means step is not such a big factor; the bottleneck is more likely in the update of the microclusters (most likely, the cluster-assignment step).
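To illustrate why the microcluster assignment can dominate, here is a minimal sketch (not the actual streamDM code; the figures below are assumptions based on the numbers in this thread): assigning each incoming point to its nearest microcluster is a brute-force search over all microclusters, so the per-batch cost grows as points × microclusters × features.

```python
def nearest_microcluster(point, centers):
    """Brute-force nearest-centroid search: O(m * d) per point."""
    best, best_dist = -1, float("inf")
    for i, c in enumerate(centers):
        dist = sum((p - q) ** 2 for p, q in zip(point, c))
        if dist < best_dist:
            best, best_dist = i, dist
    return best

# Hypothetical batch at ~500/sec with a 10s window, 1000 microclusters,
# 54 features: n * m * d distance terms per batch.
n, m, d = 5000, 1000, 54
print(n * m * d)  # prints 270000000
```

With numbers in that range, the assignment step alone does hundreds of millions of distance computations per batch, which is consistent with the batch time scaling with data size rather than with the window length.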
Currently, CluStream runs a k-means over the microclusters to generate the final clusters. In the current implementation, this is done on every batch, in the `train` loop. This is normal if we want to evaluate the clusters generated by CluStream. However, since k-means is set by default to iterate 1000 times, this generates a scheduling delay in the processing of batches.
To give an example, for a synthetic dataset of 3 dense features, a socket stream sending ~600 records/sec, and a window of 10 seconds:
My solution would be to process k-means "lazily"; that is, to only run it when we need an assignment to evaluate, in `assign`. This makes sense for two reasons: the macro-clusters are only needed when an assignment is actually evaluated, and it keeps the expensive k-means out of the `train` loop (removing the scheduling delay from batch processing). I suspect the same observation might hold for StreamKM++, although I'm not sure.