huawei-noah / streamDM

Stream Data Mining Library for Spark Streaming
http://streamdm.noahlab.com.hk/
Apache License 2.0
492 stars 147 forks source link

Clustream k-means processing time #20

Closed smaniu closed 6 years ago

smaniu commented 9 years ago

Currently, CluStream does a k-means from the microclusters, to generate the final clusters. In the current implementation, this is done in every batch, in the train loop.

This is normal, if we want to evaluate the clusters which are generated by CluStream. However, since k-means is set by default to iterate 1000 times, this generates a scheduling delay in the processing of batches.

To give an example, for a synthetic dataset of 3 features, dense, a socket stream sending ~600 records/sec, and a window of 10 seconds:

My solution would be to process k-means "lazily"; that is, to only process it when we need an assignment to evaluate, in assign. This makes sense for two reasons:

I suspect the same observation might hold for StreamKM++, although I'm not sure.

smaniu commented 9 years ago

Some benchmarks on the Cover Type dataset (54 features, sparse), for k=10 clusters, using 1000 microclusters and an initial buffer of 10,000. The generator was the socket streamer.

First experiment: vary the batch window, keep the generator at 1000 instance/sec theoretical, which is around 500-600 /sec in practice. Results:

window batch time max delay k-means time
1s 4s 1h 2s
5s 16s 37m 2s
10s 43s 1.5h 2s
30s 2.3m 2h 2s
60s 3.5m 1.8h 2s

It can be seen that increasing the window does not help. Bigger window means higher processing time so the delay is still there. This suggests that the batch time is due to the data size.

Second experiment: vary the throughput in the socket, keep window at 10seconds. Results:

thoughput batch time max delay k-means time
100/sec 9s 0 1s
250/sec 19s 32m 2s
500/sec 27s 46m 2s
1000/sec 43s 1.5h 2s

After all, it seems that the k-means type is not such a big factor - it is likely that the update of microclusters has an issue (most likely, cluster assignment).