Closed by smaniu 6 years ago
Some benchmarks on the Cover Type dataset (54 features, sparse), for k=10 clusters, using 1000 microclusters and an initial buffer of 10,000. The generator was the socket streamer.
First experiment: vary the batch window, keeping the generator at a theoretical 1000 instances/sec, which is around 500-600/sec in practice. Results:
| window | batch time | max delay | k-means time |
|---|---|---|---|
| 1s | 4s | 1h | 2s |
| 5s | 16s | 37m | 2s |
| 10s | 43s | 1.5h | 2s |
| 30s | 2.3m | 2h | 2s |
| 60s | 3.5m | 1.8h | 2s |
Increasing the window does not help: a bigger window means more data per batch and a proportionally higher processing time, so the delay persists. This suggests the batch time is dominated by the data size.
Second experiment: vary the throughput of the socket, keeping the window at 10 seconds. Results:
| throughput | batch time | max delay | k-means time |
|---|---|---|---|
| 100/sec | 9s | 0 | 1s |
| 250/sec | 19s | 32m | 2s |
| 500/sec | 27s | 46m | 2s |
| 1000/sec | 43s | 1.5h | 2s |
Overall, it seems that the k-means step is not such a big factor; the bottleneck is more likely in the update of the microclusters (most likely, the cluster-assignment step).
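To illustrate why the microcluster assignment can dominate, here is a minimal sketch (not the actual streamDM code; the figures below are assumptions based on the numbers in this thread): assigning each incoming point to its nearest microcluster is a brute-force search over all microclusters, so the per-batch cost grows as points × microclusters × features.

```python
def nearest_microcluster(point, centers):
    """Brute-force nearest-centroid search: O(m * d) per point."""
    best, best_dist = -1, float("inf")
    for i, c in enumerate(centers):
        dist = sum((p - q) ** 2 for p, q in zip(point, c))
        if dist < best_dist:
            best, best_dist = i, dist
    return best

# Hypothetical batch at ~500/sec with a 10s window, 1000 microclusters,
# 54 features: n * m * d distance terms per batch.
n, m, d = 5000, 1000, 54
print(n * m * d)  # prints 270000000
```

With numbers in that range, the assignment step alone does hundreds of millions of distance computations per batch, which is consistent with the batch time scaling with data size rather than with the window length.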
Currently, CluStream runs a k-means over the microclusters to generate the final clusters. In the current implementation, this is done on every batch, in the `train` loop. This is normal if we want to evaluate the clusters generated by CluStream. However, since k-means is set by default to iterate 1000 times, this generates a scheduling delay in the processing of batches.
To give an example, for a synthetic dataset of 3 dense features, a socket stream sending ~600 records/sec, and a window of 10 seconds:
My solution would be to process k-means "lazily"; that is, to only run it when we need an assignment to evaluate, in `assign`. This makes sense for two reasons: the macro-clusters are only needed when an assignment is actually evaluated, and it keeps the expensive k-means out of the `train` loop (removing the scheduling delay from batch processing). I suspect the same observation might hold for StreamKM++, although I'm not sure.