The scikit-learn k-means implementation does not seem to store the metadata of the documents, but by using the fit_predict method it is possible to obtain the indices of the tweets (within the preloaded tweets list passed to the KMeans object) that belong to a given cluster.
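A minimal sketch of this index-based lookup, assuming a small in-memory `tweets` list of raw tweet texts and a TF-IDF vectorizer as the feature step (both are illustrative choices, not fixed by the note):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for the preloaded tweets list; any list of strings works.
tweets = [
    "rain in london today",
    "sunny weather in madrid",
    "heavy rain and wind in london",
    "madrid is sunny and warm",
]

# Vectorize, then cluster. fit_predict returns one cluster label per
# input document, in the same order as `tweets`.
X = TfidfVectorizer().fit_transform(tweets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Indices into `tweets` for each cluster. The position in the list is
# the only link back to the document metadata, since KMeans itself
# stores none of it.
cluster_indices = {c: np.where(labels == c)[0].tolist() for c in set(labels)}
```

With these indices one can look up the original tweet objects (text, author, timestamp) from whatever structure the tweets were preloaded into.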
When the dataset is large enough, this will cause memory issues. To avoid having to preload everything into memory, the database would need to support random access to multiple documents in one (or not too many) round trips. A normal DBMS does not seem to support that.
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.fit_predict