The scikit-learn k-means implementation does not seem to store the metadata of the documents, but by using the fit_predict method it is possible to obtain the indices of the tweets (within the preloaded tweets list passed to the KMeans object) that belong to a given cluster.
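A minimal sketch of this index-based lookup, assuming a small in-memory `tweets` list of raw tweet texts and a TF-IDF vectorizer as the feature step (both are illustrative choices, not fixed by the note):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for the preloaded tweets list; any list of strings works.
tweets = [
    "rain in london today",
    "sunny weather in madrid",
    "heavy rain and wind in london",
    "madrid is sunny and warm",
]

# Vectorize, then cluster. fit_predict returns one cluster label per
# input document, in the same order as `tweets`.
X = TfidfVectorizer().fit_transform(tweets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Indices into `tweets` for each cluster. The position in the list is
# the only link back to the document metadata, since KMeans itself
# stores none of it.
cluster_indices = {c: np.where(labels == c)[0].tolist() for c in set(labels)}
```

With these indices one can look up the original tweet objects (text, author, timestamp) from whatever structure the tweets were preloaded into.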
When the dataset is large enough, this will cause memory issues. To avoid having to preload everything into memory, the database would need to support random access to multiple documents in one (or not too many) round trips. A normal DBMS does not seem to support that.
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.fit_predict