vincentfung13 / TwitterRepManagement

An application that allows you to monitor online reputation over Twitter.
MIT License

Improve memory usage of tweets retrieval from the database #3

Open vincentfung13 opened 8 years ago

vincentfung13 commented 8 years ago

The scikit-learn k-means implementation does not seem to store the metadata of the documents. However, the fit_predict method returns a cluster label for each input row, so it is possible to recover the indices of the tweets (in the preloaded tweet list that is passed to the KMeans object) that belong to a given cluster.
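As a rough sketch of that idea: the labels returned by fit_predict line up position by position with the input rows, so grouping positions by label gives the tweet indices per cluster. The helper names `indices_by_cluster` and `cluster_tweet_texts`, the TF-IDF vectorisation step, and the parameter values are assumptions for illustration, not code from this repository.

```python
from collections import defaultdict


def indices_by_cluster(labels):
    """Group row indices by cluster label.

    `labels` is the sequence returned by KMeans.fit_predict, where
    labels[i] is the cluster assigned to the i-th input row.
    """
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[int(label)].append(index)
    return dict(groups)


def cluster_tweet_texts(texts, n_clusters=2):
    """Hypothetical helper: vectorise tweet texts and cluster them."""
    # Imported inside the function so indices_by_cluster has no dependencies.
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectors = TfidfVectorizer().fit_transform(texts)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = model.fit_predict(vectors)
    return indices_by_cluster(labels)
```

Calling `cluster_tweet_texts(texts)` returns a dict that maps each cluster label to the positions of its tweets in `texts`. Only those positions (or the matching primary keys) need to be kept alongside the vectors, rather than the full tweet objects.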

When the dataset is large enough, this will cause memory issues. To avoid preloading everything into memory, the database needs to support random access to multiple documents in one go (or in a small number of queries). A normal DBMS does not seem to support that.
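One possible way to approximate this kind of access in a relational database is to fetch rows by primary key in batches, using a `WHERE id IN (...)` query. The sketch below uses the standard-library sqlite3 module. The `tweet` table, its columns, the helper name `fetch_tweets_by_ids`, and the batch size are all assumptions made up for illustration, not the schema used in this project.

```python
import sqlite3


def fetch_tweets_by_ids(conn, ids, batch_size=500):
    """Yield (id, text) rows for the given ids, one query per batch."""
    ids = list(ids)
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        # One "?" placeholder per id keeps the query parameterised.
        placeholders = ",".join("?" * len(batch))
        query = f"SELECT id, text FROM tweet WHERE id IN ({placeholders})"
        yield from conn.execute(query, batch)


# Hypothetical in-memory table standing in for the real tweet storage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweet (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO tweet (id, text) VALUES (?, ?)",
    [(i, f"tweet number {i}") for i in range(10)],
)

# Ids of the tweets in one cluster, fetched in batches of two.
cluster_ids = [7, 2, 5]
rows = sorted(fetch_tweets_by_ids(conn, cluster_ids, batch_size=2))
```

With Django, the analogous calls would probably be `filter(pk__in=...)` or `in_bulk()` on the tweet model's QuerySet. Whether this is fast enough for large clusters would need to be measured.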

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.fit_predict

vincentfung13 commented 8 years ago

Optimisation of database usage: https://docs.djangoproject.com/en/1.9/topics/db/optimization/

QuerySet API documentation: https://docs.djangoproject.com/en/1.9/ref/models/querysets/