mims-harvard / nimfa

Nimfa: Nonnegative matrix factorization in Python
http://ai.stanford.edu/~marinka/nimfa/

Which method is a good balance between speed and accuracy? #5

Closed · conradlee closed this 12 years ago

conradlee commented 12 years ago

This is not the correct place for this question, but as this project has no mailing list, I don't know where else to post it.

I need to factor some rather large matrices (300,000 examples and 30,000 features). I am wondering which method will be able to do so in a reasonable amount of time.

As the author of this library, you must be familiar with a wide range of NMF methods. In your experience, which of your implementations is quick while at the same time not making big sacrifices in terms of quality? I know that the answer will depend on the dataset and on how I define "quality," but do you have any favorite methods for larger matrices?

Best regards,

Conrad

marinkaz commented 12 years ago

Hi Conrad,

thanks for the question. The main concern when using the MF library to factorize matrices of the shape you specified is the Python language itself. In my experience, an efficient implementation of these algorithms in C or C++ would run in a reasonable amount of time (measured in minutes per iteration). Unless the target matrix is sparse, there will also be issues with the initialization of the numpy matrices used internally in the library. Here and here is some unofficial experience people have reported with large matrices in numpy.
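To put the memory concern in concrete numbers, here is a back-of-the-envelope sketch (not from the original thread; the 0.1% density figure is an illustrative assumption) of why a dense numpy array at Conrad's scale cannot fit in RAM, while a sparse representation easily can:

```python
# Dense float64 storage for a 300,000 x 30,000 matrix.
n_rows, n_cols = 300_000, 30_000
dense_bytes = n_rows * n_cols * 8              # 8 bytes per float64 entry
print(f"dense:  {dense_bytes / 1e9:.0f} GB")   # ~72 GB -- far beyond typical RAM

# CSR sparse storage scales with the number of nonzeros instead.
# Assuming (hypothetically) 0.1% of entries are nonzero:
nnz = int(n_rows * n_cols * 0.001)
csr_bytes = nnz * (8 + 4) + (n_rows + 1) * 4   # values + int32 col indices + row pointers
print(f"sparse: {csr_bytes / 1e9:.2f} GB")     # ~0.11 GB
```

At roughly 72 GB, even allocating the dense matrix would fail on most machines, which is why a sparse target matrix matters so much here.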

For example, in this year's KDD Cup (the task was music recommendation) I factorized a matrix of 1e6 rows by 6e5 columns. Each iteration of a factorization with complexity similar to NMF, LSNMF, SNMF, or PMF took approx. 2 min on an average laptop (!); using a grid was even faster. However, that code was C++, and some special measures were taken to limit space consumption so that the whole data set fit in RAM. Another group tried performing the factorization in Python: one iteration of the same algorithm, even with some implementation tricks, took approx. 30 min on a grid (!) while consuming circa 12 GB of RAM.

To conclude, I want to stress two concerns. First, there is the Python language itself. Despite being user friendly, it is usually more time and space consuming than a C or C++ implementation of the same algorithm. To approach the performance of C code, many tricks must be implemented, and these tricks are usually application specific. That is my second concern: since MF is a general library for NMF factorizations that tries to address various fields and applications, such tricks cannot be employed.

To answer your question: depending on your computer's performance, you may have to divide the data set, perform separate factorizations, and then join the results (this is possible in some applications; in this year's KDD Cup, results from separate factorizations were competitive with those from the united data set). The less computationally complex methods with a good precision-complexity tradeoff are NMF, PMF, BMF, NSNMF, SNMF, and LSNMF. Bayesian decompositions are more complex, as is PSMF (though in certain applications they give much better results). However, again, this is subjective experience (where published by the method's author, there is some information on complexity in the specific method's docs) and application specific.
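For completeness, a minimal sketch of running one of the cheaper methods on a sparse target, using the nimfa API as documented in the project README (this API postdates the exchange above; the toy matrix, rank, and iteration count are illustrative assumptions, not tuned values):

```python
import scipy.sparse as sp
import nimfa

# Toy sparse target standing in for the real data; nimfa accepts
# scipy.sparse matrices directly, avoiding the dense-initialization issue.
V = sp.rand(1000, 200, density=0.01, format='csr')

# LSNMF, one of the lower-complexity methods recommended above.
lsnmf = nimfa.Lsnmf(V, seed='random_vcol', rank=30, max_iter=20)
fit = lsnmf()

W = fit.basis()   # basis matrix, 1000 x 30
H = fit.coef()    # mixture matrix, 30 x 200
print('Rss: %5.4f' % fit.fit.rss())
```

The same call shape would apply to the full 300,000 × 30,000 data, though, as discussed above, per-iteration cost in pure Python will still be far higher than in a C++ implementation.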

Hope this is of some use.

Best,

Marinka