mims-harvard / nimfa

Nimfa: Nonnegative matrix factorization in Python
http://ai.stanford.edu/~marinka/nimfa/

Which method is a good balance between speed and accuracy? #5

Closed · conradlee closed this 12 years ago

conradlee commented 12 years ago

This is not the correct place for this question, but as this project has no mailing list, I don't know where else to post it.

I need to factor some rather large matrices (300,000 examples and 30,000 features). I am wondering which method will be able to do so in a reasonable amount of time.

As the author of this library, you must be familiar with a wide range of NMF methods. In your experience, which of your implementations is quick while at the same time not making big sacrifices in terms of quality? I know that the answer will depend on the dataset and on how I define "quality," but do you have any favorite methods for larger matrices?

Best regards,

Conrad

marinkaz commented 12 years ago

Hi Conrad,

thanks for the question. The main concern when using the MF library to factorize matrices of the shape you specified is the Python language itself. In my experience, an efficient implementation of these algorithms in C or C++ would run in a reasonable amount of time (measured in minutes per iteration). Unless the target matrix is sparse, there will also be issues with the initialization of the numpy matrices used internally in the library. Here and here is some unofficial experience people have reported with large matrices in numpy.
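To put the memory concern in concrete numbers, here is a back-of-the-envelope sketch (not from the original thread; the 0.1% density figure is an illustrative assumption) of why a dense numpy array at Conrad's scale cannot fit in RAM, while a sparse representation easily can:

```python
# Dense float64 storage for a 300,000 x 30,000 matrix.
n_rows, n_cols = 300_000, 30_000
dense_bytes = n_rows * n_cols * 8              # 8 bytes per float64 entry
print(f"dense:  {dense_bytes / 1e9:.0f} GB")   # ~72 GB -- far beyond typical RAM

# CSR sparse storage scales with the number of nonzeros instead.
# Assuming (hypothetically) 0.1% of entries are nonzero:
nnz = int(n_rows * n_cols * 0.001)
csr_bytes = nnz * (8 + 4) + (n_rows + 1) * 4   # values + int32 col indices + row pointers
print(f"sparse: {csr_bytes / 1e9:.2f} GB")     # ~0.11 GB
```

At roughly 72 GB, even allocating the dense matrix would fail on most machines, which is why a sparse target matrix matters so much here.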

For example, in this year's KDD Cup (the task was music recommendation) I factorized a matrix of 1e6 rows by 6e5 columns. Each iteration of a factorization with complexity similar to NMF, LSNMF, SNMF, or PMF took approx. 2 min on an average laptop (!); using a grid was even faster. However, that code was C++, and some special measures were taken to limit space consumption so that the whole data set fit in RAM. Another group tried performing the factorization in Python: one iteration of the same algorithm, even with some implementation tricks, took approx. 30 min on a grid (!) while consuming circa 12 GB of RAM.

To conclude, I want to stress two concerns. First, there is the Python language itself. Despite being user friendly, it is usually more time and space consuming than a C or C++ implementation of the same algorithm. To approach the performance of C code, many tricks must be implemented, and these tricks are usually application specific. That is my second concern: since MF is a general library for NMF factorizations that tries to address various fields and applications, such tricks cannot be employed.

To answer your question: depending on your computer's performance, you may have to divide the data set, perform separate factorizations, and then join the results (this is possible in some applications; in this year's KDD Cup, results from separate factorizations were competitive with those from the united data set). The less computationally complex methods with a good precision-complexity tradeoff are NMF, PMF, BMF, NSNMF, SNMF, and LSNMF. Bayesian decompositions are more complex, as is PSMF (though in certain applications they give much better results). However, again, this is subjective experience (where published by the method's author, there is some information on complexity in the specific method's docs) and application specific.
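For completeness, a minimal sketch of running one of the cheaper methods on a sparse target, using the nimfa API as documented in the project README (this API postdates the exchange above; the toy matrix, rank, and iteration count are illustrative assumptions, not tuned values):

```python
import scipy.sparse as sp
import nimfa

# Toy sparse target standing in for the real data; nimfa accepts
# scipy.sparse matrices directly, avoiding the dense-initialization issue.
V = sp.rand(1000, 200, density=0.01, format='csr')

# LSNMF, one of the lower-complexity methods recommended above.
lsnmf = nimfa.Lsnmf(V, seed='random_vcol', rank=30, max_iter=20)
fit = lsnmf()

W = fit.basis()   # basis matrix, 1000 x 30
H = fit.coef()    # mixture matrix, 30 x 200
print('Rss: %5.4f' % fit.fit.rss())
```

The same call shape would apply to the full 300,000 × 30,000 data, though, as discussed above, per-iteration cost in pure Python will still be far higher than in a C++ implementation.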

Hope this is of some use.

Best,

Marinka