Thank you for this great summary. Your evaluation doesn't include greedy selection, right? @mblondel any reason you didn't implement that?
What are other state-of-the-art implementations?
Also, I'd like to see how the methods perform on different datasets. I think sparse data is more common with NMF, but it would be good to have some results on dense data, too. Maybe MNIST, maybe audio? I'm not sure what good datasets are. There are two face datasets that are used in the papers. Can we use them? Or maybe lfw or olivetti?
Btw, it seems like #1348 stalled because no one did a "convincing" empirical comparison. I think #896 was not included yet because there was no time to do so. #2540 is missing a mini-batch version, I think (didn't look in detail, though).
Scattered thoughts:
We could try to pick up #2557 and replace the projected gradient solver with the one from this notebook (which resolves the regularization/sparseness awkwardness), to give the PG-NMF method a fair trial and see if CD consistently outperforms it (for tall & wide, dense & sparse data).
Could you explain the bit about projected gradient scaling worse when KL divergence loss is used?
I'm a bit worried that the different inits lead to such different results. I wonder if the solutions are all equally "good". Maybe we could stick a classifier after it?
I agree with @amueller we should bench this on text (20newsgroups) and faces or MNIST.
Ok, I am working on more benchmarking, including dense and large datasets.
Could you explain the bit about projected gradient scaling worse when KL divergence loss is used?
I read it in this paper, but it is not very detailed. Projected gradient's original paper does not deal with KL divergence since it is not well defined at 0. I did not get into the details, but as far as I have searched, I did not find other interesting papers discussing a projected gradient method with KL divergence.
It seems actually that the Yang et al paper says that projected gradient has problems with I-divergence (generalized KL?) but not with normalized KL, and (quoting from page 7)
By contrast, [KL-Armijo] using KL-divergence and projected gradient descent can find even better objectives. This is probably because monotonicity guaranteed by multiplicative updates may not imply the full KKT condition.
Is I-divergence really that slow to converge with gradient-based methods? Maybe we should try. @mblondel do you have any intuition?
This might also be relevant: http://www.jmlr.org/proceedings/papers/v28/kumar13b.pdf; they solve it exactly under certain conditions.
Also, using ADMM should work fairly well and is easy to implement. It should be much faster than projected gradient descent. via @bmcfee
Also, using ADMM should work fairly well and is easy to implement. It should be much faster than projected gradient descent. via @bmcfee
In my lab, we have had very bad experiences with ADMM: its convergence is sensitive to the rho parameter, and sometimes it does not converge (the acceptable values of rho typically depend on the input data).
You need to dynamically scale the rho. Maybe you just need an ADMM expert ;)
I have no idea, I'm just channeling @bmcfee and adding snarkiness btw.
You need to dynamically scale the rho. Maybe you just need an ADMM expert ;)
That didn't work for us, on two different problems (covariance estimation and linear models with analysis sparsity). We studied the corresponding Boyd paper closely and tried the various tricks. And I have heard some pretty famous machine learning and optimization researchers say the same thing as I do.
We almost got it to work: it worked 99% of the time. But the remaining 1% of the time does not give rise to a failure that is easily identified, so we couldn't trust it.
Well, tell @bmcfee that the experts hide their knowledge well enough to make sure that no one benefits from it :).
But rumors do say that setting rho in an ADMM can be a problem.
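For concreteness, here is a minimal sketch of what "dynamically scale the rho" usually means in practice: ADMM on the nonnegative least-squares subproblem with the residual-balancing heuristic from Boyd et al. The function name and all details are illustrative only, not taken from any PR mentioned here.

```python
import numpy as np

def admm_nnls(A, b, rho=1.0, n_iter=200):
    """Minimal sketch: ADMM for min_x 0.5 * ||A x - b||^2  s.t.  x >= 0,
    with the residual-balancing rho update from Boyd et al. (2011).
    Illustrative only, not code from any of the PRs discussed here."""
    n = A.shape[1]
    AtA, Atb = A.T @ A, A.T @ b
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)   # u is the scaled dual
    for _ in range(n_iter):
        # x-update: quadratic subproblem, a ridge-like linear solve
        x = np.linalg.solve(AtA + rho * np.eye(n), Atb + rho * (z - u))
        z_old = z
        # z-update: projection onto the nonnegative orthant
        z = np.maximum(0.0, x + u)
        # dual update
        u = u + x - z
        # residual balancing: keep primal and dual residuals within ~10x
        r = np.linalg.norm(x - z)               # primal residual
        s = rho * np.linalg.norm(z - z_old)     # dual residual
        if r > 10 * s:
            rho *= 2.0
            u /= 2.0    # rescale the scaled dual variable when rho changes
        elif s > 10 * r:
            rho /= 2.0
            u *= 2.0
    return z
```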
Did you use my implementation for CD? Ideally it needs to be Cythonified to obtain the best performance. Currently, it uses Numba.
Did you use the same initialization for each method? I am surprised that CD seems to get stuck in a worse local optimum than other methods. Or maybe this is just the scale of your plot. Could you share your script?
Are the regularization parameters the same? Our projected gradient implementation has a different interface. BTW, we also need to decide whether we want to simplify the interface (like in my implementation).
Use the entire news20 dataset. As the dataset grows, I would expect CD to win.
ADMM is nice for complex regularizers. Non-negativity constraints is pretty easy to handle without resorting to ADMM.
My colleague recommends you have a look at:
http://arxiv.org/pdf/1401.5226v2.pdf
It contains benchmarks of NMF solvers on various problems.
More benchmarking and more details
I will change the regularization of PG-NMF, using this notebook, and do more benchmarking with regularization.
[Benchmark plots: 20 newsgroups and faces datasets]
It's nice to see that Numba is on par with Cython. The Cython version will be useful in any case since we cannot depend on Numba yet.
Yeah, that looks pretty good. That is all for MSE, right? Do we also want KL?
Any idea why there isn't a more obvious difference in the loss attained by PG and CD? Based on these results I would just keep the CD solver. I would really like this to be the case; it would make everything so much simpler.
Adding regularization might give us more insight. Another dimension not varied across these plots is n_components.
API questions about regularization:
- l1_reg and l2_reg, or elastic net style alpha, l1_ratio?
- l*_reg_components / l*_reg_transformation (better name ideas?)

EDIT: or, easier to use and gridsearch but less expressive: alpha, l1_ratio, regularization={'tied', 'components', 'transformation'}
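To make these concrete, here are two hypothetical signatures, one fully expressive with per-factor l1/l2 parameters and one matching the EDIT; neither is an existing scikit-learn API, the names are placeholders for the discussion.

```python
# Hypothetical signatures, just to make the options above concrete.

# Option 1: explicit per-penalty, per-factor parameters (most expressive)
def nmf_option_1(X, n_components,
                 l1_reg_components=0.0, l2_reg_components=0.0,
                 l1_reg_transformation=0.0, l2_reg_transformation=0.0):
    ...

# Option 2 (the EDIT above): elastic-net style, easier to grid-search
def nmf_option_2(X, n_components, alpha=0.0, l1_ratio=0.0,
                 regularization='tied'):  # or 'components' / 'transformation'
    ...
```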
With KL we can reach audio people
I wouldn't mind KL (I-divergence?) for text either, since it corresponds to pLSI. But frobenius seems to lead to good topics too.
There is at least one state-of-the-art NMF algorithm that hasn't been discussed yet: Kim and Park's Block Principal Pivoting method. https://3e4b2411-a-62cb3a1a-s-sites.googlegroups.com/site/jingukim/2011_paper_sisc_nmf.pdf
This is a block coordinate descent approach for solving the alternating nonnegative least squares (ANLS) subproblems under the Frobenius norm. It basically speeds up the old active-set methods. It is straightforward to implement, and in their paper Kim and Park demonstrate, on a variety of datasets (faces and text), that it performs better, both in convergence time and relative reconstruction error, than the MU and PG algorithms already mentioned in this discussion, and about the same as the CD method already mentioned. Interestingly, when used to solve the ANLS subproblems in nonnegative tensor factorization, this algorithm performs better than the CD method (see https://3e4b2411-a-62cb3a1a-s-sites.googlegroups.com/site/jingukim/2011_paper_hpscbook_ntf.pdf).
Python code for this algorithm already exists, although not in the scikit-learn API style: https://github.com/kimjingu/nonnegfac-python.
Definitely worth considering.
Interesting. As scikit-learn doesn't deal with tensors, that sounds like further support of the CD solver :)
Seems to me it argues for BPP, since the two perform similarly on NMF, and when scikit-learn adds tensor support, the state-of-the-art ANLS solver for NTF will already be implemented. However, to be fair, it looks to me like the comparison done in the paper is to an unaccelerated version of the proposed CD solver (see http://download.springer.com/static/pdf/974/art%253A10.1007%252Fs10898-013-0035-4.pdf). So for a proper comparison it would need to be added to @TomDLT's tests.
when scikit-learn adds tensor support
There is currently no plan for that to ever happen. That's what I meant: no reason to go for the fancier, newer method if we'll never even need its benefits.
I would just keep the CD solver. It would make everything so much simpler.
This is also my feeling. Having too many solvers makes refactoring harder. For instance, adding proper fit_intercept support in Ridge is harder because we have too many solvers to change. The CD code is also pretty simple and has a good stopping criterion.
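For readers who haven't looked at it, here is a minimal sketch of the element-wise coordinate descent idea for the squared loss (one sweep over H with W fixed). This is a simplification for illustration, not the implementation benchmarked above.

```python
import numpy as np

def cd_update_H(X, W, H, eps=1e-12):
    """One sweep of coordinate descent on H for 0.5 * ||X - W H||_F^2, H >= 0.
    Simplified sketch; the real solver also handles sparse input,
    regularization and a proper stopping criterion."""
    WtW = W.T @ W          # (k, k), precomputed once per sweep
    WtX = W.T @ X          # (k, n_features)
    k, n_features = H.shape
    for j in range(n_features):
        for t in range(k):
            # gradient of the objective w.r.t. the single coordinate H[t, j]
            grad = WtW[t] @ H[:, j] - WtX[t, j]
            # exact coordinate minimizer, projected onto [0, +inf)
            H[t, j] = max(0.0, H[t, j] - grad / (WtW[t, t] + eps))
    return H
```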
Do we also want KL?
We definitely do, but let's do it one PR at a time. For now let's focus on the squared loss and the API :) BTW, it's called generalized KL divergence or I-divergence. The KL divergence is different; many people (even in papers) confuse the two. Don't ask me why it's called generalized ;) Maybe @TomDLT can volunteer to write a CD solver later :)
API questions about regularization
Maybe elastic net style for consistency with other modules? However, even L2 regularization alone will lead to sparse solutions in practice. Regarding factor-specific regularization, this sounds complicated. Maybe we can keep the class API simple and expose a function with more options?
Thanks @mblondel! Just to be sure: do you have some intuition about whether there is any reason CD shouldn't work with the I-divergence? The paper cited by @TomDLT earlier claims that PG wouldn't work (well), but the claim doesn't seem well backed.
Could you quote the relevant sentence? One possible reason is that the I-divergence doesn't have a Lipschitz continuous first derivative (i.e. its second derivative is not bounded above). Most gradient-based solvers assume a Lipschitz continuous gradient in their convergence proof.
However, for CD, the authors were still able to prove convergence to a stationary point: http://www.cs.utexas.edu/%7Ecjhsieh/nmf_kdd11.pdf
Although a number of projected gradient algorithms [...] have been proposed for NMF based on least squared errors, the speedup advantage is more difficult to obtain for the approximation based on the I-divergence. The major reason is that the gradients of I-NMF [...] heavily depend on the scaling of W and H. [...] The badly scaled gradients also make the second order optimization methods such as Newton or quasi-Newton algorithms ill-posed and even fail to converge, as shown in Section 4
From here. But the results in section 4 don't seem to me to really support the very strong language used above.
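For what it's worth, my own gloss on the scaling claim (not from the paper): with

$$D_I(X \,\|\, WH) = \sum_{ij} \Big( X_{ij} \log \frac{X_{ij}}{(WH)_{ij}} - X_{ij} + (WH)_{ij} \Big),$$

the gradients are

$$\nabla_H D_I = W^\top \big( \mathbf{1} - X \oslash (WH) \big), \qquad \nabla_W D_I = \big( \mathbf{1} - X \oslash (WH) \big) H^\top,$$

so the rescaling $(W, H) \to (cW, H/c)$ leaves the objective unchanged but multiplies $\nabla_H$ by $c$ and divides $\nabla_W$ by $c$, which is presumably the "badly scaled gradients" issue the authors have in mind.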
So am I reading this right, and what the kdd'11 paper calls KL-divergence (equation 3) is actually Generalized KL-divergence or I-divergence, right?
Side note: both of those are horrible names. What does the I in I-divergence stand for?
This claims that (paraphrase) "it's generalized in the sense that KL-Div(a, b) = I-Div(a, b) if a and b are probability distributions." I-Div is also a Bregman divergence.
what the kdd'11 paper calls KL-divergence (equation 3) is actually Generalized KL-divergence or I-divergence, right?
Yes
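For reference, spelling out the usual definitions (my summary, matching the paraphrase above):

$$D_{\mathrm{KL}}(a \,\|\, b) = \sum_i a_i \log \frac{a_i}{b_i}, \qquad D_I(a \,\|\, b) = \sum_i \Big( a_i \log \frac{a_i}{b_i} - a_i + b_i \Big),$$

and the two coincide when $\sum_i a_i = \sum_i b_i = 1$, since the extra terms cancel.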
After talking with @agramfort and various audio colleagues, it appears that:
I suggest implementing both the MM method (to easily support different losses) and the CD method (which, as we have seen in the benchmarks, is a very efficient method for the Frobenius norm).
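As a reminder of what the MM/MU updates look like in the simplest case, here is a minimal sketch of the classic Lee & Seung multiplicative updates for the Frobenius loss; illustration only, not code from any PR.

```python
import numpy as np

def mu_nmf(X, n_components, n_iter=200, eps=1e-10, random_state=0):
    """Minimal sketch of multiplicative updates (Lee & Seung) for
    0.5 * ||X - W H||_F^2; no regularization, no stopping criterion."""
    rng = np.random.RandomState(random_state)
    n_samples, n_features = X.shape
    W = rng.rand(n_samples, n_components)
    H = rng.rand(n_components, n_features)
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H)
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        # W <- W * (X H^T) / (W H H^T)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```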
In terms of API, @vene and @mblondel suggested elastic net style (with alpha and l1_ratio), with the same regularization for W and H in the class. The class will call a function with more options to regularize either W, H, or both.
Should we have several classes (MultiplicativeNMF, ProjectedGradientNMF, ...), or one NMF class and several solvers?
How should we deal with the current solver? With the current regularization trick?
I vote for one NMF class, different solvers. Deprecate the current PG solver and the parameters associated to it? It would make for a lot of parameter duplication for the next releases but it shouldn't be that big a deal, right?
Should we have several classes (MultiplicativeNMF, ProjectedGradientNMF, ...), or one NMF class and several solvers?
I think we are avoiding naming classes after solvers. One NMF class that delegates to different solver functions might work.
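A hedged sketch of what that could look like; all names and signatures are hypothetical, not from an actual PR.

```python
import numpy as np

def _fit_coordinate_descent(X, W, H, alpha=0.0, l1_ratio=0.0):
    # placeholder: the CD solver, with the full set of options, would go here
    return W, H

def _fit_multiplicative_update(X, W, H, alpha=0.0, l1_ratio=0.0):
    # placeholder: the MU/MM solver would go here
    return W, H

class NMF:
    """One estimator class, several solver functions (hypothetical sketch)."""

    def __init__(self, n_components=2, solver='cd', alpha=0.0, l1_ratio=0.0,
                 random_state=0):
        self.n_components = n_components
        self.solver = solver
        self.alpha = alpha
        self.l1_ratio = l1_ratio
        self.random_state = random_state

    def fit_transform(self, X):
        rng = np.random.RandomState(self.random_state)
        W = rng.rand(X.shape[0], self.n_components)
        H = rng.rand(self.n_components, X.shape[1])
        # the class keeps a small parameter surface and delegates to a solver
        # function, which is where the more expressive options would live
        solvers = {'cd': _fit_coordinate_descent,
                   'mu': _fit_multiplicative_update}
        W, self.components_ = solvers[self.solver](
            X, W, H, alpha=self.alpha, l1_ratio=self.l1_ratio)
        return W
```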
How should we deal with the current solver? With the current regularization trick?
Sorry, I'm not exactly sure what you mean by this. If we implement CD there's no reason to keep PG unless benchmarks back it up in certain settings.
It would make for a lot of parameter duplication
Alternatively we could name the new class NonnegativeFactorization and deprecate NMF.
Does the maximization-minimization method support the elastic-net penalty? Eqs. 73-74 give a solution for the L1 penalty. If not, we could restrict the API to penalizing just with L1, and support the fully-expressive non-tied elastic net formulation in a cd_nmf function. @mblondel noted that L2 regularization + non-negativity constraints will lead to sparse solutions anyway; maybe the L2 penalty converges faster?
There is also: The Diagonalized Newton Algorithm for Nonnegative Matrix Factorization by Hugo Van hamme (http://arxiv.org/abs/1301.3389) mentioned by @vene.
In my comment I claimed that this algorithm was shown to outperform CD. It doesn't look like that is accurate; the comparisons are only against multiplicative updates. No idea why I said that. Maybe I thought CD meant something else? Sorry!
Should we close this since #4852 was merged? Or keep it open to discuss adding more losses?
Based on this discussion, I have worked on adding a multiplicative-update solver for NMF (more precisely Maximization-Minimization) in PR #5295. It handles all beta-divergences, including Frobenius norm, generalized Kullback-Leibler divergence and Itakura-Saito divergence.
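For reference, the beta-divergence family in question (standard element-wise definition, as in Févotte & Idier):

$$
d_\beta(x \,\|\, y) =
\begin{cases}
\dfrac{x^\beta + (\beta - 1)\, y^\beta - \beta\, x\, y^{\beta - 1}}{\beta(\beta - 1)}, & \beta \neq 0, 1,\\[2ex]
x \log \dfrac{x}{y} - x + y, & \beta = 1 \ \text{(generalized KL)},\\[2ex]
\dfrac{x}{y} - \log \dfrac{x}{y} - 1, & \beta = 0 \ \text{(Itakura-Saito)},
\end{cases}
$$

with $\beta = 2$ recovering half the squared error, $\tfrac{1}{2}(x - y)^2$.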
Closing this since #4852 and #5295 are merged.
NMF in scikit-learn overview:
Current implementation (code):
#1348 (or in gist):
#2540 [WIP]:
#896 (or in gist):
Papers describing the methods:
About the uniqueness of the results
The problem is non-convex, and there is no unique minimum: different losses, different initializations, and/or different optimization methods generally give different results!
About the methods
About the initialization
Different schemes exist, and can significantly change both the result and the speed. They can be used independently with each NMF method.
About the stopping condition
The current stopping condition in PG-NMF is buggy (#2557), and leads to poor minima when the tolerance is not low enough, especially with the random initialization scheme. It is also completely different from the stopping condition in MU-NMF, which is very difficult to set. Talking with audio scientists (who use MU-NMF a lot for source separation) reveals that they just set a fixed number of iterations.
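Purely for illustration (this is not any of the criteria implemented in the PRs above), a relative-tolerance check on the objective is the usual alternative to a fixed iteration budget; `update` and `loss` below are placeholders for any solver's inner step and objective.

```python
def fit_with_tolerance(X, W, H, update, loss, max_iter=200, tol=1e-4):
    """Illustrative stopping rule: stop when the objective's decrease between
    iterations becomes small relative to the initial error. Setting a plain
    fixed budget amounts to dropping the check and always running max_iter."""
    error_at_init = loss(X, W, H)
    previous_error = error_at_init
    for _ in range(max_iter):
        W, H = update(X, W, H)
        error = loss(X, W, H)
        if previous_error - error < tol * error_at_init:
            break
        previous_error = error
    return W, H
```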
As far as I understand NMF, since there is no unique minimum, there is no perfect loss/method/initialization/regularization: a good choice for one dataset can be terrible for another. I don't know how many methods we want to maintain in scikit-learn, or how much we want to guide users by offering few possibilities, but several methods seem more useful than only one.
I have tested MU-NMF, PG-NMF and CD-NMF from the scikit-learn code, #2540 and #896, with squared loss and no regularization, on a subsample of the 20news dataset, and the performance is already very different depending on the initialization (see below).
Which methods do we want in scikit-learn? Why did we stop #1348 or #896? Do we want to continue #2540? I can work on it as soon as we have decided.
[Benchmark plots for each initialization: NNDSVD (similar curves to NNDSVDAR), NNDSVDA, Random run 1, Random run 2, Random run 3]