Discussion about adding NMF methods

TomDLT commented 9 years ago

NMF in scikit-learn overview :

Current implementation (code):

loss = squared (aka Frobenius norm)
method = projected gradient
regularization = trick to enforce sparseness or low error with beta / eta

#1348 (or in gist):

loss = (generalized) Kullback-Leibler divergence (aka I-divergence)
method = multiplicative update
regularization = None

#2540 [WIP]:

loss = squared (aka Frobenius norm)
method = multiplicative update
regularization = None

#896 (or in gist):

loss = squared (aka Frobenius norm)
method = coordinate descent, no greedy selection
regularization = L1 or L2

Papers describing the methods:

About the uniqueness of the results The problem is non-convex, and there is no unique minimum: Different losses, different initializations, and/or different optimization methods generally give different results !

About the methods

The multiplicative update (MU) is the most widely used because of it's simplicity. It is very easy to adapt it to squared loss, (generalized) Kullback-Leibler divergence or Itakura–Saito divergence, which are 3 specific cases of the so-called beta-divergence. All three losses seem used in practice. A regularization L1 or L2 can easily be added.
The Projected gradient (PG) seems very efficient for the squared loss, but does not scale well (w.r.t X size) for the (generalized) KL divergence. A L1 or L2 regularization could possibly be added in the gradient step. I don't know where the sparseness enforcement trick in current code comes from.
The Coordinate Descent (CD) seems even more efficient for squared loss, and we can add easily L1 or L2 regularization. It can be further speeded up by a greedy selection of coordinate. The adaptation for KL divergence is possible with a Newton method for solving subproblem (slower), but without greedy selection. This adaptation is supposed to be faster than MU-NMF with (generalized) KL divergence.

About the initialization Different schemes exist, and can change significantly both result and speed. They can be used independantly for each NMF method.

About the stopping condition Actual stopping condition in PG-NMF is bugged (#2557), and leads to poor minima when the tolerance is not low enough, especially in the random initialization scheme. It is also completely different from stopping condition in MU-NMF, which is very difficult to set. Talking with audio scientists (who use a lot MU-NMF for source seperation) reveals that they just set a number of iteration.

As far as I understand NMF, as there is no unique minimum, there is no perfect loss/method/initialization/regularization. A good choice for some dataset can be terrible for another one. I don't know how many methods we want to maintain in scikit-learn, and how much we want to guide users with few possibilities, but several methods seems more useful than only one.

I have tested MU-NMF, PG-NMF and CD-NMF from scikit-learn code, #2540 and #896, with squared loss and no regularization, on a subsample of 20news dataset, and performances are already very different depending on the initialization (see below).

Which methods do we want in scikit-learn? Why do we have stopped #1348 or #896 ? Do we want to continue #2540 ? I can work on it as soon as we have decided.

NNDSVD (similar curves than NNDSVRAR) nndsvd NNDSVDA nndsvda Random run 1 random_2 Random run 2 random_3 Random run 3 random_4

amueller commented 9 years ago

Thank you for this great summary. Your evaluation doesn't include greedy selection, right? @mblondel any reason you didn't implement that?

What are other state-of-the-art implementations?

Also, I'd like to see how the methods perform on different datasets. I think sparse data is more common with NMF, but it would be good to have some results on dense data, too. Maybe MNIST, maybe audio? I'm not sure what good datasets are. There are two face datasets that are used in the papers. Can we use them? Or maybe lfw or olivetti?

amueller commented 9 years ago

Btw, it seems like #1348 stalled because no-one did a "convincing" empirical comparison. I think #896 was not included yet because there was not time to do so. #2540 is missing an mini-batch version, I think (didn't look in detail, though).,

vene commented 9 years ago

Scattered thoughts:

We could try to pick up #2557 and replace the projected gradient solver with the one from this notebook that solves the regularization/sparseness awkwardness, to give the PG-NMF method a fair trial, and see if CD consistently outperforms it (for tall & wide, dense & sparse data)

Could you explain the bit about projected gradient scaling worse when KL divergence loss is used?

I'm a bit worried that the different inits lead to such different results. I wonder if the solutions are all equally "good". Maybe we could stick a classifier after it?

I agree with @amueller we should bench this on text (20newsgroups) and faces or MNIST.

TomDLT commented 9 years ago

Ok, I am working on more benchmarking, including dense and large datasets.

Could you explain the bit about projected gradient scaling worse when KL divergence loss is used?

I read it in this paper, but it is not very detailed. Projected gradient's original paper does not deal with KL divergence since it is not well defined in 0. Actually, I did not get into the details, but as far as I have searched, I did not find other interesting papers talking about a projected gradient method with KL-divergence.

vene commented 9 years ago

It seems actually that the Yang et al paper says that projected gradient has problems with I-divergence (generalized KL?) but not with normalized KL, and (quoting from page 7)

By contrast, [KL-Armijo] using KL-divergence and projected gradient descent can find even better objectives. This is probably because monotonicity guaranteed by multiplicative updates may not imply the full KKT condition.

Is I-divergence really that slow to converge with gradient-based methods? Maybe we should try. @mblondel do you have any intuition?

amueller commented 9 years ago

this might also be relevant: http://www.jmlr.org/proceedings/papers/v28/kumar13b.pdf they solve it exactly under certain conditions.

Also, using ADMM should work fairly well and is easy to implement. It should be much faster than projected gradient descent. via @bmcfee

GaelVaroquaux commented 9 years ago

Also, using ADMM should work fairly well and is easy to implement. It should be much faster than projected gradient descent. via @bmcfee

In my lab, we have had very bad experiences with ADMM: its convergence is sensitive to the rho parameter, and sometimes it does not converge (the acceptable values of rho typically depend on the input data).

amueller commented 9 years ago

You need to dynamically scale the rho. Maybe you just need an ADMM expert ;)

amueller commented 9 years ago

I have no idea, I'm just channeling @bmcfee and adding snarkiness btw.

GaelVaroquaux commented 9 years ago

You need to dynamically scale the rho. Maybe you just need an ADMM expert ;)

That didn't work for us. On two different problems (covariance estimation and linear models with analysis sparsity). We have studied closely the corresponding Boyd paper, and tried the various tricks. And I have heard some pretty famous machine learning and optimization researchers say the same thing as I do.

We almost got it to work: it worked 99% of the time. But the remaining 1% of the time do not give rise to a failure that is easily identified. So we couldn't trust it.

GaelVaroquaux commented 9 years ago

Well, tell @bmcfee that the experts hide their knowledge well enough to make sure that noone benefits from it :).

But rumors do say that setting rho in an ADMM can be a problem.

mblondel commented 9 years ago

Did you use my implementation for CD? Ideally it needs to be Cythonified to obtain the best performance. Currently, it uses Numba.

Did you use the same initialization for each method? I am surprised that CD seems to get stuck in a worse local optimum than other methods. Or maybe this is just the scale of your plot. Could you share your script?

Are the regularization parameters the same? Our projected gradient implementation has a different interface. BTW, we also need to decide whether we want to simplify the interface (like in my implementation).

Use the entire news20 dataset. As the dataset grows, I would expect CD to win.

ADMM is nice for complex regularizers. Non-negativity constraints is pretty easy to handle without resorting to ADMM.

agramfort commented 9 years ago

my colleague recommends you have a look at:

http://arxiv.org/pdf/1401.5226v2.pdf

it contains benchmarks of NMF solvers on various problems.

TomDLT commented 9 years ago

More benchmarking and more details

I used CD-NMF implementation of @mblondel. I Cythonized the critical part, but it does not speed up compared to Numba (yet I am not an expert of Cython).
I used MU-NMF implementation of @larsman in #2540
I used the same initialization for each method, and used no regularization of any kind.
I benchmarked on (entire) 20news (sparse) and Olivetti faces (dense) datasets.
I set a very low tolerance to prevent early stops and used different max_iter with no warm starting.

I will change regularization of PG-NMF, using this notebook, and do more benchmarking with regularization.

20 news 20news Faces faces

mblondel commented 9 years ago

It's nice to see that Numba is on par with Cython. The Cython version will be useful in any case since we cannot depend on Numba yet.

amueller commented 9 years ago

yeah, that looks pretty good. That is all for MSE, right? Do we also want KL?

vene commented 9 years ago

Any idea why there's no more obvious difference in the loss attained by PG and CD? Based on these results I would just keep the CD solver. I would really like this to be the case, it would make everything so much simpler.

Adding regularization might give us more insight. Another dimension not varied across these plots is n_components.

API questions about regularization:

Should we expose l1_reg and l2_reg, or elastic net style alpha, l1_ratio?
Should we tie regularization for both factors, or expose separately l*_reg_components / l*_reg_transformation (better name ideas?)

EDIT: or, easier to use and gridsearch but less expressive: alpha, l1_ratio, regularization={'tied', 'components', 'transformation'}

agramfort commented 9 years ago

With KL we can reach audio people

vene commented 9 years ago

I wouldn't mind KL (I-divergence?) for text either, since it corresponds to pLSI. But frobenius seems to lead to good topics too.

EDIT: Another paper on Frobenius NMF for topic modeling

bearnshaw commented 9 years ago

There is at least one state-of-the-art NMF algorithm that hasn't been discussed yet: Kim and Park's Block Principle Pivoting method. https://3e4b2411-a-62cb3a1a-s-sites.googlegroups.com/site/jingukim/2011_paper_sisc_nmf.pdf

This is a block coordinate descent approach for solving the alternating nonnegative least squares (ANLS) subproblems under the Frobenius norm. It basically speeds up the old active-set methods. It is straight-forward to implement, and in their paper Kim and Park demonstrate, on a variety of datasets (faces and text), that it performs better, both in convergence time and relative reconstruction error, than the MU and PG algorithms already mentioned in this discussion, and about the same as the CD method already mentioned. Interestingly, when used to solve the ANLS subproblems in nonnegative tensor factorization, this algorithm performs better than the CD method (see https://3e4b2411-a-62cb3a1a-s-sites.googlegroups.com/site/jingukim/2011_paper_hpscbook_ntf.pdf).

Python code for this algorithm already exists, although not in the scikit-learn API style: https://github.com/kimjingu/nonnegfac-python.

Definitely worth considering.

vene commented 9 years ago

Interesting. As scikit-learn doesn't deal with tensors, that sounds like further support of the CD solver :)

bearnshaw commented 9 years ago

Seems to me it argues for BPP since the two perform similarly on NMF, and when scikit-learn adds tensor support, the state-of-the-art ANLS solver for NTF will already be implemented. However, to be fair, it looks to me like the comparison done in the paper is to an unaccelerated version of the proposed CD solver (see http://download.springer.com/static/pdf/974/art%253A10.1007%252Fs10898-013-0035-4.pdf). So for proper comparison it would need to be added to @TomDLT 's tests.

vene commented 9 years ago

when scikit-learn adds tensor support

There is currently no plan for that to ever happen. That's what I meant, no reason to go for the fancier newer method if we'll never even need its benefits.

mblondel commented 9 years ago

I would just keep the CD solver. It would make everything so much simpler.

This is also my feeling. Having too many solvers makes things harder in case of refactoring. For instance, adding proper fit_intercept support in Ridge is harder because we have too many solvers to change. The CD code is also pretty simple and has a good stopping criterion.

Do we also want KL?

We definitely do but let's do it one PR at a time. For now let's focus on the squared loss and the API :) BTW, it's called generalized KL divergence or I divergence. The KL divergence is different. Many people (even in papers) make the confusion. Don't ask me why it's called generalized ;) Maybe @TomDLT can volunteer to write a CD solver later :)

API questions about regularization

Maybe elastic net style for consistency with other modules? However, even only L2 regularization will lead to sparse solutions in practice. Regarding factor specific regularization, this sounds complicated. Maybe we can keep the class API simple and expose a function with more options?

vene commented 9 years ago

Thanks @mblondel! Just to be sure: do you have some intuition about whether CD shouldn't work with I-divergence? The paper cited by @TomDLT earlier claims that PG wouldn't work (well), but the claim doesn't seem well backed.

mblondel commented 9 years ago

Could you quote the relevant sentence? One possible reason would be because the I-divergence doesn't have Lipschitz continuous first derivative (i.e. its second derivative is not bounded above). Most gradient-based solvers assume Lipschitz continuous gradient in their convergence proof.

However, for CD, the authors were still able to prove convergence to a stationary point: http://www.cs.utexas.edu/%7Ecjhsieh/nmf_kdd11.pdf

vene commented 9 years ago

Although a number of projected gradient algorithms [...] have been proposed for NMF based on least squared errors, the speedup advantage is more difficult to obtain for the approximation based on the I-divergence. The major reason is that the gradients of I-NMF [...] heavily depend on the scaling of W and H. [...] The badly scaled gradients also make the second order optimization methods such as Newton or quasi-Newton algorithms ill-posed and even fail to converge, as shown in Section 4

From here. But the results in section 4 don't seem to me to really support the very strong language used above.

vene commented 9 years ago

So am I reading this right, and what the kdd'11 paper calls KL-divergence (equation 3) is actually Generalized KL-divergence or I-divergence, right?

side-note: both of those are horrible names. What does the I in I-divergence stand for?

This claims that (paraphrase) "it's generalized in the sense that KL-Div(a, b) = I-Div(a, b) if a and b are probability distributions." I-Div is also a Bregman divergence.

mblondel commented 9 years ago

what the kdd'11 paper calls KL-divergence (equation 3) is actually Generalized KL-divergence or I-divergence, right?

Yes

TomDLT commented 9 years ago

After talking with @agramfort and various audio colleagues, it appears that:

More loss functions would be very useful !
The beta-divergence generalizes the I-divergence (aka generalized Kullback-Leibler divergence), the Itakura-Saito divergence, and also the Frobenius distance (aka squared norm).
These 3 divergences are used in practice, depending on the data. The MU method is the most used, and can easily implement the three losses through the beta-divergence. MU can be slightly generalized with the so-called Maximization-Minimization (MM) method.
For the particular case of the Frobenius norm, some other solvers are faster than MM, such as PG or CD methods.
The initialization is very important. Maybe we could add some classical heuristic, like taking the lowest error in several random initializations, or allowing the user to pass the matrices he wants as initialization.

I suggest to implement both MM method (in order to have easily different losses) and CD method (which seems a very efficient method for the Frobenius norm, as we have seen in the benchmarks).

In term of API, @vene and @mblondel suggested elastic net style (with alpha and l1_ratio), with the same regularization for W and H in the class. The class will call a function with more options to regularize either W, H or both.

Should we have several classes (MultiplicativeNMF, ProjectedGradientNMF, ...), or one NMF class and several solvers?

How should we deal with current solver? with current regularization trick?

amueller commented 9 years ago

I vote for one NMF class, different solvers. Deprecate the current PG solver and the parameters associated to it? It would make for a lot of parameter duplication for the next releases but it shouldn't be that big a deal, right?

vene commented 9 years ago

Should we have several classes (MultiplicativeNMF, ProjectedGradientNMF, ...), or one NMF class and several solvers?

I think we are avoiding naming classes after solvers. One NMF class that delegates to different solver functions might work.

How should we deal with current solver? with current regularization trick?

Sorry, I'm not exactly sure what you mean by this. If we implement CD there's no reason to keep PG unless benchmarks back it up in certain settings.

It would make for a lot of parameter duplication

Alternatively we could name the new class NonnegativeFactorization and deprecate NMF.

Does the maximization-minimization method support elastic-net penalty? Eqs. 73-74 give a solution for L1-penalty. If not, we could restrict the API to penalizing just with L1, and support the fully-expressive non-tied elastic net formulation in a cd_nmf function. @mblondel noted that L2 regularization + non-negative constraints will lead to sparse solutions anyway, maybe the L2 penalty converges faster?

ogrisel commented 9 years ago

There is also: The Diagonalized Newton Algorithm for Nonnegative Matrix Factorization by Hugo Van hamme (http://arxiv.org/abs/1301.3389) mentioned by @vene.

vene commented 9 years ago

In my comment I claimed that this algorithm was shown to outperform CD. It doesn't look like that is accurate, the comparisons are only against multiplicative updates. No idea why I said that. Maybe I thought CD meant something else? Sorry!

amueller commented 9 years ago

Should we close as #4852 was merged? Or keep open to discuss adding more losses?

TomDLT commented 9 years ago

Based on this discussion, I have worked on adding a multiplicative-update solver for NMF (more precisely Maximization-Minimization) in PR #5295. It handles all beta-divergences, including Frobenius norm, generalized Kullback-Leibler divergence and Itakura-Saito divergence.

TomDLT commented 7 years ago

Closing this since #4852 and #5295 are merged.

scikit-learn / scikit-learn

Discussion about adding NMF methods #4811