tensorly / tensorly

TensorLy: Tensor Learning in Python.
http://tensorly.org

Optimization submodule #80

Open cohenjer opened 5 years ago

cohenjer commented 5 years ago

Adding an optimization module

For now, TensorLy (TL) ships with one API for each tensor decomposition model. While this has the advantage of simplicity for the end user, it limits the settings in which TL can be used.

Proposition: optimization submodule

I think a very cool feature for TensorLy would be, just like in PyTorch or SciPy, an optimization module with a few of the algorithms commonly used for constrained tensor decompositions (nonnegative least squares, Lasso, Gauss-Newton, ADMM...). This matters in my opinion because specific applications sometimes require specific algorithms and constraint sets, and there should not be a single default for every scenario.

API Design

There are mainly two possibilities for such an integration.

1/ Do not use a separate submodule, but instead make one API for each constrained factorization (such as the current non_negative_parafac) with one default optimization method.
2/ Write a contrib.optim module, and use it in a decomposition.constrained_parafac function where one may choose the optimization method and the constraint set (among a few specific choices).

My opinion: go for contrib.optim. But I will run some tests and see how both solutions behave in practice. The only constraint I think should be handled in a very different way is sparsity, along the lines of issue #79.
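
For illustration, a minimal usage sketch of option 2 (constrained_parafac and its method/constraints arguments are hypothetical names, not existing TensorLy functions; this toy dispatcher simply falls back on routines TensorLy already ships):

from tensorly.decomposition import parafac, non_negative_parafac

def constrained_parafac(tensor, rank, method='als', constraints=None):
    # hypothetical dispatcher: pick a solver from a contrib.optim-like registry
    # based on the requested method and constraint set; here we simply reuse
    # the decompositions TensorLy already provides
    if constraints == 'nonnegative':
        return non_negative_parafac(tensor, rank=rank)
    if method == 'als':
        return parafac(tensor, rank=rank)
    raise NotImplementedError("method=%r, constraints=%r" % (method, constraints))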

Remark: using optimization backends?

After some tests and thinking (see for instance this post), I do not think that using an optimization backend such as CVXPY is satisfactory since:

JeanKossaifi commented 5 years ago

Great to have you on board @cohenjer and looking forward to the new algos! :)

As we discussed, I agree on all points:

Regarding sparsity, it is still work in progress (#76, #77) but ideally we could simply try using the same algos (as done in #77) and specialise them as we go where needed.

JeanKossaifi commented 5 years ago

One really cool feature to add would be, given a CP decomposition learned on a given tensor, approximating the CP decomposition of a new one. This would allow us to create a scikit-learn-like object implementing fit and transform. Thoughts?

animakumar commented 5 years ago

Yes! But can we ultimately make it more general, e.g. if it is Tucker, keep it as Tucker? For instance, if you contract a CP tensor with factors, the Tucker form is more efficient for the output. Can we also design heuristics to figure that out?

Anima


cohenjer commented 5 years ago

Indeed, it is probably an interesting feature. I think something very similar is implemented, for instance, in Tensorlab v3 and later, where a tensor may be stored as a structured tensor instead. Is this what you are mentioning?

For instance, one may not manipulate the full dataset but only a Tucker or a Tensor Train representation. As mentioned by Anima, this not only allows for efficient storage but also for efficient computations.

Anyway, this feature can be obtained by toying with the unfolding function and creating a structured tensor multiway product.

JeanKossaifi commented 5 years ago

Operating directly on decomposed tensors is a good feature - I am working on it. However in this case I am suggesting working not on the data structure modelling tensors directly but on the object we use to decompose them.

I am not sure if you are familiar with scikit-learn: basically, you can create an object, say estimator = PCA(), which you can then fit to your data with estimator.fit(X_train). Then, given new unseen data, you can project it onto the learned subspace with estimator.transform(X_test).

In this case, I meant doing something similar for CP: once you obtain the CP decomposition of a tensor, can you fit a new tensor with it? This is, however, a little more involved than the Tucker case, where we can just use the learned (orthogonal) projection factors to get the new core. Currently the decompositions are just functions, but my goal is to move towards such objects. @cohenjer, it would also be easier to encapsulate more algorithms, like the ones you propose in this issue, in such objects, and hide the complexity from the users.
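
For readers unfamiliar with it, this is the scikit-learn workflow being referred to (plain scikit-learn code, shown only to illustrate the fit/transform pattern):

import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.random((100, 20))   # training data
X_test = np.random.random((30, 20))     # new, unseen data

estimator = PCA(n_components=5)
estimator.fit(X_train)                      # learn the projection from X_train
X_test_proj = estimator.transform(X_test)   # project unseen data onto the learned subspace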

animakumar commented 5 years ago

Yes, this would be fantastic to have. What I was suggesting was to go one step further and figure out the optimal configuration of the output tensor (for instance, sparse can become dense after transformation, CP can become Tucker, etc.).

Anima


cohenjer commented 5 years ago

I have made a first version of an optimization module in tensorly.contrib on my tensorly fork. For now, I basically copied the ALS implementation of tl.decomp.parafac_candecomp.parafac into tl.contrib.optim.optim_tensor_ls, and created a local version of candecomp_parafac, optim_candecomp_parafac. I also added a projected least squares algorithm (the easiest but worst constrained least squares solver) just for demo purposes.

The way this works is the following: the parafac function in optim_candecomp_parafac has a parameter Method which can be chosen by the user. If Method is 'ALS', parafac calls the corresponding optimization function in optim_tensor_ls; the same goes if Method is 'NALS'. The advantage is that if, say, I were to implement a powerful nonnegative solver tuned for tensor problems, it could be plugged into the set of methods for parafac very easily.

This being said, the code as of now is probably a mess, in particular the "import" parts. We can discuss this anytime, and I'll work on implementing better stuff (like ADMM) in the near future.
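
For concreteness, here is a simplified, self-contained sketch of the dispatch idea described above (this is not the code from the fork: parafac_with_method and the two update rules are toy placeholders, and the 'NALS' update is just the crude projected least squares step mentioned above):

import numpy as np
import tensorly as tl
from tensorly.tenalg import khatri_rao

def als_update(unfolded, kr):
    # unconstrained least squares update for one factor
    return unfolded @ np.linalg.pinv(kr.T)

def nnls_update(unfolded, kr):
    # projected least squares: solve unconstrained, then clip negative entries
    return np.clip(als_update(unfolded, kr), 0, None)

def parafac_with_method(tensor, rank, method='ALS', n_iter_max=50):
    # pick the per-mode update rule according to the requested method
    update = {'ALS': als_update, 'NALS': nnls_update}[method]
    factors = [np.random.random((dim, rank)) for dim in tl.shape(tensor)]
    for _ in range(n_iter_max):
        for mode in range(tl.ndim(tensor)):
            kr = khatri_rao(factors, skip_matrix=mode)
            factors[mode] = update(tl.unfold(tensor, mode), kr)
    return factors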

JeanKossaifi commented 5 years ago

Thanks for the push! I agree that we want to move toward an API that provides a solver argument so we can specify, e.g., 'ALS' or 'NN-ALS'.

However, currently it is more or less just a change of API obtained by copying code into separate functions (essentially, each function performs one iteration of the corresponding algorithm). In particular, these functions are problem-dependent (i.e. in their current form, they only apply to the very specific case of the CP decomposition).

I think the way forward to changing the API is to have a class Parafac that takes a solver as a parameter, by either:

  1. making each solver function generic, e.g. an ALS solver that takes the right- and left-hand sides and returns the update, so it can also be used for Tucker, etc.
  2. having specific functions, e.g. parafac_als, that we call depending on the solver

I think 2) is the best way forward as it is the easiest to implement. 1) is more general and perhaps conceptually more satisfying, but in practice it might make the code much more complicated (especially if we want to optimise the code for each case; for instance, a Khatri-Rao product times an unfolded tensor can be computed efficiently).

In any case it's great to add more methods, e.g. projected ALS and ADMM! As a side note, ADMM is implemented for Robust Tensor PCA :)

cohenjer commented 5 years ago

Yes, I agree with you, solution (2) seems more appealing. I think (1) would be formally nice, but in practice either slow or difficult to optimize in terms of computational performance.

For now, indeed, I just copy-pasted and rewrote a few lines, but this is just to help with the choice of the API. Once we reach something we like, I will add more interesting stuff. I'll try to update according to (2) ASAP, but feel free to modify my fork as well.

JeanKossaifi commented 5 years ago

Maybe this is the occasion to introduce a class Parafac, implementing fit and transform. We could then have a nice consistent API:

cp = Parafac(rank=10, solver='ALS') # or 'NN-ALS', etc...
cp.fit(tensor)
factors = cp.transform(tensor)
new_factors = cp.transform(new_tensor)

The only issue is the transform for a new tensor that the CP was not fitted on: unlike in the Tucker case, projecting to and from the latent subspace is not trivial.

cohenjer commented 5 years ago

Ok here is a tentative API skeleton for such a class. I think now is also a good time to add the discussion around missing data #4 and fixed modes #72, which are respectively handled by weighting the entries of the input tensor, and by trivially not updating the fixed modes (I defined an attribute for that). This is just a first version, and I am no python expert, but at least this is what I have in mind.

I think, however, there is some confusion in your proposition: cp.fit(tensor) will basically compute a set of factors, so why do you need a method cp.transform to compute the factors? Do you not rather want to compute a core tensor? This is the same with the Tucker decomposition: given a fit (that is, given a set of factors U_i), I picture a tucker.transform method that would project the data tensor onto the span of the U_i's. This can be done here using the factors obtained with cp.fit (the resulting core would be almost diagonal). Therefore, I would rather have written core = cp.transform(tensor) and core_new = cp.transform(new_tensor). core_new would be the new tensor projected onto the span of the factors computed using cp.fit(tensor). Does this make sense?

class Parafac:
    def __init__(self, rank, method='ALS', init_method='random',
                 constraints=None, weights=None, fixed_modes=None):
        # define the attributes here; there could also be a 'data' attribute
        # if an instance of Parafac should have some data attached to it
        self.rank, self.method, self.init_method = rank, method, init_method
        self.constraints, self.weights, self.fixed_modes = constraints, weights, fixed_modes

    def fit(self, data, init_factors=None):
        # a wrapper calling the correct solver depending on 'method' and 'constraints';
        # the fitted factors are stored so that factors = cp.factors works after cp.fit(tensor)
        self.factors = ...  # placeholder: call the selected solver here
        return self

    # here we define the various standard optimization methods for Parafac, such as ALS,
    # projected ALS, (accelerated) gradient descent, non-linear least squares, AO-ADMM and so on
    def als(self): ...
    def admm(self): ...

    def transform(self, input_tensor, input_factors=None):
        # projects input_tensor on the subspace spanned by the factors;
        # default factors are the ones obtained by the last call to fit()
        ...
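
For illustration, a minimal sketch of the projection that this transform describes (project_on_factors is a hypothetical helper, not an existing TensorLy function; for non-orthogonal CP factors the mode-wise projection uses pseudo-inverses):

import numpy as np
from tensorly.tenalg import multi_mode_dot

def project_on_factors(tensor, factors):
    # express the tensor in the basis spanned by the factors of each mode;
    # if the factors come from a CP fit of this same tensor, the resulting
    # core is approximately superdiagonal (it essentially carries the weights)
    return multi_mode_dot(tensor, [np.linalg.pinv(f) for f in factors])
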
JeanKossaifi commented 5 years ago

Yes, this is what I mentioned: in the Tucker case it is easy to transform, we simply return the core.

However, in the CP case we only get the weights (equivalent to a super-diagonal tensor). But we could also, for instance, fix all modes but one (e.g. the batch dimension) and recover a new factor for this mode only. However, this becomes quite problem-dependent, so I guess for now returning the weights only, and finding the optimal weights given a new tensor, makes most sense.

Regarding the class, I would have something similar, but leave admm and als out, e.g. as functions parafac_als and parafac_admm, which would simply be called by fit :) Scikit-learn is a great example for this.
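
For illustration, a minimal sketch of this design, assuming hypothetical standalone functions parafac_als and parafac_admm (the ALS one just reuses TensorLy's existing parafac, which is ALS-based; the ADMM one is left as a stub):

from tensorly.decomposition import parafac as tl_parafac

def parafac_als(tensor, rank):
    # ALS is what tensorly.decomposition.parafac already implements
    return tl_parafac(tensor, rank=rank)

def parafac_admm(tensor, rank):
    raise NotImplementedError("ADMM solver still to be written")

class Parafac:
    _solvers = {'ALS': parafac_als, 'ADMM': parafac_admm}

    def __init__(self, rank, solver='ALS'):
        self.rank = rank
        self.solver = solver

    def fit(self, tensor):
        # dispatch to the standalone solver function selected at construction
        self.factors_ = self._solvers[self.solver](tensor, self.rank)
        return self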

cohenjer commented 5 years ago

Hi Jean, I'll be working on implementing the Parafac class, and probably a few algorithms for unconstrained (conjugate gradient, ALS with line search) and nonnegative CPD (HALS, a first version of ADMM), in the coming week; let us discuss the class structure after that. I'll try to mimic what I can find in scikit-learn.

cohenjer commented 5 years ago

Some demo:

https://github.com/cohenjer/tensorly/blob/master/tensorly/contrib/optimization/tests/test_class_parafac.ipynb

Should I open a pull request? There is still a lot to do (in particular, a large amount of work rewriting some of the docs to account for the new syntax; I also want to add more algorithms, but I have no time right now).

JeanKossaifi commented 5 years ago

Thanks, I'll have a look in the coming days.