neurospin / pylearn-parsimony_history

Sparse and Structured Machine Learning in Python
BSD 3-Clause "New" or "Revised" License

Dealing with intercepts #3

Open vguillemot opened 11 years ago

vguillemot commented 11 years ago

For regression, centering the X matrix leads to a model similar to the one with an intercept. For classification (logistic regression) this is not the case. There are three approaches to address this issue:

1) Dummy: add a dummy variable (a column of ones) to the X matrix.

2) Separate: handle the "betas" and the "intercept" separately.

3) As loss: add a loss term to handle the intercept.

Pros and cons:

1) Dummy: Pro: "f" and the Lipschitz constant calculations are handled without modifications (see the sketch after this list). Cons:

2) Separate: Pro: the penalisation calculations are handled without modifications. Cons: the optimisation algorithms (FISTA etc.) need to be modified to take two parameters, beta + intercept, instead of one.

3) As loss: to be argued.
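For concreteness, a minimal NumPy sketch of the dummy-variable approach; the helper name add_intercept_column is illustrative, not part of the library:

```python
import numpy as np

def add_intercept_column(X):
    """Prepend a column of ones so the first weight plays the role of the intercept."""
    n = X.shape[0]
    return np.hstack((np.ones((n, 1)), X))

# The model X_aug @ beta now contains beta[0] as the (ideally unpenalised) intercept.
X = np.random.randn(5, 3)
X_aug = add_intercept_column(X)
```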

tomlof commented 10 years ago

I think that it would be good if we could leave the algorithms unchanged. It would be possible and fairly simple to tell the functions that they should handle an intercept.

tomlof commented 10 years ago

So, after a discussion with @duchesnay and @FH235918 yesterday, we decided to try the following approach:

We only modify the penalties, so that each penalty is given the number of columns, counted from the start of the matrix, that it should ignore.

In the simplest case, with just the intercept, we tell e.g. the lasso penalty (L1) to ignore the first column, and only work with columns 2 through the end.

In a more general case, we could have several columns that we wish not to be penalised.

Proposed API change:
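A hedged sketch of what such a penalty could look like; the class shape and the parameter name penalty_start are illustrative assumptions, not the actual API:

```python
import numpy as np

class L1Sketch(object):
    """Illustrative L1 penalty that ignores the first penalty_start columns."""

    def __init__(self, l=1.0, penalty_start=0):
        self.l = l
        # Number of leading (unpenalised) columns, e.g. 1 for a single intercept.
        self.penalty_start = penalty_start

    def f(self, beta):
        # Only the penalised part of beta contributes to the function value.
        return self.l * np.linalg.norm(beta[self.penalty_start:], 1)
```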

When the ignored columns are the first consecutive ones, like this, a slice like X[:, ignore:] will not copy the data but will reference the underlying memory, so this change will not increase memory usage.
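This view behaviour is easy to verify in NumPy, since basic slicing returns a view rather than a copy:

```python
import numpy as np

X = np.random.randn(100, 10)
X_pen = X[:, 1:]                   # basic slicing: a view, not a copy
print(np.shares_memory(X, X_pen))  # True: no extra memory is allocated
```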

Comments, suggestions?

duchesnay commented 10 years ago

Perfect!

duchesnay commented 10 years ago

Fouad:

How do we compute the proximal operator in the presence of "intercept variables", i.e. unpenalised variables?

FH235918 commented 10 years ago

Let's say that we are dealing with the simple ridge model and a single intercept, so the penalty is the l2 norm of the remaining (p-1) variables. We can write this as a sum of two separable functions: f1(beta_1, ..., beta_(p-1)) = the l2 norm of this vector of p-1 variables, and f2(beta_0) = 0.

In this case, the proximal operator (which returns a vector of p variables) is simply the concatenation of the two proximal operators. Since the proximal operator of the zero function is easily checked to be the identity, the full proximal operator is just the identity applied to the intercept, concatenated with the proximal operator of the l2 norm applied to the p-1 remaining variables.
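A minimal sketch of this separable prox for the squared l2 (ridge) penalty, assuming the intercept sits in the first coordinate; the function name and the penalty_start parameter are illustrative:

```python
import numpy as np

def prox_ridge(beta, l, step, penalty_start=1):
    """Prox of step * (l / 2) * ||beta[penalty_start:]||_2^2.

    The first penalty_start coordinates (here: the intercept) are left
    untouched, since the prox of the zero function is the identity.
    """
    out = beta.copy()
    # The prox of the squared l2 norm is a simple shrinkage towards zero.
    out[penalty_start:] = beta[penalty_start:] / (1.0 + step * l)
    return out

beta = np.array([2.0, 1.0, -3.0, 0.5])
print(prox_ridge(beta, l=1.0, step=0.1))  # beta[0] comes back unchanged
```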

tomlof commented 10 years ago

Perfect!

tomlof commented 10 years ago

I have pushed the intercept changes to the multiblock branch ( ;-) ).

I will start testing it now.

tomlof commented 10 years ago

So, I have tested all (most) penalties, and it seems to be working. Let me know if you see anything weird.

Unfortunately, slicing the weight vector (beta) and/or the X matrix makes the code roughly half as fast: it takes about twice as long to run.

So, our assumption that slicing is essentially free seems to be wrong. There is considerable overhead.

However, I have now made it so that if there is no intercept, nothing is sliced and the speed remains optimal.
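A hedged sketch of that guard, reusing the illustrative penalty_start convention from above (not the actual code):

```python
import numpy as np

def l1_value(beta, l=1.0, penalty_start=0):
    """L1 penalty value; slice beta only when leading columns are unpenalised."""
    if penalty_start > 0:
        beta_ = beta[penalty_start:]  # slicing has measurable overhead, so only do it when needed
    else:
        beta_ = beta  # fast path: no intercept, no slicing at all
    return l * np.linalg.norm(beta_, 1)
```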