neurospin / pylearn-parsimony_history

Sparse and Structured Machine Learning in Python
BSD 3-Clause "New" or "Revised" License

Dealing with intercepts #3

Open vguillemot opened 11 years ago

vguillemot commented 11 years ago

For regression, centering the X matrix leads to a model similar to the one with an intercept. For classification (logistic regression) this is not the case. There are three approaches to address this issue:

1) Dummy: add a dummy variable (a column of ones) to the X matrix.

2) Separate: handle the "betas" and the "intercept" separately.

3) As loss: add a loss term to handle the intercept.

Pros and cons:

1) Dummy: Pro: "f" and the Lipschitz constant calculations are handled without modifications (see the sketch after this list). Cons:

2) Separate: Pro: the penalisation calculations are handled without modifications. Cons: the optimisation algorithms (FISTA etc.) need to be modified to take two parameters, beta + intercept, instead of one.

3) As loss: to be argued.
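For concreteness, a minimal NumPy sketch of the dummy-variable approach; the helper name add_intercept_column is illustrative, not part of the library:

```python
import numpy as np

def add_intercept_column(X):
    """Prepend a column of ones so the first weight plays the role of the intercept."""
    n = X.shape[0]
    return np.hstack((np.ones((n, 1)), X))

# The model X_aug @ beta now contains beta[0] as the (ideally unpenalised) intercept.
X = np.random.randn(5, 3)
X_aug = add_intercept_column(X)
```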

tomlof commented 10 years ago

I think that it would be good if we could leave the algorithms unchanged. It would be possible and fairly simple to tell the functions that they should handle an intercept.

tomlof commented 10 years ago

So, after a discussion with @duchesnay and @FH235918 yesterday, we decided to try the following approach:

We only modify the penalties, so that each penalty is given the number of columns, counted from the start of the matrix, that it should ignore.

In the simplest case, with just the intercept, we tell e.g. the lasso penalty (L1) to ignore the first column, and only work with columns 2 through the end.

In a more general case, we could have several columns that we wish not to be penalised.

Proposed API change:
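A hedged sketch of what such a penalty could look like; the class shape and the parameter name penalty_start are illustrative assumptions, not the actual API:

```python
import numpy as np

class L1Sketch(object):
    """Illustrative L1 penalty that ignores the first penalty_start columns."""

    def __init__(self, l=1.0, penalty_start=0):
        self.l = l
        # Number of leading (unpenalised) columns, e.g. 1 for a single intercept.
        self.penalty_start = penalty_start

    def f(self, beta):
        # Only the penalised part of beta contributes to the function value.
        return self.l * np.linalg.norm(beta[self.penalty_start:], 1)
```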

When the ignored columns are the first consecutive ones, like this, a slice like X[:, ignore:] will not copy the data but will reference the underlying memory, so this change will not increase memory usage.
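This view behaviour is easy to verify in NumPy, since basic slicing returns a view rather than a copy:

```python
import numpy as np

X = np.random.randn(100, 10)
X_pen = X[:, 1:]                   # basic slicing: a view, not a copy
print(np.shares_memory(X, X_pen))  # True: no extra memory is allocated
```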

Comments, suggestions?

duchesnay commented 10 years ago

Perfect!

duchesnay commented 10 years ago

Fouad:

How do we compute the proximal operator in the presence of "intercept variables", i.e. unpenalised variables?

FH235918 commented 10 years ago

Let's say that we are dealing with the simple ridge model and a single intercept, so the penalty is the l2 norm of the remaining (p-1) variables. We can write this as a sum of two separable functions: f1(beta_1, ..., beta_(p-1)) = the l2 norm of this vector of p-1 variables, and f2(beta_0) = 0.

In this case, the proximal operator (which returns a vector of p variables) is simply the concatenation of the two proximal operators. Since the proximal operator of the zero function is easily checked to be the identity, the full proximal operator is just the identity applied to the intercept, concatenated with the proximal operator of the l2 norm applied to the p-1 remaining variables.
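A minimal sketch of this separable prox for the squared l2 (ridge) penalty, assuming the intercept sits in the first coordinate; the function name and the penalty_start parameter are illustrative:

```python
import numpy as np

def prox_ridge(beta, l, step, penalty_start=1):
    """Prox of step * (l / 2) * ||beta[penalty_start:]||_2^2.

    The first penalty_start coordinates (here: the intercept) are left
    untouched, since the prox of the zero function is the identity.
    """
    out = beta.copy()
    # The prox of the squared l2 norm is a simple shrinkage towards zero.
    out[penalty_start:] = beta[penalty_start:] / (1.0 + step * l)
    return out

beta = np.array([2.0, 1.0, -3.0, 0.5])
print(prox_ridge(beta, l=1.0, step=0.1))  # beta[0] comes back unchanged
```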

tomlof commented 10 years ago

Perfect!

tomlof commented 10 years ago

I have pushed the intercept changes to the multiblock branch ( ;-) ).

I will start testing it now.

tomlof commented 10 years ago

So, I have tested all (most) penalties, and it seems to be working. Let me know if you see anything weird.

Unfortunately, slicing the weight vector (beta) and/or the X matrix makes the code roughly half as fast: it takes about twice as long to run.

So, our assumption that slicing is essentially free seems to be wrong. There is considerable overhead.

However, I have now made it so that if there is no intercept, nothing is sliced and the speed remains optimal.
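A hedged sketch of that guard, reusing the illustrative penalty_start convention from above (not the actual code):

```python
import numpy as np

def l1_value(beta, l=1.0, penalty_start=0):
    """L1 penalty value; slice beta only when leading columns are unpenalised."""
    if penalty_start > 0:
        beta_ = beta[penalty_start:]  # slicing has measurable overhead, so only do it when needed
    else:
        beta_ = beta  # fast path: no intercept, no slicing at all
    return l * np.linalg.norm(beta_, 1)
```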