jolars / SLOPE

Sorted L1 Penalized Estimation
https://jolars.github.io/SLOPE
GNU General Public License v3.0

Reconsider penalty scaling for SLOPE #11

Open jolars opened 4 years ago

jolars commented 4 years ago

In SLOPE version 0.3.0 and above, the penalty in the SLOPE objective is scaled depending on the type of scaling used in the call to SLOPE().
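
For concreteness, the objective in question can be written with a generic penalty-scaling factor (a sketch only; $c(n)$ stands in for whatever factor a given scaling type implies, e.g. $1$, $\sqrt{n}$, or $n$):

```latex
\min_{\beta \in \mathbb{R}^p}
  \frac{1}{2} \lVert y - X\beta \rVert_2^2
  + c(n) \sum_{j=1}^{p} \lambda_j \lvert \beta \rvert_{(j)},
\quad \text{where } \lvert \beta \rvert_{(1)} \ge \lvert \beta \rvert_{(2)} \ge \dots \ge \lvert \beta \rvert_{(p)}
```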

There are advantages and disadvantages of doing this kind of scaling, and I think a discussion is warranted regarding what the correct behavior should be.

Pros

Cons

Possible solutions

Whichever way we go with this, I think we should keep the other option available as a toggle, i.e. add an argument along the lines of penalty_scaling to turn penalty scaling on or off, or even to provide a more fine-grained type of penalty scaling (see the sketch below). That way, either behavior would be achievable, which means this discussion is really about what the default should be.
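
Something like this, say (purely hypothetical: penalty_scaling is just a proposed name, not an existing argument, and the values shown are only possibilities):

```r
library(SLOPE)

x <- matrix(rnorm(100 * 2), 100, 2)
y <- rnorm(100)

# Hypothetical toggle proposed in this issue; not a released argument.
fit_off <- SLOPE(x, y, penalty_scaling = "none")
# A more fine-grained variant could name the factor itself:
fit_sqrt <- SLOPE(x, y, penalty_scaling = "sqrt_n")
```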

Thoughts? Ideas?

References

Hastie et al. (2015) mention that scaling with n is "useful for cross-validation" and makes lambda values comparable across different sample sizes, but otherwise don't seem to discuss it.
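
For instance, glmnet divides the squared-error loss by the number of observations, which is equivalent to scaling the penalty by n. A small sketch of why that makes a single lambda comparable across sample sizes:

```r
library(glmnet)
set.seed(1)

n <- 100
p <- 10
x <- matrix(rnorm(2 * n * p), 2 * n, p)
y <- drop(x %*% rnorm(p) + rnorm(2 * n))

# glmnet minimizes (1/(2n)) * ||y - X beta||^2 + lambda * penalty, so the
# same lambda value gives comparably regularized fits at n and 2n
# observations -- convenient when lambda is picked by cross-validation.
fit_half <- glmnet(x[seq_len(n), ], y[seq_len(n)], lambda = 0.1)
fit_full <- glmnet(x, y, lambda = 0.1)
```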

scikit-learn has a brief article covering these things here: https://scikit-learn.org/stable/auto_examples/svm/plot_svm_scale_c.html

JonasWallin commented 4 years ago

As the default, I would use the same as glmnet? I agree that it should definitely be an option. Could you put in some references to what people are doing in different places?

jolars commented 4 years ago

As the default, I would use the same as glmnet? I agree that it should definitely be an option. Could you put in some references to what people are doing in different places?

I updated the post with a couple of references, but I'm having a hard time finding more on this topic.

JonasWallin commented 4 years ago

Could you start an Overleaf document for this as well? We should write down the equations so we can have a clearer discussion about them. Furthermore, the naming should refer to the scaling, not the loss function, in my opinion. I.e., 'l1' should be 'none', and then if we implement an 'l1' loss, we can say that the default scaling there is 'none'?

jolars commented 4 years ago

Could you start an Overleaf document for this as well? We should write down the equations so we can have a clearer discussion about them.

Yes, absolutely.

Furthermore, the naming should refer to the scaling, not the loss function, in my opinion. I.e., 'l1' should be 'none', and then if we implement an 'l1' loss, we can say that the default scaling there is 'none'?

I'm not exactly sure what you mean here.

JonasWallin commented 3 years ago

I'm not exactly sure what you mean here.

scaling = "l1", no scaling is applied. The scaling is should not be named after lose function so rather. scaling = 'none'.