
RFC Sample weight invariance properties #15657

Open rth opened 5 years ago

rth commented 5 years ago

This can wait until after the release.

A discussion happened in the GLM PR https://github.com/scikit-learn/scikit-learn/pull/14300 about what properties we would like sample_weight to have.

Current Versions

First, a short side comment about 3 ways sample weights (s_i) are currently used in loss functions with regularized generalized linear models in scikit-learn (as far as I understand).
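Schematically, and consistent with the properties discussed below (the exact formulations and notation in the original post may differ), these can be written, with $\ell_i(w)$ the per-sample loss, $s_i$ the sample weights, $n$ the number of samples and $\Omega(w)$ the penalty, as

$$
L_{1a}(w) = \sum_i s_i \, \ell_i(w) + \alpha \, \Omega(w), \qquad
L_{2a}(w) = \frac{1}{n} \sum_i s_i \, \ell_i(w) + \alpha \, \Omega(w), \qquad
L_{2b}(w) = \frac{1}{\sum_i s_i} \sum_i s_i \, \ell_i(w) + \alpha \, \Omega(w)
$$

i.e. an unnormalized data term, a data term normalized by n_samples, and a data term normalized by the sum of the weights, respectively.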

Properties

For sample weights it is useful to think in terms of invariance properties, as they can be directly expressed in common tests. For instance,

  1. setting a sample weight to zero is equivalent to ignoring the corresponding sample. Checking this in https://github.com/scikit-learn/scikit-learn/pull/15015 (replaced by #17176) helped discover a number of issues. All of the above formulations should satisfy this, but it currently holds only for L_1a and L_2b (a minimal sketch of this kind of check is given after this list).

Similarly, paraphrasing https://github.com/scikit-learn/scikit-learn/pull/14300#issuecomment-543177937, other properties we might want to enforce are:

  2. multiplying a sample weight by N is equivalent to repeating the corresponding sample N times. This is verified only by L_1a and L_2b. Example: for L_2a, setting all weights to 2 is equivalent to having twice as many samples only if α is rescaled to α / 2.

  3. Finally, rescaling all sample weights by a constant has no effect. This is only verified by L_2b. For both L_1a and L_2a, multiplying all sample weights by k is equivalent to rescaling α to α / k.

    This one is more controversial. Against enforcing it:

    • there are arguments for keeping a meaningful absolute scale, e.g. for business metrics (https://github.com/scikit-learn/scikit-learn/issues/15651).

    In favor of enforcing it:

    • we don't want a coupling between the use of sample weights and the amount of regularization. Example: say one has a model without sample weights and wants to see whether applying sample weights (imbalanced dataset, sample uncertainty, etc.) improves it. Without this property it is hard to conclude: is the evaluation metric better because of the sample weights themselves, or simply because the model is now better regularized? One has to consider both factors simultaneously.
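As a rough illustration of property 1 (not the actual common test; the estimator, dataset and tolerances below are arbitrary choices), such a check can be written directly against a fitted estimator:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=50, random_state=0)

    # Give the first sample a zero weight...
    sw = np.ones(len(y))
    sw[0] = 0.0
    clf_weighted = LogisticRegression(tol=1e-10, max_iter=10_000).fit(X, y, sample_weight=sw)

    # ...which should be equivalent to dropping it from the training set.
    clf_dropped = LogisticRegression(tol=1e-10, max_iter=10_000).fit(X[1:], y[1:])

    np.testing.assert_allclose(clf_weighted.coef_, clf_dropped.coef_, rtol=1e-5, atol=1e-8)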

Whether we want/need consistency between the use of sample weights in metrics and in estimators is another question. I'm not convinced we do, since in most cases estimators don't care about the global scaling of the loss function, and these formulations are equivalent up to a scaling of the regularization parameter. So maybe using the L_1a equivalent expression in metrics could be fine.

In any case, we need to decide the behavior we want. This is a blocker for,

Note: Ridge actually seems to have different sample weight behavior for dense and sparse input, as reported in https://github.com/scikit-learn/scikit-learn/issues/15438

@agramfort's opinion on this can be found in https://github.com/scikit-learn/scikit-learn/issues/15651#issuecomment-555210612 (if I understood correctly).

Please correct me if I missed something (this could also use a more in-depth review of how this is handled in other libraries).

jnothman commented 5 years ago

I'd usually think that if we design the objective function such that the regularisation coefficient is sample size invariant, then it should be invariant to the scale of sample weights.

But I've not thoroughly understood the case for L2a or similar in regression evaluation metrics.

On the other hand, in some more discrete algorithms like DBSCAN we have opted for treating the absolute value of sample weights as important, since it needs to correspond to the min_samples parameter (which is then applied to discretely include or exclude a point from a neighborhood).
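For instance, on a toy example (illustrative only, not taken from the DBSCAN tests), rescaling the sample weights changes which points reach the min_samples threshold and therefore the clustering:

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.array([[0.0], [0.1], [5.0]])
    sw = np.array([1.0, 1.0, 3.0])

    # With min_samples=3, only the isolated point with weight 3 is a core point.
    print(DBSCAN(eps=0.5, min_samples=3).fit(X, sample_weight=sw).labels_)

    # Doubling the weights turns the first two points into core points as well,
    # so the clustering changes: the absolute value of the weights matters here.
    print(DBSCAN(eps=0.5, min_samples=3).fit(X, sample_weight=2 * sw).labels_)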

rth commented 5 years ago

I'd usually think that if we design the objective function such that the regularisation coefficient is sample size invariant, then it should be invariant to the scale of sample weights.

Yes, that is one of the conclusions I would like to reach with this issue (i.e. that L_2a should be avoided). We had a discussion about it in https://github.com/scikit-learn/scikit-learn/pull/14300#discussion_r329370790 and at the time I didn't have a clear justification. In general, we don't document well how sample weights are used in the loss functions of linear models, so it's hard to tell what is actually done (short of looking at the code, which can be solver dependent). It could be useful to go through each model and ensure that it actually uses L_2b (if the regularization is sample size invariant). Though glmnet seems to use L_2a.

in some more discrete algorithms like DBSCAN we have opted for treating the absolute value of sample weights as important, since it needs to correspond to the min_samples parameter

Thanks, good to know! Maybe we should document sample_weight better in general.

NicolasHug commented 5 years ago

Thanks for opening this.

  1. multiplying some sample weight by N is equivalent to repeating the corresponding samples N times

To me, this is the core definition of what sample weights are (it generalizes to N being a real number). The way we define the losses, and everything else, should derive from this definition, not the other way around. That's how I understand SW at least, but please correct me if I'm wrong.

Regarding the losses, we should distinguish what we say in the UG from what we actually compute. The point being that L1a, L2a and L2b (above) all give the same solution, right? So when we say in the UG "That algorithm optimizes the following loss function..." we could plug in any of these losses and still be correct. Though I understand how that's an issue when users start comparing the actual loss values.

rth commented 5 years ago

The point being that L1a, L2a and L2b (above) all give the same solution right? So when we say in the UG "That algorithm optimizes the following loss function..." we could just plug there any of these losses and still be correct.

They give the same solution only if the regularization is adjusted by some scaling factor (but I think few people would do that). For a given value of the regularization and of the sample weights, the solutions will not be the same. That's why not documenting it is not very transparent.
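As an illustration (a sketch assuming the dense Ridge objective is sum_i s_i (y_i - x_i . w)^2 + alpha * ||w||^2), the same coefficients are only recovered if alpha is rescaled together with the weights:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge

    X, y = make_regression(n_samples=100, n_features=5, noise=1.0, random_state=0)
    sw = np.random.RandomState(0).uniform(0.5, 2.0, size=len(y))
    k = 3.0

    # Scaling all sample weights by k...
    scaled_weights = Ridge(alpha=1.0, solver="cholesky").fit(X, y, sample_weight=k * sw)
    # ...matches the original problem only if alpha is rescaled to alpha / k.
    scaled_alpha = Ridge(alpha=1.0 / k, solver="cholesky").fit(X, y, sample_weight=sw)

    np.testing.assert_allclose(scaled_weights.coef_, scaled_alpha.coef_, rtol=1e-6)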

jeremiedbb commented 5 years ago

To me, property 2 is what people generally think sample weights are, so we should ensure that. Property 1 is property 2 with N=0.

I can't say much about property 3 from a use case point of view. Since I come from theoretical physics I like invariants, but that's not a proper argument :) As I understand it, some estimators use L1a (e.g. lasso) and others L2a (e.g. ridge) to have a good default for the regularization parameter, for instance one that is independent of the number of samples. Enforcing prop 3 would require defining the default value as α * n_samples.

NicolasHug commented 5 years ago

If we consider that 2. is the definition of what SW are, I'm not sure that 3. is something we can deduce for all estimators.

Some models are actually sensitive to this notion of "number of samples" or "sum of weights". Take for example the trees, which have a min_samples_leaf parameter (which BTW should probably become min_sample_weight_leaf if we support SW, @adrinjalali).

That parameter makes sure each leaf has at least N samples, or equivalently, if you pass SW, that the sum of their weights is at least N.

If you multiply your SW by 2, clearly you get a different tree, so that's not an invariant here. To get the same tree you would need to multiply min_sample_weight_leaf by 2 as well.

jeremiedbb commented 5 years ago

Or it means min_samples_leaf is not well defined and should probably be "makes sure each leaf has at least x% of the samples (or weights)". (mostly kidding but not entirely :) )

NicolasHug commented 5 years ago

That would indeed "fix" the issue I pointed out, but a percentage wouldn't be a practical parameter. The reason min_samples_leaf exists is for users to say "I want the values of the leaves (which are typically a mean, a median or a majority vote) to be computed with at least 20 values, so that they are more or less significant". A percentage would not be an intuitive way of dealing with that sort of constraint.

rth commented 5 years ago

If we consider that 2. is the definition of what SW are, I'm not sure that 3. is something we can deduce for all estimators.

I agree, Prop 3 should only apply to models that are invariant to repeating all samples. Say you take L_2a and repeat all samples twice: you get the same loss. By virtue of Prop 2, this should be equivalent to multiplying all sample weights of the original model by 2, i.e. the model should be sample_weight scale invariant (Prop 3).

Conversely, if a model is invariant to repeating all samples but doesn't verify Prop 3, it also cannot verify Prop 2 (which is the case for L_2a).

In the end you are probably right that only enforcing Prop 2 (and 1 as a special case) should be enough.

lorentzenchr commented 5 years ago

"makes sure each leaf has at least x% samples (or weights)"

Sample weights might have a "unit" and therefore a deeper meaning. An example is severity models for insurance claims (claim amount / number of claims), where the number of claims is the sample weight. Being able to specify min_samples_leaf in terms of an absolute number of claims makes a lot of sense, also from a statistical point of view. I'm sure this line of reasoning also works for other business cases, e.g. with the number of customers as sw, etc.

Property 2 is also for me the defining property of sw. For linear models, I would have wished for property 3 because it gives the loss a good interpretation (average loss per sample weight), comparability across datasets, and consistency with the metrics.

PS @jeremiedbb: I really like invariance laws; if not a proper argument, mentioning them is at least a beautiful one :smile:

jnothman commented 5 years ago

I think we have a general sense here that ordinarily the three invariances should hold.

However, I think we can find a bit more clarity about the exceptions to that rule, if we flip the question on its head and say: which parameters should be invariant to the scale of the weights (and which not)?

And then we have three classes:

  1. n_samples-sensitive: parameters which (should) disregard the weights and are affected by the number of samples (e.g. DecisionTree*.min_samples_leaf, and currently Ridge.alpha, MLP*.batch_size, *ShuffleSplit.test_size).
  2. scale-sensitive: parameters which (should) be affected by the scale of the weights but not the number of samples, such as DBSCAN.min_samples.
  3. scale-invariant: parameters which (should) be invariant to the scale of the weights (e.g. DecisionTree*.min_weight_fraction_leaf, PoissonRegressor.alpha).

I think we can see valid use cases for each approach for some of these parameters. I think there is scope to argue that we have made the wrong choices (or indeed that the current definition of loss in Ridge wrt alpha and weights is a bug), or that we have been inflexible to relevant use cases, and that we can redefine or recreate some parameters.

jnothman commented 5 years ago

I think since people have been thinking in terms of algorithms, it might take a bit of work to reframe your thoughts in terms of parameters. But testing for invariances makes this clearer. If we want alpha to be scale-invariant wrt weights, then we test for all three invariances. If we want it to be scale-sensitive, then we test that a modification to alpha corresponds to a modification to the weight scale. If we want it to be n_samples-sensitive, then we test that adjusting sample weights, the number of samples and alpha in correspondence achieves the same result (not that I can see clearly how to formulate this, which is probably a sign that alpha should not be n_samples-sensitive).
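A very rough sketch of what the first two framings could look like as generic checks (the factory argument and the alpha -> alpha / k compensation are assumptions, not existing scikit-learn test helpers; the checks assume a linear estimator exposing coef_):

    import numpy as np

    def check_scale_invariant(make_estimator, X, y, sw, k=3.0):
        # Scale-invariant parameter: rescaling all weights by k leaves the fit unchanged.
        a = make_estimator().fit(X, y, sample_weight=sw).coef_
        b = make_estimator().fit(X, y, sample_weight=k * sw).coef_
        np.testing.assert_allclose(a, b, rtol=1e-6)

    def check_scale_sensitive(make_estimator, X, y, sw, k=3.0):
        # Scale-sensitive parameter: rescaling the weights by k is compensated by a
        # corresponding change of the parameter (assumed here to be alpha -> alpha / k,
        # as for an unnormalized penalty term).
        a = make_estimator(alpha=1.0).fit(X, y, sample_weight=k * sw).coef_
        b = make_estimator(alpha=1.0 / k).fit(X, y, sample_weight=sw).coef_
        np.testing.assert_allclose(a, b, rtol=1e-6)

Under the unnormalized dense Ridge objective discussed above, check_scale_sensitive(Ridge, ...) should pass while check_scale_invariant(Ridge, ...) should not, which is exactly the distinction between the scale-sensitive and scale-invariant classes.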

NicolasHug commented 5 years ago

We should also discuss how subsampling the training set behaves w.r.t sample weights. This is relevant for any estimator that subsamples the training set, and particularly for random forests with bootstrapping.

Consider the definition of SW as 2. above:

multiplying some sample weight by N is equivalent to repeating the corresponding samples N times

And consider random forests, which draw subsamples of the training set with replacement. If the subsampling takes SW into account (the higher the weight, the more likely to be picked), then I think the SW should be ignored in the rest of the algorithm (i.e. the tree building), because the SW have already been accounted for in the subsampling procedure.

I'm not sure how we actually handle SW in the forests, but that's something to keep in mind IMO.

Also, things can get pretty nasty when the bootstrap subsampling takes class imbalance into account, cf https://github.com/scikit-learn/scikit-learn/pull/13227#issuecomment-556026847

orangecalculator commented 4 years ago

Hello. I found this discussion while I was using AdaBoost with LogisticRegression as the base classifier.

In the following lines of AdaBoostClassifier, sample_weight is normalized to sum to 1 on every iteration. I think this means that the sample_weight parameter must behave consistently across estimators, because this kind of generic programming should be possible.

# sklearn/ensemble/_weight_boosting.py, around line 161
            if iboost < self.n_estimators - 1:
                # Normalize
                sample_weight /= sample_weight_sum

I think this is natural. As a user, when I first saw this behavior, I thought it was a bug.

In contrast, if we use sample_weight=None, sample_weight is set to np.ones(n_samples).

# sklearn/linear_model/_logistic.py, around line 120
    if sample_weight is None:
        sample_weight = np.ones(n_samples)

So this means that when you plug in C=alpha into a standalone LogisticRegression, you should plug in C=n_samples * alpha for AdaBoostClassifier with LogisticRegression.
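A small sketch of that rescaling (illustrative only; it relies on the l2 penalty term not being rescaled with the sample weights): weights normalized to sum to 1, as after AdaBoost's normalization, combined with C multiplied by n_samples should reproduce the unweighted fit.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, random_state=0)
    n_samples = len(y)

    # Plain fit with C = 1 (implicit unit weights summing to n_samples)...
    plain = LogisticRegression(C=1.0, tol=1e-10, max_iter=10_000).fit(X, y)

    # ...versus weights normalized to sum to 1, with C rescaled by n_samples.
    normalized = LogisticRegression(C=float(n_samples), tol=1e-10, max_iter=10_000).fit(
        X, y, sample_weight=np.ones(n_samples) / n_samples
    )

    np.testing.assert_allclose(plain.coef_, normalized.coef_, rtol=1e-5, atol=1e-8)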

Edit: However, in some simulations using this adjustment, I observed rather slow convergence. Here is a link to a Google Colab notebook that reproduces this behavior: https://colab.research.google.com/drive/1JXEVs6fHx-_sUV7qREmMAVThxWoJIGFG

Note: LogisticRegression with penalty='l2' was scale insensitive up to small numerical errors.

rth commented 4 years ago

Just found out about https://github.com/scikit-learn/scikit-learn/issues/11316 by Joel on the same subject 1.5 years earlier ...

I think we can find a bit more clarity about the exceptions to that rule, if we flip the question on its head and say: which parameters should be invariant to the scale of the weights (and which not)?

I agree @jnothman , that sounds like a very nice generalization! It's a bit harder to enforce that systematically though.

glemaitre commented 2 years ago

I am coming back to this RFC because it recently came up in some new issues and discussions. Putting @agramfort in the loop. As I mentioned in those discussions, I think that we should settle on the losses of the linear models and how sample_weight enters them. Once we have that, we can design the expected tests and fix any bugs if there are any. In addition, we should then document it properly; I assume the potential caveats there would be related to the scale of the coefficients or to the regularization parameter values.

ogrisel commented 2 years ago

I think that we should enforce (and test) the intuitive behavior that duplicating a sample in the training set without weight should be equivalent to setting its weight to 2.0.

There is a test for linear models that was proposed in #15554.

ogrisel commented 2 months ago

I think the general expectation is that weighting training points with non-negative integer weights (including 0 and 1, but not only) should be equivalent to repeating those training points the same number of times (dropping them for 0 weights, no repetition for unit weights). This strategy is being used in a dedicated test in #29419 and #29442.

This is a general rule that includes the kind="ones" and kind="zeros" testing strategies currently implemented in check_sample_weights_invariance. I think this check should be updated to include the general case with a mix of random integer weights between 0 and 3 instead.
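A minimal sketch of that strategy (illustrative only, not the actual check_sample_weights_invariance implementation; the estimator and tolerances are arbitrary):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=30, random_state=0)
    sw = np.random.RandomState(0).randint(0, 4, size=len(y))  # weights in {0, 1, 2, 3}

    # Build the equivalent dataset where each sample is repeated sw[i] times
    # (0 repetitions amounts to dropping the sample).
    X_rep, y_rep = np.repeat(X, sw, axis=0), np.repeat(y, sw)

    weighted = LogisticRegression(tol=1e-10, max_iter=10_000).fit(X, y, sample_weight=sw)
    repeated = LogisticRegression(tol=1e-10, max_iter=10_000).fit(X_rep, y_rep)

    np.testing.assert_allclose(weighted.coef_, repeated.coef_, rtol=1e-5, atol=1e-8)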

ogrisel commented 2 months ago

Note that #16298 is a tracking meta-issue for estimators that have been detected as not correctly handling sample_weight.