koaning / scikit-lego

Extra blocks for scikit-learn pipelines.
https://koaning.github.io/scikit-lego/
MIT License

[BUG] LowessRegressor: ZeroDivisionError: Weights sum to zero, can't be normalized #707

Closed · srt19170 closed 1 week ago

srt19170 commented 3 weeks ago

Using `LowessRegressor`, I received a "ZeroDivisionError: Weights sum to zero, can't be normalized" error. A little debugging revealed that the problem is in `_calc_wts`:

    def _calc_wts(self, x_i):
        """Calculate the weights for a single point `x_i` using the training data `self.X_` and the parameters
        `self.sigma` and `self.span`. The weights are calculated as `np.exp(-(distances**2) / self.sigma)`,
        where distances are the distances between `x_i` and all the training samples.

        If `self.span` is not None, then the weights are multiplied by
        `(distances <= np.quantile(distances, q=self.span))`.
        """
        distances = np.linalg.norm(self.X_ - x_i, axis=1)
        weights = np.exp(-(distances**2) / self.sigma)
        if self.span:
            weights = weights * (distances <= np.quantile(distances, q=self.span))
        return weights

If sigma is too small relative to the distances, the scaled values -(distances**2) / self.sigma become so negative that np.exp() underflows to zero for every weight, and an exception is raised when the weights are applied.
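For example, with distances in the range of my data (values illustrative):

import numpy as np

# When sigma is tiny relative to the squared distances, np.exp underflows
# to 0.0 for every weight and np.average can no longer normalize them.
distances = np.array([10.0, 20.0, 35.0])
weights = np.exp(-(distances**2) / 0.1)
print(weights)  # [0. 0. 0.]
np.average(np.array([1.0, 2.0, 3.0]), weights=weights)
# ZeroDivisionError: Weights sum to zero, can't be normalized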

This seems like the wrong behavior. Perhaps the weights should be clipped to some minimum non-zero weight so that the predictor can still be used, perhaps with a warning message to explain the problem?

FBruzzesi commented 3 weeks ago

Hey @srt19170 , thanks for spotting and raising this issue.

> This seems like the wrong behavior.

Not sure I would call it wrong 😇

> perhaps with a warning message to explain the problem?

Surely a better error message can be provided.

> Perhaps the weights should be clipped to some minimum non-zero weight so that the predictor can still be used

My understanding is that this issue occurs only if all the -(distances**2) / self.sigma values blow up to minus infinity, so that all the weights are zero. Clipping all elements to any constant value would amount to computing the plain average of y in

results = np.stack([np.average(self.y_, weights=self._calc_wts(x_i=x_i)) for x_i in X])

I am not too comfortable with doing that implicitly for the user. I would still tend to raise an error that explains the issue.
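To illustrate: with any constant weights, np.average degenerates to the unweighted mean, silently discarding the distance information (sketch with made-up values):

import numpy as np

# Clipping every underflowed weight to the same constant makes the weighted
# average identical to the plain mean of y, regardless of the distances.
y = np.array([1.0, 5.0, 9.0])
print(np.average(y, weights=np.full_like(y, 1e-10)))  # 5.0, i.e. y.mean()

That is exactly the kind of silent fallback I would rather not introduce.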

FBruzzesi commented 3 weeks ago

RFC on the following error message:

try:
    results = np.stack([np.average(self.y_, weights=self._calc_wts(x_i=x_i)) for x_i in X])
except ZeroDivisionError:
    msg = (
        "Weights, resulting from `np.exp(-(distances**2) / self.sigma)`, are all zero. "
        "Try increasing the value of `sigma` or normalizing the input data.\n\n"
        "`distances` refers to the distances between each sample `x_i` and all the "
        "training samples."
    )
    raise ValueError(msg)

koaning commented 3 weeks ago

+1 on the better error message.

@srt19170 is there anything you can share about your dataset that might help us understand the kinds of datasets for which this issue comes up?

srt19170 commented 3 weeks ago

My data has 75 features, and I'm using StandardScaler to condition the data. In the initial test set there are 2905 elements, and it looks like the distances end up in the range 8-35. On my first attempt I just copied the documentation example and used sigma=0.1, which in this case is right at the edge of where all the weights turn to zero.

I admit my math is weak -- is the norm being calculated more likely to yield large numbers when the number of features is large?
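A quick numerical sketch (with synthetic standardized data, so the numbers are only illustrative) suggests the answer is yes: the expected squared distance between two standardized points grows linearly with the number of features, so the distance itself grows roughly like the square root of the feature count:

import numpy as np

# For independent standardized features, E[||x - y||**2] = 2 * d, so the
# typical distance between two samples grows like sqrt(2 * d).
rng = np.random.default_rng(0)
for d in (2, 75, 500):
    X = rng.standard_normal((1000, d))
    dists = np.linalg.norm(X - X[0], axis=1)[1:]
    print(d, round(dists.mean(), 2), round(np.sqrt(2 * d), 2))

With 75 features that gives typical distances around sqrt(150) ≈ 12.2, consistent with the 8-35 range above.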

At any rate, thanks for considering the issue; the proposed error message would have helped me figure it out more quickly.