heal-research / pyoperon

Python bindings and scikit-learn interface for the Operon library for symbolic regression.
MIT License
38 stars 11 forks

Feature request: Sample weights #17

Open romanovzky opened 3 months ago

romanovzky commented 3 months ago

In this work (preprint), we had to resample the dataset so that the target variable had a flat distribution, to prevent pyoperon from learning only the most common values. A better alternative, which pyoperon does not currently support, would be to use sample_weights when computing the loss function.

For example, given X, y, and w (sample weights), the loss function would be computed as the weighted average of the per-instance losses: $$loss_{total} = \frac{1}{\sum_i w_i} \sum_i w_i \times loss(X_i, y_i)$$ This is already supported by many scikit-learn estimators.
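A minimal NumPy sketch of the weighted loss above (the function and variable names here are illustrative, not part of pyoperon's API):

```python
import numpy as np

def weighted_loss(y_true, y_pred, w, loss=lambda t, p: (t - p) ** 2):
    """Weighted average of per-sample losses: sum_i w_i * loss_i / sum_i w_i."""
    per_sample = loss(y_true, y_pred)
    return np.sum(w * per_sample) / np.sum(w)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 4.0])
w = np.array([1.0, 1.0, 2.0])  # third sample counts double
# Unweighted MSE is (0 + 0 + 1) / 3; weighted: (0 + 0 + 2*1) / 4 = 0.5
```

With uniform weights this reduces to the plain mean of the per-sample losses.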

The use cases are varied: from specifying the focus of a regression task, to using datasets that were sampled in a way that attaches sampling weights to the data (this is common practice in the social sciences and in simulation-based inference in STEM).

gkronber commented 3 months ago

Yes, this feature is useful in many applications. However, I would recommend handling this not via sample weights for a loss function but via likelihood functions, which are in any case being prepared for integration into Operon. For squared-error loss, your case is equivalent to a Gaussian likelihood with sample-specific error variance. The negative log likelihood is:

$$ nll = 1/2 \sum_i^n \log (2 \pi \sigma_i^2) + 1/2 \sum_i^n \sigma_i^{-2}(y_i - f(x_i, \theta))^2$$

The $\sigma_i$ would be proportional to your weights. If they are constant the first term in the nll can be ignored.
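A small sketch of this NLL, written to match the formula above (names are illustrative). The log term does not depend on the model output, so for fitting, only the weighted squared-error term matters, with $w_i \propto \sigma_i^{-2}$:

```python
import numpy as np

def gaussian_nll(y, y_hat, sigma):
    """Gaussian negative log likelihood with sample-specific std sigma_i."""
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2)) \
         + 0.5 * np.sum((y - y_hat) ** 2 / sigma**2)

y     = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.1, 1.9, 3.2])
sigma = np.array([0.1, 0.1, 0.2])  # per-sample standard deviations

# The data-dependent part is a weighted squared error with w_i = 1/sigma_i^2;
# the first (log) term does not depend on y_hat, so it is constant during fitting.
```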

Handling this issue via likelihoods, with a few commonly used ones predefined in Operon, is more generally applicable; weights for a loss function are a special case.

romanovzky commented 3 months ago

Hmmm I think these are different things.

The likelihood method you mention applies to data with uncertainties associated with the target values; in that case the $\sigma_i$ correspond to the uncertainty (e.g. experimental error) of the measured $y_i$. This is what is used to fit an explicit, domain-knowledge-driven model via the so-called $\chi^2$ (chi-squared) minimisation.

What I suggest is to apply these weights at the level of the sample statistics. It is common for sampled data to require statistical re-weighting. A good example is social data, where sampling is difficult and a random sample must then be re-weighted to better approximate known global statistics (apparently this is called post-stratification). Another example happens a lot in physics when sampling data with an MCMC biased towards the tails of the distribution; the resulting sample is then re-weighted back to the "true" distribution. In both cases, one uses the weights to derive the statistics of the dataset through weighted averages, standard deviations, etc.
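A toy NumPy example of the post-stratification idea (numbers invented for illustration): a survey undersamples group A, so weights correct the sample mean back toward the population statistic.

```python
import numpy as np

# Group A (value 10) was undersampled 2:1 relative to group B (value 20),
# so each A respondent carries weight 2 to restore the 50/50 population split.
values  = np.array([10.0, 10.0, 20.0, 20.0, 20.0, 20.0])
weights = np.array([2.0, 2.0, 1.0, 1.0, 1.0, 1.0])

naive_mean    = values.mean()                               # biased toward B
weighted_mean = np.sum(weights * values) / np.sum(weights)  # re-weighted estimate
# The weighted mean recovers the population mean of 15.0.
```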

As such, I believe the first case (your suggestion) is a different type of loss function, say a "chi-squared MSE" (or non-constant-variance MSE), whereas the second case (my suggestion) is a statistical re-weighting of virtually any loss function.

gkronber commented 3 months ago

I was referring to maximum likelihood estimation of parameters. I see your point that sometimes we would not call the objective function a likelihood even when its mathematical structure exactly matches a well-known likelihood.

romanovzky commented 3 months ago

Oh, completely. There are many loss functions in the ML literature that do not have a likelihood interpretation.

My request, however, concerns how the loss is computed to train the ML model over a data sample, i.e. at the distributional level of the data points. For example, when pyoperon computes the MSE over the dataset, it performs

$$MSE = \frac{1}{N}\sum_i^N MSE(X_i, y_i) = \frac{1}{N}\sum_i^N (\hat y(X_i) - y_i)^2$$

which implies that all data points, $(X_i, y_i)$, are equally likely with probability $p_i=1/N$, i.e. the MSE is the expected value of the MSE over the distribution of the data, $(X_i, y_i) \sim \mathcal{D}$,

$$MSE = \frac{1}{N}\sum_i^N (\hat y(X_i) - y_i)^2 = \sum_i^N p_i (\hat y(X_i) - y_i)^2 = \mathbb{E}_{\mathcal{D}}[MSE(X_i, y_i)]$$

Now, I might have a data generating process, i.e. a sampling procedure, that does not produce data points with the same probability, but instead $p_i = w_i/\sum_i w_i$, such that

$$MSE = \mathbb{E}_{\mathcal{D}}[MSE(X_i, y_i)] = \frac{1}{\sum_i w_i} \sum_i^N w_i \, MSE(X_i, y_i)$$

where $w_i$ are (non-negative) weights.
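A quick sketch of this equivalence (names illustrative): for integer weights, the weighted MSE matches the plain MSE over a dataset in which each point is replicated $w_i$ times, which is exactly the resampling workaround described above.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_pred - y_true) ** 2)

def weighted_mse(y_true, y_pred, w):
    return np.sum(w * (y_pred - y_true) ** 2) / np.sum(w)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
w      = np.array([1, 3, 2])  # integer weights, for the resampling comparison

# Replicating each point w_i times gives the same value as weighting directly:
idx = np.repeat(np.arange(len(w)), w)
assert np.isclose(mse(y_true[idx], y_pred[idx]), weighted_mse(y_true, y_pred, w))
```

Weighting avoids the memory and runtime cost of actually replicating the data, and also works for non-integer weights.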

This functionality exists out of the box for scikit-learn estimators, from which SymbolicRegressor inherits (see here, here and here), as well as in other common packages such as Keras, XGBoost, LightGBM, etc.
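For reference, this is what the standard scikit-learn interface looks like; the data here are invented, but `sample_weight` is the real `fit` parameter that the request asks pyoperon's estimator to support as well:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 10.0])  # last point is an outlier
w = np.array([1.0, 1.0, 1.0, 0.0])   # zero weight excludes it from the fit

model = LinearRegression().fit(X, y, sample_weight=w)
# With the outlier down-weighted to zero, the fit recovers y = x.
```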

If pyoperon were purely Python I'd propose an MR, but since it's a wrapper around Operon and my C++ is rusty, I don't have the time to delve into this at the moment... This was the main shortcoming of pyoperon in our paper; sample weights would have made our study a lot easier.

foolnotion commented 3 months ago

There's already a plan in place to factor out the metrics to a different library: https://heal-research.heuristiclab.com/vstat/group__Metrics.html

Do the metrics in there satisfy your use case?

romanovzky commented 3 months ago

Yes! This is exactly the requested functionality!