romanovzky opened this issue 3 months ago
Yes, this feature is useful in many applications. However, I would recommend handling this not via sample weights for a loss function but via likelihood functions, which are planned for integration into Operon anyway. For squared-error loss, your case is equivalent to a Gaussian likelihood with sample-specific error variance. The negative log likelihood is:
$$\mathrm{nll} = \frac{1}{2} \sum_i^n \log (2 \pi \sigma_i^2) + \frac{1}{2} \sum_i^n \sigma_i^{-2} (y_i - f(x_i, \theta))^2$$
The $\sigma_i$ would be proportional to your weights. If they are constant, the first term in the nll can be ignored.
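As a minimal sketch of the formula above (assuming NumPy arrays for the targets, predictions, and per-sample standard deviations; this is not Operon's actual likelihood API):

```python
import numpy as np

def gaussian_nll(y, y_pred, sigma):
    """Negative log likelihood of a Gaussian with sample-specific error variance:
    nll = 1/2 * sum(log(2*pi*sigma_i^2)) + 1/2 * sum((y_i - f_i)^2 / sigma_i^2)
    """
    y, y_pred, sigma = map(np.asarray, (y, y_pred, sigma))
    const_term = 0.5 * np.sum(np.log(2.0 * np.pi * sigma**2))
    data_term = 0.5 * np.sum(((y - y_pred) / sigma)**2)
    return const_term + data_term
```

When all $\sigma_i$ are equal, `const_term` is a constant offset and minimising the nll is equivalent to minimising the squared-error loss.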
Handling this issue via likelihoods, with a few commonly used ones predefined in Operon, is more generally applicable. Weights for a loss function are a special case.
Hmmm I think these are different things.
The likelihood method that you are mentioning applies to data that have uncertainty associated with the target values; in that case the $\sigma_i$ corresponds to the uncertainty (e.g. experimental error) associated with the measured $y_i$. This is what is used to fit an (explicit and domain knowledge-driven) model with the so-called $\chi^2$ (chi-squared) minimisation.
What I suggest is to apply these weights at a sample statistical level. It is common for sampled data to need statistical re-weighting. A good example is social data, where sampling is difficult and a random sample then needs to be re-weighted to better approximate known global statistics (apparently this is called post-stratification). Another example happens a lot in physics when sampling data with an MCMC biased towards the tails of the distribution; the final sample is then re-weighted back to the "true" distribution. In both cases, one uses weights to derive the statistics of the dataset through weighted averages, standard deviations, etc.
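As a toy illustration of such re-weighted statistics (the values and weights here are hypothetical, standing in for a sample drawn from a biased sampling scheme):

```python
import numpy as np

# Hypothetical biased sample with importance weights that map it
# back to the target ("true") distribution.
x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([4.0, 2.0, 1.0, 1.0])

# Normalised weights p_i = w_i / sum_j(w_j), so each point carries
# probability p_i instead of the uniform 1/N.
p = w / w.sum()
weighted_mean = np.sum(p * x)
weighted_var = np.sum(p * (x - weighted_mean)**2)
```

The same normalised weights $p_i$ are exactly what would multiply the per-sample losses in a re-weighted loss function.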
As such, I believe that the first case (your suggestion) is a different type of loss function, say "chi-squared MSE" (or non-constant std MSE), whereas the second case (my suggestion) is a statistical re-weighting of virtually any loss function.
I was referring to maximum likelihood estimation of parameters. I see your point that sometimes we would not call the objective function a likelihood even when the mathematical structure matches exactly a well-known likelihood.
Oh, completely. There are many loss functions in the ML literature that do not have a likelihood interpretation.
My request, however, is at the level of how the loss is computed to train the ML model over a data sample, i.e. at the distributional level of the data points. For example, when `pyoperon` computes the MSE over the dataset, it performs
$$MSE = \frac{1}{N}\sum_i^N MSE(X_i, y_i) = \frac{1}{N}\sum_i^N (\hat y(X_i) - y_i)^2$$
which implies that all data points, $(X_i, y_i)$, are equally likely with probability $p_i=1/N$, i.e. the MSE is the expected value of the MSE over the distribution of the data, $(X_i, y_i) \sim \mathcal{D}$,
$$MSE = \frac{1}{N}\sum_i^N (\hat y(X_i) - y_i)^2 = \sum_i^N p_i (\hat y(X_i) - y_i)^2 = \mathbb{E}_{\mathcal{D}}[MSE(X_i, y_i)]$$
Now, I might have a data generating process, i.e. a sampling procedure, that does not produce data points with the same probability, but instead $p_i = w_i/\sum_i w_i$, such that
$$MSE = \mathbb{E}_{\mathcal{D}}[MSE(X_i, y_i)] = \frac{1}{\sum_i w_i} \sum_i^N w_i \, MSE(X_i, y_i)$$
where $w_i$ are (non-negative) weights.
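The weighted MSE above is a one-liner; a sketch (assuming NumPy arrays, not pyoperon's internal metric code):

```python
import numpy as np

def weighted_mse(y_true, y_pred, w):
    """Weighted MSE: sum_i(w_i * (y_pred_i - y_true_i)^2) / sum_i(w_i),
    where w_i are non-negative sample weights."""
    y_true, y_pred, w = map(np.asarray, (y_true, y_pred, w))
    return np.sum(w * (y_pred - y_true)**2) / np.sum(w)
```

With uniform weights ($w_i = 1$) this reduces exactly to the plain MSE above; zeroing a weight removes that point from the loss entirely.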
This functionality exists out-of-the-box for `scikit-learn` estimators, from which `SymbolicRegressor` inherits (see here, here and here), as well as in other common packages such as Keras, XGBoost, LightGBM, etc.
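For comparison, the scikit-learn convention referenced above looks like this (using `LinearRegression` as a stand-in estimator; the request is for `SymbolicRegressor` to accept the same keyword):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=100)
w = rng.uniform(0.1, 1.0, size=100)   # non-negative per-sample weights

model = LinearRegression()
# sample_weight is passed at fit time, scaling each sample's
# contribution to the squared-error loss.
model.fit(X, y, sample_weight=w)
```

Because the weights only rescale each sample's loss term, any estimator whose objective is a sum over samples can support this keyword without changing its search procedure.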
If `pyoperon` were purely Python I'd propose an MR, but since it's a wrapper around `operon` and my C++ is rusty, I don't have the time to delve into this at the moment... This was the main shortcoming of `pyoperon` in our paper; sample weights would have made our study a lot easier.
There's already a plan in place to factor out the metrics to a different library: https://heal-research.heuristiclab.com/vstat/group__Metrics.html
Do the metrics in there satisfy your use case?
Yes! This is exactly the requested functionality!
In this work (preprint), we had to resample the dataset so that the target variable had a flat distribution, to prevent `pyoperon` from only learning about the most common values. A better alternative, which `pyoperon` does not currently support, is to use `sample_weights` when computing the loss function. For example, given `X`, `y`, and `w` (sample weights), the loss function is computed as the weighted average of the losses of each data instance:
$$loss_{total} = \frac{1}{\sum_i w_i} \sum_i w_i \times loss(X_i, y_i)$$
This is already supported by many `scikit-learn` estimators. The use cases are varied: from specifying the focus of a regression task, to using datasets that have been sampled in a way that the data come with sampling weights (this is common practice in the social sciences and in simulation-based inference in STEM).