stanfordmlgroup / ngboost

Natural Gradient Boosting for Probabilistic Prediction
Apache License 2.0

Question on LogScore values on training and validation sets #247

Closed ivan-marroquin closed 3 years ago

ivan-marroquin commented 3 years ago

Many thanks for this great package!

I have questions regarding the values computed for LogScore and how the best estimator is selected based on LogScore. I used this Python code:

###########################

import numpy as np
import ngboost
from sklearn.datasets import load_boston
from sklearn.metrics import median_absolute_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

if __name__ == '__main__':
    x, y = load_boston(return_X_y=True)

    # standardize the independent and dependent features together
    combined = np.concatenate((x, y.reshape(-1, 1)), axis=1)
    mean_scaler = np.mean(combined, axis=0)
    std_scaler = np.std(combined, axis=0)
    combined = (combined - mean_scaler) / std_scaler

    x = combined[:, :-1].astype(np.float32).copy()
    y = combined[:, -1].astype(np.float32).copy()

    x_train, x_validation, y_train, y_validation = train_test_split(x, y, test_size=0.4, random_state=1969)

    ngb_2 = ngboost.NGBRegressor(Dist=ngboost.distns.Normal, Score=ngboost.scores.LogScore,
                                 natural_gradient=True, n_estimators=300, learning_rate=0.01,
                                 verbose=False, random_state=1969)

    ngb_2.fit(x_train, y_train, X_val=x_validation, Y_val=y_validation)

    # generate predicted values
    y_preds_2 = ngb_2.predict(x_validation)
    # scale back to the original representation
    y_preds_2 = (y_preds_2 * std_scaler[-1]) + mean_scaler[-1]
    # get the predicted distributions (mean and standard deviation)
    y_pred_dists_2 = ngb_2.pred_dist(x_validation)
    # compute median absolute error
    median_abs_error_2 = median_absolute_error(y_validation, y_preds_2)
    print('median absolute error ngboost ', median_abs_error_2)
    # compute negative log likelihood
    nll_2 = -y_pred_dists_2.logpdf(y_validation).mean()
    print('negative log likelihood ngboost ', nll_2)

    # use staged prediction to compare the best validation estimator against using all estimators
    y_preds_2_best_itr = ngb_2.staged_predict(x_validation, max_iter=ngb_2.best_val_loss_itr)
    # scale back to the original representation
    y_preds_2_best_itr = (y_preds_2_best_itr[-1] * std_scaler[-1]) + mean_scaler[-1]
    # get the predicted distributions (mean and standard deviation)
    y_pred_dists_2_best_itr = ngb_2.staged_pred_dist(x_validation, max_iter=ngb_2.best_val_loss_itr)
    # compute median absolute error
    median_abs_error_2_best_itr = median_absolute_error(y_validation, y_preds_2_best_itr)
    print('median absolute error ngboost best itr ', median_abs_error_2_best_itr)
    # compute negative log likelihood
    nll_2_best_itr = -y_pred_dists_2_best_itr[-1].logpdf(y_validation).mean()
    print('negative log likelihood ngboost best itr ', nll_2_best_itr)

    # plot the dependent variable and its corresponding predicted data
    fig, ax = plt.subplots(nrows=1, ncols=2)
    ax[0].plot(range(0, len(y_validation)), y_validation, '-k')
    ax[0].plot(range(0, len(y_validation)), y_preds_2, '--r')
    ax[0].set_title("NGBOOST \n MedianAbsError {:.4f}".format(median_abs_error_2))

    ax[1].plot(range(0, len(y_validation)), y_validation, '-k')
    ax[1].plot(range(0, len(y_validation)), y_preds_2_best_itr, '--r')
    ax[1].set_title("NGBOOST best itr: \n MedianAbsError {:.4f}".format(median_abs_error_2_best_itr))

    plt.savefig("comparison_xgboost_ngboost.png", dpi=600)
    plt.show()

###########################

If I print the best iteration and its associated LogScore on the validation data, I get 7 and 265.8766.

However, if I print all LogScore values on the training data, they go from 1.4394 down to 0.0051 and then continue from -0.0004 down to -0.4221. What is the meaning of negative LogScore outputs? Why do I get both positive and negative values?

On the other hand, if I print all LogScore values on the validation data, they go from 272.3162 up to 7855.0612 (with the minimum at the 8th position). Such large values imply that the probability is very low. What does this mean?

Since the natural log is used to improve the results with every newly added estimator, why do the LogScore values behave like this?

In the last part of the Python code, I generate a plot comparing the predicted values using all estimators against the predicted values obtained with only the best iteration. According to this plot, the best solution is the one using all estimators. I believe this, in turn, defeats the way the best iteration is estimated.

Looking forward to your comments and clarifications.

Kind regards, Ivan

ivan-marroquin commented 3 years ago

Any comments/suggestions?

alejandroschuler commented 3 years ago

hey @ivan-marroquin, I'm not totally sure about this and I don't have time to test your example, but I think you're just observing the difference between validation and training error: the training error will always be better and will keep improving as the model complexity increases, whereas the validation error decreases and then starts increasing. This is due to overfitting.

Negative values of the log likelihood (or of the negative log likelihood) are to be expected. The likelihood can be any positive number, so its log will be negative when that number is <1 and positive when it is >1.

ivan-marroquin commented 3 years ago

Hi @alejandroschuler

Thanks for your reply. I think that there is a much deeper issue. As you pointed out, the LogScore is the log likelihood of the probability.

Regarding the validation data: the measured probability is very close to 0, which explains the relatively high LogScore values.

Regarding the training data: the measured probability (up to roughly halfway through the number of estimators) varies between 0 and 1, but then it becomes higher than 1. Of course, this explains the obtained LogScore values.

I used your sample script, and these results don't make sense to me. How is it normal to expect a probability so close to 0, or even to have probabilities higher than 1? I can't even plot the LogScore values from validation and training to determine whether I have an overfitted model.

Hopefully you will find some time to take a closer look at this issue.

Ivan

alejandroschuler commented 3 years ago

Good question. The likelihood is just a view of the density function from the perspective of the parameters. The density for a discrete variable is the same as a probability. But for a continuous variable the density is not a probability. Consider a few examples: (a) X ~ N(0,1). What is the probability that X=2? It is 0. In fact, the probability that X=x for any x is... 0! (b) consider Y ~ Uniform(0, 1/2). What is the value of the density function at y=1/4? It's 2. In fact, the density at Y=y is 2 for all y between 0 and 1/2. The point is that the density (and therefore likelihood) is unintuitively not the same thing as a probability. The correct interpretation is as a notion of relative probability.
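As a quick sanity check, here is a minimal scipy.stats sketch of the two examples above (the numbers are just the Normal and Uniform densities evaluated at those points):

from scipy.stats import norm, uniform

# (a) X ~ N(0, 1): the density at x=2 is a small positive number, but P(X == 2) itself is 0
print(norm(0, 1).pdf(2))                     # ~0.054 (a density, not a probability)

# (b) Y ~ Uniform(0, 1/2): the density is 2 everywhere on (0, 1/2), i.e. larger than 1
print(uniform(loc=0, scale=0.5).pdf(0.25))   # 2.0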

None of that is to say that there isn't something off with the model selection across boosting stages! But I can't see that from your example. Can you plot the training set negative log-likelihood AND the validation set negative log-likelihood as a function of the number of trees (e.g. [example plot])? And then also report the value of best_val_loss_itr for the fitted model?
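A minimal sketch of how such curves could be produced, assuming the fitted ngb_2 model and the train/validation arrays from the script in the original post (staged_pred_dist returns one predicted distribution per boosting stage):

import matplotlib.pyplot as plt

# negative log likelihood at each boosting stage, on train and validation data
train_nll = [-dist.logpdf(y_train).mean() for dist in ngb_2.staged_pred_dist(x_train)]
val_nll = [-dist.logpdf(y_validation).mean() for dist in ngb_2.staged_pred_dist(x_validation)]

plt.plot(train_nll, label='train NLL')
plt.plot(val_nll, label='validation NLL')
plt.axvline(ngb_2.best_val_loss_itr, color='gray', linestyle='--', label='best_val_loss_itr')
plt.xlabel('boosting iteration')
plt.ylabel('negative log likelihood')
plt.legend()
plt.show()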

ivan-marroquin commented 3 years ago

Hi @alejandroschuler

Thanks for reminding me about probability and continuous data. As per your request, I am sending you two graphics: 1) both train and validation in the same plot for all estimators, 2) train and validation on separate plots for all estimators.

The reported best_val_loss_itr is 8, with a LogScore on the validation data of 265.547050.

Hope this helps,

Ivan

train_validation_score_curves.zip

alejandroschuler commented 3 years ago


@ivan-marroquin this looks like a classic case of overfitting, so I think everything is working exactly as intended in terms of model selection. On the other hand, I don't know why the generalization is so bad on this data. I would try using an even smaller learning rate (maybe 0.0001?) and see if that changes the curves. It does certainly seem odd to me that the scale of the two curves is so different: for the model with 0 or 1 trees they should be on par with each other because basically nothing has been learned, so that's definitely suspicious... One thing I noticed above is that you're comparing the predictions in the original scale to the rescaled data in the calculation of the median absolute error, which isn't right, but that shouldn't affect the way you're calculating the negative log likelihood.
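For reference, a minimal sketch of a consistent-scale comparison, reusing the variable names from the script in the original post (assumed), is to bring y_validation back to the original scale before scoring:

# compare predictions and targets on the same (original) scale
y_validation_orig = (y_validation * std_scaler[-1]) + mean_scaler[-1]
median_abs_error_2 = median_absolute_error(y_validation_orig, y_preds_2)
print('median absolute error ngboost ', median_abs_error_2)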

ivan-marroquin commented 3 years ago

I found the source of the problem with LogScore.

Note that I used the same script and dataset shown on ngboost's GitHub main page. The big difference was that I decided to normalize both the dependent and independent features prior to running ngboost.

The fix consists of either not normalizing the data (as shown on the GitHub page) or normalizing only the independent features. In the attached figure, I opted for the second option. Now it is much clearer that there is no need for more than 200 estimators to get a good regression model.
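A minimal sketch of that second option, assuming x and y are the raw arrays from load_boston (i.e. before the combined scaling in the earlier script):

# standardize only the independent features; leave y on its original scale
mean_scaler_x = np.mean(x, axis=0)
std_scaler_x = np.std(x, axis=0)
x_scaled = ((x - mean_scaler_x) / std_scaler_x).astype(np.float32)

x_train, x_validation, y_train, y_validation = train_test_split(x_scaled, y, test_size=0.4, random_state=1969)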

However, I believe the issue still remains: why does the LogScore "break down" when all features are normalized?

Ivan

train_validation_score_curves (2).zip

MikeOMa commented 3 years ago

Hey,

However, I believe the issue still remains: why does the LogScore "break down" when all features are normalized?

I was quite surprised when I read this, so I decided to look a little further; I agree it would be worrying. My understanding is that, in theory, regression trees should be completely independent of the scaling of X. NGBoost only uses X through the base learner, which is a regression tree in this case, so I would expect the results not to change with the scaling of X.

The actual magnitude of the log score can change with the scaling of Y, though. For the Normal distribution and log score the difference should be log(y_std_scaler) (I think), where y_std_scaler is the value the centred y is divided by.

I modified your code using the 4 combinations of (x, x_scaled) with (y, y_scaled) and plotted the results below.

ScoreCurve

I put the variance of the training data in the plot titles so we can see which ones are standardized. In this example, log(std_scaler) is 2.217, which seems to match up with the gap in the plot.
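A quick numeric sketch of that relationship, using synthetic stand-in target values under a Normal fit (not the actual Boston data), shows that scaling y by 1/s shifts the average negative log likelihood by exactly log(s):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(22.5, 9.2, size=1000)     # stand-in target on its "original" scale
s = y.std()
y_scaled = (y - y.mean()) / s

# NLL of y under a Normal fitted on the original scale vs. NLL of the scaled y under N(0, 1)
nll_raw = -norm(y.mean(), s).logpdf(y).mean()
nll_scaled = -norm(0.0, 1.0).logpdf(y_scaled).mean()
print(nll_raw - nll_scaled, np.log(s))   # the two numbers agree: the gap is log(s)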

Are these similar to the results you were getting? I can send on the code I used for this if you like!

ivan-marroquin commented 3 years ago

Hi @MikeOMa

Thanks for the follow-up. Although I am no longer able to reproduce the initial problem that I reported, I am able to reproduce the issue you noticed when both x and y are normalized. For some reason, the LogScore on the training data set becomes negative.

I attached my result with the companion script.

Ivan

problematic_ngboost.zip

MikeOMa commented 3 years ago

Hey,

I think having a negative LogScore is perfectly expected. The value the LogScore prints is the average negative log probability density function (-logpdf) of the data. Then, as @alejandroschuler said:

Negative values of the log likelihood (or of the negative log likelihood) are to be expected. The likelihood can be any positive number, so its log will be negative when that number is <1 and positive when it is >1.

Here's an example of both a positive and a negative logpdf evaluation:

from scipy.stats import norm

x = 0

# density of N(0, 1) at x=0 is ~0.399 (<1), so its log is negative
d = norm(0, 1)
print(d.pdf(x))      # 0.3989...
print(d.logpdf(x))   # -0.9189...

# density of N(0, 0.01) at x=0 is ~39.9 (>1), so its log is positive
d = norm(0, 0.01)
print(d.pdf(x))      # 39.894...
print(d.logpdf(x))   # 3.686...

Mike

EDIT: Oops I said pdf instead of logpdf in that sentence before the code.

ivan-marroquin commented 3 years ago

Hi @MikeOMa

Thanks for the clarification. If I understood this correctly, there are two computed pdfs: one on the training data and another on the validation data.

For future reference, I found this discussion on Stack Exchange that can be useful: https://stats.stackexchange.com/questions/140463/can-the-likelihood-take-values-outside-of-the-range-0-1