stanfordmlgroup / ngboost

Natural Gradient Boosting for Probabilistic Prediction
Apache License 2.0

About returning probabilistic predictions for a single test data point? #197

Closed. david-MYS closed this issue 3 years ago

david-MYS commented 3 years ago

Hi. From your official tutorial (https://stanfordmlgroup.github.io/ngboost/1-useage.html), I understand that NGBoost can return the predicted mean and standard deviation for a single test data point.

I was also reading this post (https://towardsdatascience.com/ngboost-explained-comparison-to-lightgbm-and-xgboost-fda510903e53), which says that NGBoost can return probabilistic predictions for a single test data point. However, the code in that post raises this error (https://github.com/kyosek/NGBoost-experiments/issues/1).

So I would like your official response: is there a way to get probabilistic predictions for a single data point (for a regression task)? More specifically, for a given input, the model should make several predictions for that single data point. The model would then say, for example, that it is 93% confident in predicting value y and 58% confident in predicting value z. We could then plot, for that single data point, the confidence level of each prediction. Please advise whether there is a way to do this, or whether the post mentioned above is misleading. Thank you in advance for the help!

alejandroschuler commented 3 years ago

You certainly can do this for a single test point. I think the error you mention happens because the feature values for the test point are passed as a single vector of shape (n_features,). What you need is a 2-dimensional array of shape (1, n_features). For instance, say you have a whole test set of shape (n_observations, n_features) called X_test and you'd like the prediction just for observation k. To get the distributional prediction for that point, call ngb.pred_dist(X_test[k:k+1, :]), not ngb.pred_dist(X_test[k, :]). The result will be a length-1 list of ngboost distribution objects, so to get the object out of the list you need to index it, e.g. my_prediction_result[0]. Alternatively, it might be more natural to predict for the whole test set and then extract the prediction for your point, i.e. ngb.pred_dist(X_test)[k]. The result of that should be a single distribution object.
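The shape difference described above can be checked with NumPy alone. This is a minimal sketch: X_test here is a made-up random array and k is an arbitrary index, both hypothetical, and no model is fitted. The reshape variant is an extra option not mentioned in the thread.

```python
import numpy as np

# Hypothetical test set: 5 observations, 3 features (random values for illustration).
rng = np.random.default_rng(0)
X_test = rng.normal(size=(5, 3))
k = 2  # arbitrary observation index

row_1d = X_test[k, :]                  # integer index drops the row axis
row_2d = X_test[k:k + 1, :]            # slice keeps the row axis
row_2d_alt = X_test[k].reshape(1, -1)  # equivalent: reshape the vector to one row

print(row_1d.shape)  # (3,)   -> the shape that triggers the error
print(row_2d.shape)  # (1, 3) -> the shape to pass, e.g. ngb.pred_dist(row_2d)
```

Both 2-dimensional forms hold the same values, so either should work as the argument to pred_dist for a fitted model.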

The distribution object tells you everything you need about the distributional prediction for that point. To extract quantiles you can treat the ngboost distribution object as if it were a scipy distribution and use any of those methods (e.g. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html). I say quantiles because I think that's what you're looking for. The q% quantile is the outcome value that the model expects the observation to fall below q% of the time, given the features. It is not the "confidence" that the model has in the point prediction. See: https://towardsdatascience.com/interpreting-the-probabilistic-predictions-from-ngboost-868d6f3770b2
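To make the quantile interpretation concrete, here is a rough sketch that uses Python's standard-library statistics.NormalDist as a stand-in for a predicted Normal distribution, so it runs without ngboost or scipy. The mean and standard deviation are made-up values, not output from a real model. With a scipy-style distribution object, the corresponding methods would be ppf (for inv_cdf) and cdf.

```python
from statistics import NormalDist

# Hypothetical predicted distribution for one test point.
# With a fitted model these values would come from the predicted
# distribution object; here they are invented for illustration.
predicted = NormalDist(mu=10.0, sigma=2.0)

# Quantiles: outcome values the observation is expected to fall below
# 5%, 50% and 95% of the time (scipy-style equivalent: ppf).
lower = predicted.inv_cdf(0.05)
median = predicted.inv_cdf(0.50)
upper = predicted.inv_cdf(0.95)

# Central 90% prediction interval for this single point.
print(f"90% interval: [{lower:.2f}, {upper:.2f}], median {median:.2f}")

# Probability that the outcome falls below a threshold of interest
# (scipy-style equivalent: cdf).
threshold = 12.0
print(f"P(y <= {threshold}) = {predicted.cdf(threshold):.3f}")
```

Evaluating the density over a grid of outcome values (pdf in both NormalDist and scipy) would give the per-point curve the original question asks about, read as a predictive density and not as confidence in individual predictions.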