stanfordmlgroup / ngboost

Natural Gradient Boosting for Probabilistic Prediction
Apache License 2.0

NGBoost fails to produce accurate predictions #135

Closed D3F4LT4ST closed 4 years ago

D3F4LT4ST commented 4 years ago

Hello,

I am relatively new to ML and recently came across this library while looking for models suited to probabilistic forecasting.

I have experimented with NGBoost on data from the M5 Competition for quite a while, trying out various parameters, but unfortunately I wasn't able to get any meaningful predictions out of it.

As shown in the code below, even a single base learner was able to make a far more accurate forecast.

What could be the reason behind such poor accuracy?

from ngboost import NGBRegressor
from sklearn.tree import DecisionTreeRegressor

ngb = NGBRegressor(Base=DecisionTreeRegressor())
ngb.fit(df_trn[x_cols].values, df_trn['y'].values)
df_val['y_pred_ngb'] = ngb.predict(df_val[x_cols].values).reshape(-1)
[iter 0] loss=10.3206 val_loss=0.0000 scale=0.2500 norm=2.6698
[iter 100] loss=10.1343 val_loss=0.0000 scale=0.5000 norm=4.7316
[iter 200] loss=10.0357 val_loss=0.0000 scale=0.2500 norm=2.5211
[iter 300] loss=9.9927 val_loss=0.0000 scale=0.2500 norm=2.6747
[iter 400] loss=9.9541 val_loss=0.0000 scale=0.2500 norm=2.8595
dt = DecisionTreeRegressor()
dt.fit(df_trn[x_cols].values, df_trn['y'].values)
df_val['y_pred_dt'] = dt.predict(df_val[x_cols].values).reshape(-1)
df_val[['y_pred_ngb', 'y_pred_dt']].plot()
pd.concat([df_trn[-50:], df_val])['y'].plot()  # DataFrame.append is removed in recent pandas

[screenshot: plot comparing y_pred_ngb and y_pred_dt against the actual y]

alejandroschuler commented 4 years ago

Hmm, I'm not sure exactly what the issue is. What is on the x-axis of that plot? Could you describe the data a little, or is it a public dataset you could share, perhaps?

One thing I notice is that you're using ngb = NGBRegressor(Base=DecisionTreeRegressor()), which will let each tree grow as deep as it wants. Typically in boosting you constrain the depth of the trees to something small (<5, usually) and let the sum of trees do the work. That might be something.
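For illustration, a minimal sketch of that suggestion, reusing df_trn and x_cols from the snippet above (max_depth=3 is an arbitrary small value, not a setting tested in this thread):

from ngboost import NGBRegressor
from sklearn.tree import DecisionTreeRegressor

# Constrain each boosted tree to a small depth and let the ensemble do the work
shallow_tree = DecisionTreeRegressor(max_depth=3)
ngb = NGBRegressor(Base=shallow_tree)
ngb.fit(df_trn[x_cols].values, df_trn['y'].values)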

D3F4LT4ST commented 4 years ago

Thank you for your reply! The dataset was built from M5 Competition data. It has 1941 rows, one per day, where each row contains the total unit sales for that day (y) and various features extracted from the provided calendar dataset. It is split into training and validation sets such that the validation set contains the last 28 days. I have attached the dataset below.

df_trn, df_val = pd.read_pickle('dataset.pkl')

I have tried limiting the depth of the trees as you suggested, but it didn't have a visible impact. Here's the result after running the code from my first comment with the base learner's max depth set to 5:

[plot: ngboost_issue_plot]

dataset.zip

D3F4LT4ST commented 4 years ago

Follow-up: I was able to identify the cause of the issue. It appears that NGBoost expects the target values to be standardized, unlike other tree-based models such as LightGBM, which handled the raw data well. After performing the transformation I was able to get a pretty accurate forecast:

from sklearn.preprocessing import StandardScaler
from ngboost import NGBRegressor

scaler = StandardScaler()
scaler.fit(df_trn['y'].values.reshape(-1, 1))
df_trn['y'] = scaler.transform(df_trn['y'].values.reshape(-1, 1))

ngb = NGBRegressor()
ngb.fit(df_trn[x_cols].values, df_trn['y'].values)
# StandardScaler.inverse_transform expects a 2D array
df_val['y_pred'] = scaler.inverse_transform(
    ngb.predict(df_val[x_cols].values).reshape(-1, 1)
).reshape(-1)
[iter 0] loss=1.4189 val_loss=0.0000 scale=1.0000 norm=1.0000
[iter 100] loss=0.7564 val_loss=0.0000 scale=2.0000 norm=1.1780
[iter 200] loss=0.3069 val_loss=0.0000 scale=1.0000 norm=0.5042
[iter 300] loss=0.1152 val_loss=0.0000 scale=1.0000 norm=0.4907
[iter 400] loss=0.0048 val_loss=0.0000 scale=1.0000 norm=0.4905
df_val[['y_pred', 'y']].plot()

[plot: ngboost_issue_plot_2]

Maybe it is worth updating the user guide to emphasize the importance of standardizing?
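As a side note, the same workaround can be written more compactly with scikit-learn's TransformedTargetRegressor, which standardizes y on fit and inverse-transforms predictions automatically. This is a sketch assuming only that NGBRegressor exposes the usual scikit-learn fit/predict interface; it is not an NGBoost-documented recipe:

from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import StandardScaler
from ngboost import NGBRegressor

# Standardizes y before fitting and maps predictions back to the original scale
model = TransformedTargetRegressor(
    regressor=NGBRegressor(),
    transformer=StandardScaler(),
)
model.fit(df_trn[x_cols].values, df_trn['y'].values)
df_val['y_pred'] = model.predict(df_val[x_cols].values)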

guyko81 commented 4 years ago

I replaced the distribution with LogNormal and it fit nicely.

from ngboost.distns import LogNormal

model = NGBRegressor(Base=DecisionTreeRegressor(), Dist=LogNormal)

D3F4LT4ST commented 4 years ago

@guyko81 Thank you for the suggestion! It did indeed produce a nice forecast even without standardization. I'm still curious about what goes wrong when using the Normal distribution.

guyko81 commented 4 years ago

@D3F4LT4ST Don't forget that the function NGBoost minimizes is not a simple L2 norm but a log-likelihood, whose shape is determined by the assumed distribution. Since the actual distribution here is closer to a LogNormal, the error around the true value is asymmetric.

The first step in NGBoost is to fit the marginal parameters, which produces the flat line you saw. Given the mu and sigma found in that initial step, it is hard for the model to move away from this "local" minimum if it assumes a Normal distribution. If you check the average of the training "y" you'll see it is essentially equal to the fitted mu (the average is 34341). With more training steps it should eventually reach a better parameter set; when I ran the model for 2000 rounds it started to produce non-linear predictions.

With a LogNormal, the asymmetric error around the mean prediction makes it easier to "see" through the log-likelihood loss where to move, so fewer training steps are enough to find good parameters.
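To make the intuition concrete, here is a minimal sketch (my illustration, not from the thread) comparing the marginal Normal and LogNormal fits on synthetic right-skewed data; the lognormal parameters below are arbitrary, chosen only to mimic skewed sales totals:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical right-skewed "sales" data
y = rng.lognormal(mean=10.0, sigma=0.5, size=5000)

# Flat-line (marginal) maximum-likelihood fits, analogous to NGBoost's initialization
nll_normal = -stats.norm.logpdf(y, loc=y.mean(), scale=y.std()).mean()
log_y = np.log(y)
nll_lognormal = -stats.lognorm.logpdf(y, s=log_y.std(), scale=np.exp(log_y.mean())).mean()

# On skewed data the LogNormal marginal fit attains a lower NLL,
# giving the subsequent boosting steps a better-shaped loss to descend.
print(f"Normal marginal NLL:    {nll_normal:.3f}")
print(f"LogNormal marginal NLL: {nll_lognormal:.3f}")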