Closed · D3F4LT4ST closed this issue 4 years ago
Hmm, I'm not sure exactly what the issue is. What is on the x-axis of that plot? Could you describe the data a little bit or is it a public dataset you could share, perhaps?
One thing I notice is that you're using `ngb = NGBRegressor(Base=DecisionTreeRegressor())`, which lets each tree grow as deep as it wants. Typically in boosting you constrain the depth of the trees to something small (< 5, usually) and let the sum of trees do the work. That might be the problem.
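To make the depth-constraint point concrete, here's a small scikit-learn-only sketch on synthetic data showing how an unbounded tree grows far deeper than a capped one; the commented NGBoost call at the end assumes the `ngboost` package and is only illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X[:, 0] + rng.normal(scale=0.1, size=500)

# Unbounded tree: keeps splitting until it memorizes the noise.
deep = DecisionTreeRegressor(random_state=0).fit(X, y)
# Capped tree: the kind of weak learner boosting expects.
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

print(deep.get_depth(), shallow.get_depth())
# In NGBoost the capped learner would be passed as, e.g. (assuming ngboost is installed):
#   NGBRegressor(Base=DecisionTreeRegressor(max_depth=3))
```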
Thank you for your reply! The dataset was built from M5 Competition data. It has 1941 rows corresponding to 1941 days, where each row contains the total unit sales for that day (y) and various features extracted from the provided calendar dataset. It is then split into training and validation sets such that the validation set contains the data for the last 28 days. I have attached the dataset below.
df_trn, df_val = pd.read_pickle('dataset.pkl')
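For reference, the 28-day holdout split described above might look something like this; the toy frame below stands in for the real dataset, whose feature columns are omitted:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the daily-sales frame: one row per day, target column 'y'.
df = pd.DataFrame({"y": np.arange(100.0)})

# Hold out the final 28 days for validation, keep the rest for training.
df_trn, df_val = df.iloc[:-28], df.iloc[-28:]
print(len(df_trn), len(df_val))
```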
I have tried limiting the depth of the trees as you suggested, but it didn't have a visible impact. Here's the result I got after running the code from my first comment with the base learner's max depth set to 5:
Follow-up: I was able to identify the cause of the issue. It appears that NGBoost expects the observations to be in standardized form, unlike other tree-based models like LightGBM, which handled the raw data really well. After performing the transformation I was able to get a pretty accurate forecast:
from sklearn.preprocessing import StandardScaler

# Standardize the target before fitting, then invert the transform on predictions.
scaler = StandardScaler()
scaler.fit(df_trn['y'].values.reshape(-1, 1))
df_trn['y'] = scaler.transform(df_trn['y'].values.reshape(-1, 1))
ngb = NGBRegressor()
ngb.fit(df_trn[x_cols].values, df_trn['y'].values)
df_val['y_pred'] = scaler.inverse_transform(
    ngb.predict(df_val[x_cols].values).reshape(-1, 1)  # inverse_transform expects a 2D array
).reshape(-1)
[iter 0] loss=1.4189 val_loss=0.0000 scale=1.0000 norm=1.0000
[iter 100] loss=0.7564 val_loss=0.0000 scale=2.0000 norm=1.1780
[iter 200] loss=0.3069 val_loss=0.0000 scale=1.0000 norm=0.5042
[iter 300] loss=0.1152 val_loss=0.0000 scale=1.0000 norm=0.4907
[iter 400] loss=0.0048 val_loss=0.0000 scale=1.0000 norm=0.4905
df_val[['y_pred', 'y']].plot()
Maybe it is worth updating the user guide to emphasize the importance of standardizing the target?
I changed the distribution to LogNormal and it fit nicely.
model = ngb.NGBRegressor(Base=DecisionTreeRegressor(), Dist=LogNormal)
@guyko81 Thank you for the suggestion! It did indeed produce a nice forecast, even without standardization. I'm still curious about what goes wrong when using the Normal distribution.
@D3F4LT4ST Don't forget that in NGBoost the function the model minimizes is not a simple L2 norm but a log-likelihood, whose shape is determined by the distributional assumption. I think because the actual distribution is closer to LogNormal, the error around the prediction is asymmetric. The first step in NGBoost is to find the marginal parameters. Given the mu and sigma found in that initial step (which produce the flat-line prediction), it's hard under a Normal assumption to move out of this "local" minimum. If you check the average of the training data's y, you'll see it's basically equal to the fitted mu parameter (the average is 34341). With more training steps it should end up in a better parameter set, though: when I ran the model for 2000 rounds it started to produce non-linear predictions. With LogNormal, the error around the mean prediction is asymmetric, so it's easier to "see" through the log-likelihood loss where to move, and therefore fewer training steps are enough to find good parameters.
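As a rough illustration of this point (a scipy-based sketch, not NGBoost itself): on right-skewed positive data such as daily sales, a fitted LogNormal attains a lower negative log-likelihood than a fitted Normal, which is one way to see why `Dist=LogNormal` has an easier loss surface to descend here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
y = rng.lognormal(mean=10.0, sigma=0.3, size=2000)  # skewed positive data

# Fit Normal by moments (its MLE); fit LogNormal by MLE on the log scale.
mu, sd = y.mean(), y.std()
lmu, lsd = np.log(y).mean(), np.log(y).std()

nll_normal = -stats.norm.logpdf(y, loc=mu, scale=sd).sum()
nll_lognorm = -stats.lognorm.logpdf(y, s=lsd, scale=np.exp(lmu)).sum()
print(nll_normal > nll_lognorm)  # the LogNormal fit scores better on skewed data
```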
Hello,
I am relatively new to ML and recently came across this library when looking for models suited for probabilistic forecasting.
I have experimented with NGBoost on data from the M5 Competition for quite a while, trying out various parameters, but unfortunately wasn't able to get any meaningful predictions out of it.
As shown in the code below, even a single base learner was able to make a far more accurate forecast.
What could be the reason behind such poor accuracy?