linkedin / greykite

A flexible, intuitive and fast forecasting library
BSD 2-Clause "Simplified" License

Greykite feature scaling #95

Closed: dromare closed this issue 1 year ago

dromare commented 1 year ago

I would like to know how Greykite handles the X matrix of features with regularized linear models (Ridge, Lasso, Elastic Net). The literature says that feature scaling is necessary for regularization to work properly.

Say time-series features such as a trend term, sine and cosine terms, autoregressive features, etc. make up the X matrix, possibly along with some external regressor columns. At this point the time-series features are not scaled; e.g., we may have an estimated trend feature of magnitude 1, some sin() and/or cos() terms of magnitude 10, others of magnitude 0.1, and so on.

Is the X matrix at this point passed as-is to sklearn's fit() function, like this: clf = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1]).fit(X, y)

or is it scaled in Greykite code, using StandardScaler(), MinMaxScaler(), or something else, before being passed to fit()?

( sklearn documentation:

sklearn.linear_model.RidgeCV(alphas=(0.1, 1.0, 10.0), *, fit_intercept=True, **normalize=False**, scoring=None, cv=None, gcv_mode=None, store_cv_values=False, alpha_per_target=False)

normalize : bool, default=False. This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use StandardScaler before calling fit on an estimator with normalize=False.

)
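
For reference, a minimal sketch of the pattern that sklearn note recommends, on toy data (this is not Greykite's internal code, just the explicit-StandardScaler idiom the docs describe):

```python
# Standardize explicitly with StandardScaler instead of relying on `normalize`.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * [1.0, 10.0, 0.1]  # features on very different scales
y = X @ [0.5, 0.05, 5.0] + rng.normal(size=100)   # toy target

clf = make_pipeline(StandardScaler(), RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1]))
clf.fit(X, y)
```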

amyfei2015 commented 1 year ago

Thanks for the question!

Scaling is done by other Greykite code before the X matrix is passed to fit(). (Specifically, the function normalize_df in greykite.common.features.normalize is called.)

When using the library, it can be specified with normalize_method under the custom parameters. (Please refer to the documentation on how to use this parameter.) The current options are "zero_to_one", "statistical", "minus_half_to_half", and "zero_at_origin". If None is passed, no normalization is performed. Please refer to the descriptions of normalize_df for how each method works.
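
For illustration, a minimal sketch of setting this parameter through the forecast configuration (assuming the Greykite 0.4+ template API; adjust to your setup):

```python
# Sketch: select the normalization method via `custom` model components
# (assumes Greykite >= 0.4; option names as described in this thread).
from greykite.framework.templates.autogen.forecast_config import (
    ForecastConfig,
    ModelComponentsParam,
)

model_components = ModelComponentsParam(
    custom=dict(
        # One of "zero_to_one", "statistical", "minus_half_to_half",
        # "zero_at_origin", or None (no normalization).
        normalize_method="statistical",
    )
)
config = ForecastConfig(model_components_param=model_components)
```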

dromare commented 1 year ago

Thank you !

It looks like normalize_method under the custom parameters was introduced in Greykite v0.4, as seen in greykite/docs/0.3.0 (screenshots omitted).

And not available in Greykite v0.3, as seen in greykite/docs/0.1.0 (screenshot omitted).

We will need to update our Greykite version to 0.4 then, thanks.

BUT let me add a couple more comments:

  1. The zero_to_one method is the default used by normalize_df, but it is called min_max in docs/0.3.0; I suppose the documentation needs to be updated
  2. What would be the best approach to follow when X contains a mix of continuous variables (say, time-series features) and categorical variables? One might argue that the statistical method (StandardScaler) should be applied to the continuous features, while no method, or at most the zero_to_one method, should be applied to the categorical variables.

amyfei2015 commented 1 year ago

Hi! Thanks so much for the comments!

For 1:

Sorry, I linked the wrong version above; the most current documentation is 0.4.0. It doesn't yet include a description of the "zero_at_origin" method, but it does have the name changes fixed. We are actively working on keeping all documentation up to date.

I just fixed the link in the comment above. Thanks for pointing it out and sorry for the confusion!

For 2:

In Greykite, all categorical variables are one-hot encoded, so those columns contain only 0 and 1 (before fitting). We agree that it's best practice to keep all categorical variables in this format, and that is why we set the default to zero_to_one.

However, if you prefer to use statistical, you can go ahead with it. Functionally, it should not affect the prediction results much, since these columns still contain only two distinct values. It does make the coefficients harder to read, though. Hope this helps!
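
To make the coefficient-readability point concrete, here is a small standalone illustration in plain NumPy, using the textbook formulas rather than Greykite's normalize_df:

```python
# A 0/1 one-hot column is left unchanged by min-max ("zero_to_one") scaling,
# while standardization ("statistical") rescales it away from {0, 1}.
import numpy as np

onehot = np.array([0.0, 1.0, 1.0, 0.0, 1.0])

zero_to_one = (onehot - onehot.min()) / (onehot.max() - onehot.min())
print(zero_to_one)   # [0. 1. 1. 0. 1.] -> identical to the input

statistical = (onehot - onehot.mean()) / onehot.std()
print(statistical)   # approx. [-1.22  0.82  0.82 -1.22  0.82] -> harder to interpret
```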

dromare commented 1 year ago

Hi ! Thank you for the detailed explanation !

For 1: all clear.

For 2:

Let's say some external regressors are in binary form (0 and 1); then zero_to_one is the best practice. But some others may be numerical, and time-series features are also numerical; for numerical features, best practice would be statistical. The X being fitted may contain both types of features, so which normalize_method would be best practice?

I can also see that the option to scale external regressors is still available (screenshot omitted).

Does this mean that external regressors may be subjected to double scaling: once when applying a non-None input__regressors_numeric__normalize__normalize_algorithm method, and a second time when applying a non-None normalize_method?

amyfei2015 commented 1 year ago

Hi! Thanks for the follow-up!

In the case you mention, we would suggest fitting with both methods and seeing how each performs on your dataset. For categorical variables, zero_to_one is the best practice, but statistical would still work. For numerical features, zero_to_one and statistical usually have similar effects, though statistical may be more robust to outliers. If outliers are removed first, zero_to_one should perform similarly for numerical variables.

Generally we expect the two methods to yield similar performance. Let us know if there is anything else to consider, and please feel encouraged to tell us if either works out for you. Thanks!

Regarding the question on the option to scale external regressors: you are correct that external regressors may be scaled twice if those variables are set and normalize_method has a non-None value. Generally we encourage leaving input__regressors_numeric__normalize__normalize_algorithm as None and using normalize_method instead.
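
As a sketch, and assuming the parameter paths quoted in this thread, that recommendation might look like this in the configuration (illustrative only; verify the names against your Greykite version):

```python
# Sketch: rely on `normalize_method` alone and keep the pipeline-level
# regressor scaler off, to avoid double scaling of external regressors.
from greykite.framework.templates.autogen.forecast_config import ModelComponentsParam

model_components = ModelComponentsParam(
    custom=dict(normalize_method="zero_to_one"),
    hyperparameter_override={
        # Parameter path as quoted in this thread.
        "input__regressors_numeric__normalize__normalize_algorithm": None,
    },
)
```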

Thanks and hope this answer helps!

dromare commented 1 year ago

Thanks, all clear now !

statistical would be recommended when the features can be assumed to have a Gaussian distribution; otherwise zero_to_one would be a safer choice. Outliers can always be treated at the preprocessing stage anyway, using the input__response__outlier__ (for time-series features) and input__regressors_numeric__outlier__ (for numeric external regressors) options.

Feel free to close this issue whenever you like, thank you again !