skrub-data / skrub

Prepping tables for machine learning
https://skrub-data.org/
BSD 3-Clause "New" or "Revised" License

Add features to the `DatetimeEncoder` #907

Open koaning opened 1 month ago

koaning commented 1 month ago

Problem Description

Skrub's DatetimeEncoder currently extracts a number of features, but a few feel missing: seasonal patterns (over the year and over the day) and holiday/business-day indicators.

These feel like great candidates to consider for our DatetimeEncoder.

Feature Description

Both seasonal patterns can be generated under the hood with scikit-learn's SplineTransformer using its periodic extrapolation setting. There's a demo of this technique here. In one case we'd model the features over the ordinal day of the year, while in the other we'd use the time of day.

The holidays might be a bit trickier because we'd need to rely on a third-party library to capture all of them. Then again, polars does support some business-day features, so we might be able to leverage something there.

Alternative Solutions

I'm planning on making a first version of such a component so that there's something concrete to shoot at. I can imagine that we may not want to support all of these features, but just a subset. It may also be that business/holiday features should go into a separate estimator.

Additional Context

No response

jeromedockes commented 1 month ago

Thanks for opening this! Both holidays and temporal patterns sound useful.

holidays: As you say, there is the problem that we would need another dependency to get the holidays. However, at first we could maybe have users provide their own holidays, similarly to what is done in polars' add_business_days for which you provided a link. We could also consider starting with an example showing how to use one of the joiner transformers to add that feature, after constructing a table that lists all holidays with the python-holidays package.
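A user-provided holiday calendar could start as simple as this pure-Python sketch (the `is_holiday` helper and the example dates are hypothetical, not an existing skrub or polars API):

```python
from datetime import date, datetime

# User-provided holiday calendar (illustrative dates, not from any library).
HOLIDAYS = {date(2024, 1, 1), date(2024, 7, 4), date(2024, 12, 25)}

def is_holiday(ts: datetime) -> bool:
    """1-D indicator feature: does the timestamp fall on a holiday?"""
    return ts.date() in HOLIDAYS

is_holiday(datetime(2024, 7, 4, 12, 0))  # -> True
```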

seasonal patterns: in your experience, are those features mostly useful for linear models, or do they also improve the performance of gradient boosting? And what dimensionality do you think is typically useful?

Another addition I would like to see for the DatetimeEncoder is the option to output some of its current features as Categorical dtypes, or as one-hot encoded categories. In particular, we may want to encode the day of the week, and probably the month, as categories rather than as floating-point numbers (as is done at the moment).
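For the categorical output, something along these lines with pandas (a sketch of the desired behaviour, not the current DatetimeEncoder API):

```python
import pandas as pd

stamps = pd.to_datetime(pd.Series(["2024-05-25 15:04", "2024-05-27 10:30"]))

# Categorical dtype instead of the current float encoding:
weekday = stamps.dt.day_name().astype("category")
month = stamps.dt.month_name().astype("category")

# Or one-hot encoded categories:
weekday_one_hot = pd.get_dummies(weekday, prefix="weekday")
```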

Also note that important changes are being made to the DatetimeEncoder in #902 (among others, adding support for polars and making it accept a single column rather than a dataframe). Would the first version you have in mind be in the form of changes to the DatetimeEncoder, or a stand-alone prototype?

koaning commented 1 month ago

in your experience, are those features mostly useful for linear models, or do they also improve the performance of gradient boosting? And what dimensionality do you think is typically useful?

I can't recall a public benchmark that I can share, but I have heard many anecdotes of folks using this technique after I presented it at a PyData event many years ago. I can't imagine why it wouldn't benefit an ensemble technique, though there is some wiggle room here due to the n_knots parameter.

Would the first version you have in mind be in the form of changes to the DatetimeEncoder, or a stand-alone prototype?

That depends on the preference of folks. The quickest way would be for me to build something solo and maybe run a few benchmarks to confirm that it works for non-linear models as well. If we prefer a benchmark before doing a proper implementation here, that could be a reasonable avenue to explore. Open to suggestions though!

jeromedockes commented 1 month ago

I didn't have the ensemble aspect in mind, but rather the fact that non-linear models might cope better with the raw features by themselves. For example, given the hour, a linear model would need the splines or some other feature engineering to separate out the lunch-break period, but a non-linear model such as gradient boosting could do it from the original feature. (They can be a good addition to the DatetimeEncoder in any case; I was just wondering if you had insights about the settings where these features are most often used.)

koaning commented 1 month ago

Ah, good that you point that out; it's a subtle difference. I guess even with a non-linear model the featurization technique can be seen as a way to steer the model, in a "you can ignore these features, but they may be really helpful in getting a good fit" kind of way. Another benefit worth mentioning is that the spline-y features are smoother, so fewer step functions in the output and smoother predictions instead.

A lot of this would still depend on the hyperparameters though. If there are no extra concerns, I'll try to find some time to run some benchmarks. I think my Kaggle datasets have a few examples where this might be relevant.

jeromedockes commented 1 month ago

Ah, good that you point that out; it's a subtle difference. I guess even with a non-linear model the featurization technique can be seen as a way to steer the model, in a "you can ignore these features, but they may be really helpful in getting a good fit" kind of way. Another benefit worth mentioning is that the spline-y features are smoother, so fewer step functions in the output and smoother predictions instead.

That's a good point.

A lot of this would still depend on the hyperparameters though. If there are no extra concerns, I'll try to find some time to run some benchmarks. I think my Kaggle datasets have a few examples where this might be relevant.

That's great, it will be super useful to get a sense of the settings where this boosts prediction, and of the accuracy vs. time and memory tradeoffs. One more thing to look out for: we don't support sparse data (because polars doesn't and most likely never will), and depending on the chosen hyperparameters the dimensionality of the spline features can get really high.

The Kaggle datasets sound great; another one that I was thinking could be useful for the example gallery is the bike-rental one used in scikit-learn examples:

https://scikit-learn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html

I was thinking we could use it to rewrite our current datetime encoder example:

https://skrub-data.org/stable/auto_examples/03_datetime_encoder.html#sphx-glr-auto-examples-03-datetime-encoder-py

which at the moment uses a dataset where the different datetime features don't bring a lot of information.

koaning commented 1 month ago

I have some results from a time-series task with these contents in X:

[screenshot: preview of the feature columns in X]

When I run this with a bunch of base settings, I see these CV results (the table is wide so you may need to zoom in).

[screenshot: cross-validation results table]

There are different algorithms (XGBoost, LightGBM, scikit-learn's histogram gradient boosting, and ridge) and different featurization settings (table vectorizer, table vectorizer that drops an id column, and table vectorizer with the seasonal date feature). Across all the algorithms, adding the seasonal feature seems to improve things. The improvement may not be incredibly substantial, but it does seem consistent.

koaning commented 1 month ago

There is another one of these datasets that seems to have similar results.

[screenshot: cross-validation results for a second dataset]

I want to add one caveat: these datasets are synthetic. Kaggle reports that they are based on actual datasets, but in the end this benchmark runs on simulated data. That said, the improvement again seems to be consistent.

[screenshot: cross-validation results table]

jeromedockes commented 1 month ago

We discussed it this morning during the skrub meeting (you're welcome to join whenever you want, by the way; it's every Monday 10:30 to 11:00 Europe/Paris time. If you're interested, I'll send you the link).

I think there is a consensus that skrub should provide the "seasonal patterns" you describe. However @GaelVaroquaux raised the point that splines can be a bit tricky to parametrize and we were wondering: in your experience are sine/cosine transforms easier to work with and how do they perform?
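For reference, the sine/cosine transform in question maps a cyclic value onto the unit circle; a stdlib-only sketch (the helper name is made up for illustration):

```python
import math

def sin_cos(value: float, period: float = 24.0) -> tuple[float, float]:
    """Encode a cyclic value (e.g. hour of day) as a point on the unit
    circle, so that 23:00 and 01:00 end up close together."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)
```

The output dimension is fixed at 2 per cycle, which is part of its appeal; the question is whether that is expressive enough.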

jeromedockes commented 1 month ago

In this example, the sine features seem to perform worse than the splines, or than simple one-hot encoding of the hour.

koaning commented 1 month ago

The splines aren't perfect for sure, but so far they've always seemed simple enough, and pragmatic in the sense that they're a simple thing to reason about. I do recall that regularisation on the model that follows can help a lot though. I have never really tried the sine features because the spline trick has always worked pretty well for seasonality.

I guess a good next question is ... how might we want to implement this?

GaelVaroquaux commented 1 month ago

The splines aren't perfect for sure, but so far they've always seemed simple enough, and pragmatic in the sense that they're a simple thing to reason about. I do recall that regularisation on the model that follows can help a lot though.

I just worry that they are not a simple two-liner implementation. Rather, they would hide something with a lot of subtleties and corresponding hyperparameters inside the DatetimeEncoder. I don't like that.

koaning commented 1 month ago

That's a fair concern, but I am not sure what the user might expect besides "sensible defaults". The most general seasonal pattern feels like "something something monthly", so maybe setting n_knots=12 is good enough?

Part of me worries that there may not be anything simpler to configure than n_knots, mainly because it translates nicely to "peakiness". And also, we're just aiming for a sensible seasonality featurizer. Stuff like holidays, which could be seen as a sort of seasonal feature, is out of scope here.

jeromedockes commented 2 weeks ago

Stuff like holidays, which could be seen as a sort of seasonal feature, is out of scope here.

I agree, holidays/weekends are a different issue -- they're just a 1D indicator that says whether each time point falls during a holiday. So let's discuss them in #710 instead, and focus on the splines/cyclical features here.

jeromedockes commented 2 weeks ago

Discussing a bit with @ogrisel and @glemaitre, we were thinking that for most patterns likely to be relevant, splines that are flat with a peak will capture them more easily than sines. For example, "lunch break" can be nicely captured by one spline with a width of roughly 1 h, whereas its representation in the frequency domain has many coefficients. So with splines we may get away with a smaller dimension, and get more interpretable models and defaults that are easier to set.
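The "flat with a peak" intuition can be illustrated with a single localized basis function; here a triangular bump, a deliberately simplified stand-in for an actual B-spline, with made-up parameter values:

```python
def lunch_bump(hour: float, center: float = 12.5, width: float = 1.0) -> float:
    """Triangular bump: 1.0 at the center, falling to 0.0 at +/- width.
    One such feature is enough to isolate the lunch-break hours, whereas a
    Fourier (sine/cosine) representation of the same shape needs many terms."""
    return max(0.0, 1.0 - abs(hour - center) / width)

lunch_bump(12.5)  # -> 1.0
lunch_bump(9.0)   # -> 0.0
```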

jeromedockes commented 2 weeks ago

I also wonder whether the current interface of the DatetimeEncoder is suitable for adding those features, or if the parameters should be in terms of "which cycles to represent" rather than the current "resolution".

koaning commented 2 weeks ago

I also wonder whether the current interface of the DatetimeEncoder is suitable for adding those features, or if the parameters should be in terms of "which cycles to represent" rather than the current "resolution".

That's a good point. For my own tools so far, I've often resorted to an API similar to:

make_union(
  SeasonalFeaturizer(date_col="datetime", kind="hour_per_day", knots=24),
  SeasonalFeaturizer(date_col="date", kind="day_of_year", knots=12)
)

Something about composing multiple of 'em feels nice when you're doing things manually ... but there might be something we can infer if the dataframe going in gives us a datetime vs. a date?