koaning / scikit-lego

Extra blocks for scikit-learn pipelines.
https://koaning.github.io/scikit-lego/
MIT License
1.25k stars, 117 forks

[FEATURE] Contribution: DateTime Periodicity Encoder #415

Open tbezemer opened 3 years ago

tbezemer commented 3 years ago

I've implemented a DateTimePeriodicityEncoder. It is a scikit-learn encoder for datetime features that uses sine and cosine transformations to capture periodicity in datetimes. This type of transformation ensures that an algorithm can learn that hour 23 is close to hour 00, minute 59 is close to minute 0, etc.

It can be used to capture different "aspects" of a datetime (e.g. minute-in-hour, hour-of-day, day-of-week, day-of-month) as such:

dpe1 = DateTimePeriodicityEncoder(aspects=["second", "minute", "hour", "weekday", "day", "month"])
dpe2 = DateTimePeriodicityEncoder(["hour", "weekday"])
dpe3 = DateTimePeriodicityEncoder("weekday")

For each of the aspects, it returns two new columns containing the respective sine and cosine transformations.

I have written unit tests and it passes the scikit-learn check_estimator (with some tags).

@MBrouns asked me to create an issue and tag you, @koaning, to see if this could be a useful contribution for scikit-lego. If so, I can submit a pull request.

koaning commented 3 years ago

Having something that can pick up a datetime object sure sounds practical to me. A lot of folks mentioned they were at times confused by the RadialBasis trick. Just so I understand the sine / cosine part: do you have a figure of the exact effect?

tbezemer commented 3 years ago

There you go! Does this image help?

datetimeperiodicity-2

koaning commented 3 years ago

Clear. But then I have one other question; is there a reason to create both the sine and cosine columns? Why both?

Also, got an exact use-case for this? One concern I have with this, compared to the RepeatingBasisFunctions, is that it might only be able to describe one specific seasonality shape.

tbezemer commented 3 years ago

Each component captures 50% of the information. I attempted to illustrate it in the diagram below.

I don't completely understand what you mean by being able to only describe one specific seasonality? This can be applied to the different aspects I named in my description. Or is that not what you mean?

datetimeperiodicity-3
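A minimal sketch of why both columns are useful, assuming a 24-hour clock (the helper name `encode_hour` is made up for illustration): the sine value alone is shared by two different hours, and the cosine value tells them apart.

```python
import math


def encode_hour(hour, period=24.0):
    """Return the (sine, cosine) pair for an hour on a 24-hour clock."""
    angle = 2.0 * math.pi * hour / period
    return math.sin(angle), math.cos(angle)


sin_2, cos_2 = encode_hour(2)
sin_10, cos_10 = encode_hour(10)

# 02:00 and 10:00 share the same sine value, so sine alone cannot separate them...
same_sine = math.isclose(sin_2, sin_10)
# ...but their cosine values have opposite signs, which resolves the ambiguity.
different_cosine = not math.isclose(cos_2, cos_10)

print(same_sine, different_cosine)
```

Together the pair places each hour at a unique point on the unit circle, which neither column does on its own.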

koaning commented 3 years ago

Let us consider the example from the docs.

When you use your tool to generate features for such a dataset, does it really help a model? I'm wondering if your features can help in situations where the shape you're trying to fit is not a "perfect sine". That's what I mean by "a specific seasonality". Most seasonal patterns that I've seen don't really fit a sine wave. It's usually something like "high in summer, zero in winter" or something else that's smoothly repeating ... but not a sine wave.

tbezemer commented 3 years ago

But is this sine wave you show above not showing some other variable that moves in time? The encoding I suggest is about encoding time itself, to capture information about when in a particular cycle something happened. It is a more appropriate alternative to simply extracting the "hour of day" or "day of the week" as an integer from a timestamp. Therefore, my use of this would be more for flat datasets as opposed to time series. I think that for time series the application is indeed limited. A more concrete use case: a security system logs events with a particular type, and we might want our model to learn which types of events occur late at night or in the wee hours of the morning. A plain integer hour (1-24) would not suffice here.

MBrouns commented 3 years ago

So the thing that @tbezemer has here is an order 1 Fourier series, which is indeed not super expressive in and of itself. That said, an order 2 Fourier series can easily replicate @koaning's pattern above:

image

Maybe it makes sense to include the order as a hyperparameter. That way it is also more in line with the RBF encoder that we already have? I would consider renaming it to FourierSeriesEncoder though, to make it more clear how it encodes rather than what it encodes.
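To illustrate the point about the order, here is a rough sketch with NumPy. The target pattern (a clipped sine, "high in summer, zero in winter") and the helper names `fourier_features` and `fit_rmse` are assumptions for illustration, not taken from the docs example. It fits an intercept plus Fourier features by least squares and compares the error for order 1 and order 2.

```python
import numpy as np


def fourier_features(t, period, order):
    """Stack sin/cos columns for harmonics 1..order of the given period."""
    columns = []
    for k in range(1, order + 1):
        angle = 2.0 * np.pi * k * t / period
        columns.append(np.sin(angle))
        columns.append(np.cos(angle))
    return np.column_stack(columns)


def fit_rmse(t, y, period, order):
    """Least-squares fit of y on an intercept plus Fourier features; return the RMSE."""
    X = np.column_stack([np.ones_like(t), fourier_features(t, period, order)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sqrt(np.mean((X @ coef - y) ** 2)))


# Hypothetical seasonal pattern: positive half of a yearly sine, zero otherwise.
day = np.arange(365, dtype=float)
y = np.clip(np.sin(2.0 * np.pi * day / 365.0), 0.0, None)

rmse_order_1 = fit_rmse(day, y, period=365.0, order=1)
rmse_order_2 = fit_rmse(day, y, period=365.0, order=2)
print(rmse_order_1, rmse_order_2)
```

On this made-up pattern the order 2 fit has the lower error, since the extra harmonic can flatten the "winter" half that a single sine/cosine pair cannot.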

koaning commented 3 years ago

@MBrouns that was indeed the direction I was thinking of. Doing that seems very sensible and I'd certainly welcome a PR with that feature.

koaning commented 3 years ago

There seems to be a bit of radio silence. @tbezemer are you still interested in implementing this?

tbezemer commented 3 years ago

Yes, definitely! I chatted with Matthijs about this a week ago. It's high up on my to do list. To be continued.

tbezemer commented 3 years ago

I have extended the transformer to accept a positive integer parameter n_periods (in accordance with the RepeatingBasisFunction transformer). For each period p from 1 to n_periods, a set of two columns (one sine and one cosine transformation) will be produced, where p divides the periodicity of the aspect:

sin(val * 2 * pi / (periodicity / p))
cos(val * 2 * pi / (periodicity / p))

So for the aspect 'hour' with n_periods = 2, and a datetime whose 'hour' component is 1 (1am), the respective calculations are as follows:

Period 1: sin(1.0 * 2 * pi / (24.0 / 1)), cos(1.0 * 2 * pi / (24.0 / 1))
Period 2: sin(1.0 * 2 * pi / (24.0 / 2)), cos(1.0 * 2 * pi / (24.0 / 2))

Is this what you meant?
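For concreteness, the 1am example can be evaluated numerically. This is only a sketch of the arithmetic described in this comment; the function name `periodic_pair` is made up for illustration.

```python
import math


def periodic_pair(value, periodicity, p):
    """Sine/cosine of `value` on a cycle of length periodicity / p."""
    angle = value * 2.0 * math.pi / (periodicity / p)
    return math.sin(angle), math.cos(angle)


# Hour component 1 (01:00) with n_periods = 2.
pairs = [periodic_pair(1.0, 24.0, p) for p in (1, 2)]
for p, (s, c) in enumerate(pairs, start=1):
    print(f"Period {p}: sin={s:.4f}, cos={c:.4f}")
# Period 1: sin=0.2588, cos=0.9659
# Period 2: sin=0.5000, cos=0.8660
```

Period 2 runs through its cycle twice per day, so 01:00 sits at twice the angle it has in period 1.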

koaning commented 3 years ago

Is this what you meant?

@tbezemer, who are you referring to here: @MBrouns or myself?

tbezemer commented 3 years ago

Both of you, since you seemed to be on the same page about extending the transformer in this way. But @MBrouns and I specifically discussed this part, so perhaps it is easier for him to weigh in on this?

koaning commented 3 years ago

I think I'm cool with having a transformer for periodicity, but before you start the PR we can save a lot of review time if we discuss the signatures of the transformer here first. Mainly on my end; @tbezemer could you describe the full input of the object? Maybe list a few examples that demonstrate the main use cases? Once we're agreed on that, the implementation should be very straightforward.

tbezemer commented 3 years ago

Certainly!

This transformer allows a user to decompose timestamps into their sine and cosine components for each aspect of this timestamp (second, minute, hour, day (in month), weekday, month), and allows the user to further finetune periodicity within this aspect by specifying n_periods. In other words, this transformer allows us to easily specify on what level/aspect of our timestamp we are interested in learning periodicities, and n_periods allows us to fine tune periodicity within this aspect.

def __init__(self, aspects=None, n_periods=1):
    """- aspects is one of, or a list of: "second", "minute", "hour", "weekday", "day", "month".
      If None is specified, the whole list is used.
    - n_periods is a positive integer. For each period p from 1 to n_periods,
      a new pair of sine/cosine transformations is produced for each of the passed aspects,
      with periodicity aspect_periodicity / p."""

def fit(self, X, y=None):
    """Only sets the trailing-underscore attributes (e.g. self.n_periods_ and
    self.aspects_) and saves the shape of X."""

def transform(self, X, y=None):
    """- X is expected to be an np.array; otherwise it is assumed to be a pd.DataFrame
      and X.values is extracted.
    - X is then checked for conformity to the expected datatype: datetime64.
    - Applies an extractPeriodicFeatures(...) function to each column in the array.
    - Returns the transformed X, which consists of pairs of sine/cosine transformations
      for each aspect, for each period from 1 to self.n_periods_."""
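To make the proposed signatures concrete, here is a rough sketch of how they could fit together. It is not the actual implementation: the class name `PeriodicityEncoderSketch`, the helper `extract_aspect`, and the `PERIODS` table are assumptions made for illustration. To stay dependency-light it uses only NumPy and does not subclass scikit-learn's BaseEstimator/TransformerMixin, which a real contribution would. The "day" aspect is left out because the period used for it is not stated here.

```python
import numpy as np

# Assumed number of distinct values per aspect ("day" is omitted, see above).
PERIODS = {"second": 60, "minute": 60, "hour": 24, "weekday": 7, "month": 12}


def extract_aspect(x, name):
    """Return one datetime component as integers for a 1-D datetime64 array."""
    if name == "second":
        return (x.astype("datetime64[s]") - x.astype("datetime64[m]")).astype(int)
    if name == "minute":
        return (x.astype("datetime64[m]") - x.astype("datetime64[h]")).astype(int)
    if name == "hour":
        return (x.astype("datetime64[h]") - x.astype("datetime64[D]")).astype(int)
    if name == "weekday":
        # 1970-01-01 was a Thursday; shift so that Monday maps to 0.
        return (x.astype("datetime64[D]").astype(int) + 3) % 7
    if name == "month":
        # Months counted from 1970-01, so January maps to 0.
        return x.astype("datetime64[M]").astype(int) % 12
    raise ValueError(f"Unknown aspect: {name!r}")


class PeriodicityEncoderSketch:
    """Hypothetical stand-in for the transformer described above."""

    def __init__(self, aspects=None, n_periods=1):
        self.aspects = aspects
        self.n_periods = n_periods

    @staticmethod
    def _validate(X):
        # Accept arrays directly; fall back to `.values` for DataFrame-like input.
        X = np.asarray(getattr(X, "values", X))
        if not np.issubdtype(X.dtype, np.datetime64):
            raise TypeError("X must have dtype datetime64")
        return X.reshape(len(X), -1)

    def fit(self, X, y=None):
        X = self._validate(X)
        if not (isinstance(self.n_periods, int) and self.n_periods >= 1):
            raise ValueError("n_periods must be a positive integer")
        aspects = self.aspects
        if aspects is None:
            aspects = list(PERIODS)
        elif isinstance(aspects, str):
            aspects = [aspects]
        self.aspects_ = list(aspects)
        self.n_periods_ = self.n_periods
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X, y=None):
        X = self._validate(X)
        columns = []
        for j in range(X.shape[1]):
            for name in self.aspects_:
                values = extract_aspect(X[:, j], name)
                for p in range(1, self.n_periods_ + 1):
                    angle = values * 2.0 * np.pi / (PERIODS[name] / p)
                    columns.extend([np.sin(angle), np.cos(angle)])
        return np.column_stack(columns)


X = np.array(["2021-03-01T01:00:00", "2021-03-07T23:30:00"], dtype="datetime64[s]")
encoder = PeriodicityEncoderSketch(aspects=["hour", "weekday"], n_periods=2)
encoded = encoder.fit(X).transform(X)
print(encoded.shape)  # 2 aspects * 2 periods * (sine, cosine) = 8 columns
```

The output columns are ordered per input column, then per aspect, then per period, with the sine column before the cosine column in each pair.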

Use case: We have an event log with a time stamp column for which we are trying to predict some outcome variable. We want our model to be able to learn about potential periodicities in this timestamp. For instance, it may be that some type of event mainly occurs at night. Using 24 hour time, night could range from for instance 21:00 to 5:00. Sine and cosine transformations capture that 23:00 and 01:00 are still close to each other on the clock, even though 1 and 23 are almost maximally distanced from each other on the 1-24 ordinal scale that a model may assume when simply passing the hour as an integer.
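The closeness argument can be checked with a few lines of Python. This is a sketch assuming a 24-hour clock; `clock_point` and `encoded_distance` are illustrative names. It compares the gap between 23:00 and 01:00 on the integer scale with the Euclidean distance between their sine/cosine encodings.

```python
import math


def clock_point(hour, period=24.0):
    """Map an hour to its (sine, cosine) point on the unit circle."""
    angle = 2.0 * math.pi * hour / period
    return math.sin(angle), math.cos(angle)


def encoded_distance(a, b):
    """Euclidean distance between the encodings of two hours."""
    return math.dist(clock_point(a), clock_point(b))


raw_gap = abs(23 - 1)              # 22 on the integer scale
near = encoded_distance(23, 1)     # two hours apart on the clock
far = encoded_distance(23, 11)     # twelve hours apart, opposite sides of the clock

print(raw_gap, round(near, 4), round(far, 4))
```

In the encoded space 23:00 is exactly as far from 01:00 as 01:00 is from 03:00, which the integer representation does not reflect.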

In my opinion, the added value of this transformer is:

Let me know if you need more information!

koaning commented 3 years ago

A few points on my end.

  1. Will you be able to pass in multiple aspects, say PeriodicityEncoder(aspects=['day', 'hour'])? You could also use a FeatureUnion but is there a use-case where you take one datetime in and you get an array of 4 columns (2 corresponding to day and the other two to hour) out instead of two?
  2. I'm wondering if "aspects" is the best name here. Might "feature" be preferable?
  3. Do we really need copies of variables to be created in .fit()? The main thing that needs to happen there is checking if the data format is correct. If nothing is really learned then simply setting a self.fitted_ = True seems sufficient.
  4. I'm not 100% sure what the n_periods input variable does. Could you perhaps rephrase what that parameter does?

tbezemer commented 3 years ago

Ad 1. Yes, indeed! You can pass a list as well. FeatureUnion could be an alternative, but because these transformations all pertain to the same column of datetimes, I think it makes more sense to extract them in one go instead of having to repeat it manually for each aspect. This way, you could also grid search with different combinations of aspects without changing your preprocessing pipeline. If you disagree, I can rewrite it to only allow a single aspect.

Ad 2. We can definitely change aspects to features!

Ad 3. Yeah, that is definitely another way of doing it. I thought that the sklearn convention was to only use trailing underscore variables in the transform step, to ensure that attributes have not changed since first fit, but I see how copying them into a differently named attribute seems a bit redundant. I can change that as per your suggestion!

Ad 4. I hope this image makes it clearer. Forgive my poor handwriting :). @MBrouns suggested this to me, to also allow for higher-frequency effects within the total period of each aspect. That, or maybe I horribly misunderstood what he meant, haha.

transformer