Open tbezemer opened 3 years ago
Having something that can pick up a datetime object sure sounds practical to me. A lot of folks mentioned they were at times confused by the RadialBasis trick. Just so I understand the sine/cosine part: do you have a figure of the exact effect?
There you go! Does this image help?
Clear. But then I have one other question; is there a reason to create both the sine and cosine columns? Why both?
Also, got an exact use-case for this? One concern I have with this, compared to the RepeatingBasisFunctions, is that it might only be able to describe one specific seasonality shape.
Each component captures 50% of the information. I attempted to illustrate it in the diagram below.
I don't completely understand what you mean by being able to only describe one specific seasonality? This can be applied to the different aspects I named in my description. Or is that not what you mean?
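To make the "why both?" point concrete, here is a minimal sketch using only the standard library. The helper name `encode_hour` is made up for illustration and is not part of the proposed transformer. It shows that the sine column alone cannot tell two different hours apart, while the sine/cosine pair can.

```python
import math

def encode_hour(hour, period=24):
    """Map an hour onto the unit circle and return (sine, cosine)."""
    angle = 2 * math.pi * hour / period
    return math.sin(angle), math.cos(angle)

# 02:00 and 10:00 are different times of day ...
sin_2, cos_2 = encode_hour(2)
sin_10, cos_10 = encode_hour(10)

# ... but they share the same sine value, so sine alone is ambiguous.
print(round(sin_2, 3), round(sin_10, 3))  # 0.5 0.5

# The cosine value disambiguates them.
print(round(cos_2, 3), round(cos_10, 3))  # 0.866 -0.866
```

With only one of the two columns, every value is shared by two points on the cycle; the pair together identifies the position uniquely.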
Let us consider the example from the docs.
When you use your tool to generate features for such a dataset, does it really help a model? I'm wondering if your features can help in situations where the shape you're trying to fit is not a "perfect sine". That's what I mean with "a specific seasonality". Most seasonal patterns that I've seen don't really fit a sine wave. It's usually something like "high in summer, zero in winter" or something else that's smoothly repeating ... but not a sine wave.
But isn't the sine wave you show above some other variable that moves in time? The encoding I suggest is about encoding time itself, to capture information about when in a particular cycle something happened. It is a more appropriate alternative to simply extracting the "hour of day" or "day of the week" as an integer from a timestamp. Therefore, my use of this would be more for flat datasets as opposed to time series. I think indeed for time series the application is limited. A more concrete use case: a security system logs events with a particular type, and we might want our model to learn which types of events occur late at night or in the wee hours of the morning. An integer from 1 to 24 would not suffice here.
So the thing that @tbezemer has here is an order 1 Fourier series, which is indeed not super expressive in and of itself. That said, an order 2 Fourier series can easily replicate @koaning's pattern above:
Maybe it makes sense to include the order as a hyperparameter. That way it is also more in line with the RBF encoder that we already have? I would consider renaming it to FourierSeriesEncoder though, to make it clearer how it encodes rather than what it encodes.
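A rough way to check the claim about the order of the series, as a sketch only: the pattern below is an assumed stand-in for a "high in summer, zero in winter" shape (a half-rectified sine over 365 days), not the dataset from the docs, and `fourier_fit_mse` is a hypothetical helper. On a uniform grid covering one full period the sine and cosine terms are orthogonal, so projecting onto each harmonic gives the least-squares fit.

```python
import math

def fourier_fit_mse(y, order):
    """Fit a truncated Fourier series of the given order to one full
    period of y (uniformly sampled) and return the mean squared error."""
    n = len(y)
    approx = [sum(y) / n] * n  # constant term
    for k in range(1, order + 1):
        cos_k = [math.cos(2 * math.pi * k * t / n) for t in range(n)]
        sin_k = [math.sin(2 * math.pi * k * t / n) for t in range(n)]
        # Projection coefficients for harmonic k.
        a = 2 / n * sum(v * c for v, c in zip(y, cos_k))
        b = 2 / n * sum(v * s for v, s in zip(y, sin_k))
        approx = [p + a * c + b * s for p, c, s in zip(approx, cos_k, sin_k)]
    return sum((v - p) ** 2 for v, p in zip(y, approx)) / n

# Assumed seasonal shape: positive in "summer", exactly zero in "winter".
days = 365
seasonal = [max(0.0, math.sin(2 * math.pi * (d - 80) / days)) for d in range(days)]

mse_1 = fourier_fit_mse(seasonal, order=1)
mse_2 = fourier_fit_mse(seasonal, order=2)
print(mse_1, mse_2)  # the order 2 fit has the smaller error
```

Under these assumptions the second harmonic removes most of the error left by the order 1 fit, which is the argument for exposing the order as a hyperparameter.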
@MBrouns that was indeed the direction I was thinking of. Doing that seems very sensible and I'd certainly welcome a PR with that feature.
There seems to be a bit of radio silence. @tbezemer are you still interested in implementing this?
Yes, definitely! I chatted with Matthijs about this a week ago. It's high up on my to do list. To be continued.
I have extended the transformer to accept a positive integer parameter n_periods (in accordance with the RepeatingBasisFunction transformer). For each period n from 1 to n_periods, a set of two columns (one sine and one cosine transformation) will be produced, where n divides the periodicity of the aspect:
sin(val * 2 * pi / (periodicity / n))
cos(val * 2 * pi / (periodicity / n))
So for the aspect 'hour' with n_periods = 2, a datetime whose 'hour' component is 1 (1am) gives the following calculations:
Period 1: sin(1.0 * 2 * pi / (24.0 / 1)), cos(1.0 * 2 * pi / (24.0 / 1))
Period 2: sin(1.0 * 2 * pi / (24.0 / 2)), cos(1.0 * 2 * pi / (24.0 / 2))
Is this what you meant?
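The n_periods calculation described in this comment can be sketched as a small standalone function. The name `periodic_features` and its argument order are assumptions for illustration; this is not the code from the actual transformer.

```python
import math

def periodic_features(val, periodicity, n_periods=1):
    """Return [sin_1, cos_1, ..., sin_n, cos_n] for a single value.

    For each n from 1 to n_periods the effective period is
    periodicity / n, matching the formula described in the thread.
    """
    features = []
    for n in range(1, n_periods + 1):
        angle = val * 2 * math.pi / (periodicity / n)
        features.extend([math.sin(angle), math.cos(angle)])
    return features

# The worked example from the comment: hour component 1 (1am), n_periods = 2.
hour_1am = periodic_features(1.0, 24.0, n_periods=2)
print(hour_1am)  # four values: sine/cosine for period 24, then for period 12
```

Each extra period adds one sine/cosine pair per aspect, so the output width grows as 2 * n_periods * number of aspects.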
@tbezemer who are you referring to here @MBrouns or myself?
Both of you, since you seemed to be on the same page about extending the transformer in this way, but @MBrouns and I specifically discussed this part, so perhaps it is easier for him to weigh in on this?
I think I'm cool with having a transformer for periodicity, but before you start the PR we can save a lot of review time if we discuss the signatures of the transformer here first. Mainly on my end; @tbezemer could you describe the full input of the object? Maybe list a few examples that demonstrate the main use cases? Once we're agreed on that, the implementation should be very straightforward.
Certainly!
This transformer allows a user to decompose timestamps into their sine and cosine components for each aspect of the timestamp (second, minute, hour, day (in month), weekday, month), and to further fine-tune periodicity within each aspect by specifying n_periods. In other words, it lets us specify at which level/aspect of the timestamp we are interested in learning periodicities, while n_periods fine-tunes the periodicity within that aspect.
def __init__(self, aspects=None, n_periods=1):
    """- aspects can be one or more of: ["second", "minute", "hour", "weekday", "day", "month"].
      If None is specified, the whole list is used.
    - n_periods is a positive integer. For each period 'p' from 1 to n_periods,
      a new set of sine/cosine transformations is produced for each of the passed aspects,
      having periodicity aspect_periodicity / p"""

def fit(self, X, y=None):
    """Fit function only sets trailing underscore variables and saves the shape of X
    (e.g. self.n_periods_ and self.aspects_)"""

def transform(self, X, y=None):
    """- X is expected to be an np.array. Otherwise a pd.DataFrame is assumed and X.values is extracted.
    - X is then checked for conformity to the expected datatype: datetime64.
    - Applies an extractPeriodicFeatures(...) function to each column in the array.
    - Returns transformed X, where X consists of pairs of sine/cosine transformations
      for each aspect, for each period from 1 to self.n_periods_"""
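To make the proposed signature easier to discuss, here is a minimal sketch of how such an object could behave. It is an assumption-laden illustration, not the actual implementation: the class name `PeriodicityEncoderSketch` and the table `ASPECT_PERIODS` are made up, it works on plain `datetime` objects rather than datetime64 arrays, it does not inherit from scikit-learn's base classes, and it treats "day" as having a fixed period of 31.

```python
import math
from datetime import datetime

# Assumed periodicity per aspect; "day" = 31 is a simplification.
ASPECT_PERIODS = {"second": 60, "minute": 60, "hour": 24,
                  "weekday": 7, "day": 31, "month": 12}

class PeriodicityEncoderSketch:
    def __init__(self, aspects=None, n_periods=1):
        self.aspects = aspects
        self.n_periods = n_periods

    def fit(self, X, y=None):
        # Nothing is learned; only validate and store the settings.
        if not isinstance(self.n_periods, int) or self.n_periods < 1:
            raise ValueError("n_periods must be a positive integer")
        aspects = list(ASPECT_PERIODS) if self.aspects is None else list(self.aspects)
        unknown = [a for a in aspects if a not in ASPECT_PERIODS]
        if unknown:
            raise ValueError(f"unknown aspects: {unknown}")
        self.aspects_ = aspects
        self.n_periods_ = self.n_periods
        return self

    def transform(self, X, y=None):
        rows = []
        for ts in X:
            row = []
            for aspect in self.aspects_:
                val = ts.weekday() if aspect == "weekday" else getattr(ts, aspect)
                for p in range(1, self.n_periods_ + 1):
                    # Periodicity for period p is aspect_periodicity / p.
                    angle = val * 2 * math.pi / (ASPECT_PERIODS[aspect] / p)
                    row.extend([math.sin(angle), math.cos(angle)])
            rows.append(row)
        return rows

encoder = PeriodicityEncoderSketch(aspects=["hour", "weekday"], n_periods=2).fit([])
encoded = encoder.transform([datetime(2020, 1, 1, 1, 0)])
print(len(encoded[0]))  # 8 columns: 2 aspects * 2 periods * (sine, cosine)
```

A real version would presumably follow the scikit-learn estimator conventions and validate the dtype of X in `fit`.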
Use case: We have an event log with a time stamp column for which we are trying to predict some outcome variable. We want our model to be able to learn about potential periodicities in this timestamp. For instance, it may be that some type of event mainly occurs at night. Using 24 hour time, night could range from for instance 21:00 to 5:00. Sine and cosine transformations capture that 23:00 and 01:00 are still close to each other on the clock, even though 1 and 23 are almost maximally distanced from each other on the 1-24 ordinal scale that a model may assume when simply passing the hour as an integer.
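The 23:00 versus 01:00 argument can be checked numerically. This is a small sketch with a hypothetical helper `hour_to_circle`; it compares the distance between hours as integers with the distance between their sine/cosine encodings.

```python
import math

def hour_to_circle(hour):
    """Encode an hour of the day as a (sine, cosine) point on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

late, early, noon = hour_to_circle(23), hour_to_circle(1), hour_to_circle(12)

# As integers, 23 and 1 look far apart, further than 23 and 12.
print(abs(23 - 1), abs(23 - 12))  # 22 11

# On the circle, 23:00 is much closer to 01:00 than to 12:00.
print(math.dist(late, early) < math.dist(late, noon))  # True
```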
In my opinion, the added value of this transformer is:
Let me know if you need more information!
A few points on my end.
1. Can aspects be a list, as in PeriodicityEncoder(aspects=['day', 'hour'])? You could also use a FeatureUnion, but is there a use-case where you take one datetime in and you get an array of 4 columns (2 corresponding to day and the other two to hour) out instead of two?
2. I'm not sure "aspects" is the best name here. Might "feature" be preferable?
3. What needs to happen in .fit()? The main thing that needs to happen there is checking if the data format is correct. If nothing is really learned then simply setting a self.fitted_ = True seems sufficient.
4. I'm not sure I understand what the n_periods input variable does. Could you perhaps rephrase what that parameter does?
Ad 1. Yes, indeed! You can pass a list as well. FeatureUnion could be an alternative, but because these transformations all pertain to the same column of datetimes, I think it makes more sense to extract them in one go instead of having to repeat it manually for each aspect. This way, you could also grid search with different combinations of aspects without changing your preprocessing pipeline. If you disagree, I can rewrite it to only allow a single aspect.
Ad 2. We can definitely change aspects to features!
Ad 3. Yeah, that is definitely another way of doing it. I thought that the sklearn convention was to only use trailing underscore variables in the transform step, to ensure that attributes have not changed since first fit, but I see how copying them into a differently named attribute seems a bit redundant. I can change that as per your suggestion!
Ad 4. I hope this image makes it clearer. Forgive my poor handwriting :) . @MBrouns suggested this to me to also allow for higher frequency effects within the total period of each aspect. That, or maybe I horribly misunderstood what he meant, haha.
I've implemented a DateTimePeriodicityEncoder. It is a scikit-learn encoder for datetime features that uses sine and cosine transformations to capture periodicity in datetimes. This type of transformation ensures that an algorithm can learn that 23 hours is close to 00 hours, minute 59 is close to minute 0, etc. It can be used to capture different "aspects" of a datetime (e.g. minute-in-hour, hour-of-day, day-of-week, day-of-month) as such:
For each of the aspects, it returns two new columns containing the respective sine and cosine transformations.
I have written unit tests and it passes the scikit-learn check_estimator (with some tags). @MBrouns asked me to create an issue and tag you, @koaning, to see if this could be a useful contribution for scikit-lego. If so, I can submit a pull request.