sryza / spark-timeseries

A library for time series analysis on Apache Spark
Apache License 2.0
1.19k stars 424 forks source link

Lagging irregular time series #77

Open SimonOuellette35 opened 8 years ago

SimonOuellette35 commented 8 years ago

I am now faced with the problem of regressing on lags of a time series that is irregular. So next I will tackle the problem of lagging an irregular time series. I'd be curious to see if anyone has ideas or opinions on how to handle this problem.

A high-level overview of my proposal at the moment is as follows.

1 - The caller of the function specifies the base frequency to use. We can later add a function that automatically discovers the base frequency by looking at the minimal time gap that exists in the time series. But it's good in any case to have a manual override, so I will start there.

2 - The caller will provide an interpolation function that tells us what to do when there is no value at time (t - lag). This function will have the form (prev, next) => Double, where prev and next will represent the values surrounding time (t - lag) on the lower and upper boundaries.

3 - The resulting lags will look something like this if we specify lag periods of 0.1 and 0.2:

Time, Value 0.1, 100 0.2, 200 0.24, 123 0.25, 300 0.4, 400 0.5, 500

becomes:

Time, Lag0.1(Value), Lag0.2(Value) 0.3, 200, 100 0.4, f(300, 400), 200 0.5, 400, f(300, 400)

where of course f is the user-provided interpolation function.

Notice that the time instant 0.3 gets created even though it didn't exist in the previous dataset. Also, value (0.24, 123) gets completely ignored: there is loss of information because the specified base frequency is lower than the "true" base frequency of the dataset.

SimonOuellette35 commented 8 years ago

The signature of the lags function for Irregular datetime indexes should probably look like:

lags[U](maxLag: Int, lagPeriod: Period, includeOriginals: Boolean, laggedKey: %28K, Int%29 => U = laggedPairKey[K]_, interpolation: %28Double, Double%29 => Double): TimeSeries[U]

I will later create an equivalent for the lagsPerColumn as well.

Note: the date/time arithmetic to fetch the lags will have to be DST-aware (thus use the nscala-time concepts that are DST-aware).

SimonOuellette35 commented 8 years ago

Correction for the above post (the function signature didn't show up correctly):

lags[U: ClassTag]( maxLag: Int, lagFrequency: Frequency, includeOriginals: Boolean, laggedKey: (K, Int) => U = laggedPairKey[K]_, interpolation: (Double, Double) => Double) : TimeSeries[U]

SimonOuellette35 commented 8 years ago

@sryza let me know if you have any thoughts on this

SimonOuellette35 commented 8 years ago

Although now I'm still facing that annoying

"Error:(24, 7) in class TimeSeries, multiple overloaded alternatives of method lags define default arguments. class TimeSeries[K](val index: DateTimeIndex, val data: DenseMatrix[Double], ^"

compilation error... looks like I may have to give this function a unique name, such as lagIrregular

SimonOuellette35 commented 8 years ago

Signature for the "perColumn" version:

lagsPerColumnIrregular[U: ClassTag]( lagsPerCol: Map[K, (Boolean, Int)], lagFrequency: Frequency, includeOriginals: Boolean, laggedKey: (K, Int) => U = laggedPairKey[K]_, interpolation: (Double, Double) => Double) : TimeSeries[U]

SimonOuellette35 commented 8 years ago

I'm realizing the interpolation function signature I proposed above is too narrow to accomodate most useful cases. All it can accomplish is zero-order hold and averaging of the 2 values...

A more useful linear interpolation would need to know about the datetimes corresponding to the 2 boundary values. But even so, most non-trivial interpolation methods require much more than the 2 neighbouring points.

ahmed-mahran commented 8 years ago

In order to resolve this annoying signature issue, I suggest removing the default value from the original lags method

Original lags method

def lags[U: ClassTag](maxLag: Int, includeOriginals: Boolean,
      laggedKey: (K, Int) => U = laggedPairKey[K]_)
    : TimeSeries[U] = {

lags method after removing the default value

def lags[U: ClassTag](maxLag: Int, includeOriginals: Boolean,
 laggedKey: (K, Int) => U): TimeSeries[U] = {

and provide another lags overload without the laggedKey parameter that calls the original lags using laggedPairKey

def lags[U: ClassTag](maxLag: Int, includeOriginals: Boolean): TimeSeries[U] = {
  lags[U](maxLag, includeOriginals, laggedPairKey[K])
}
ahmed-mahran commented 8 years ago

How do you choose the first entry in the lagged index (e.g. 0.3 in your example)? I think that value affects the lagged time series considerably.

I can see that the lagged time series would have a uniform index. So, the irregular time series is implicitly being converted into a uniform time series. Why wouldn't we provide a toUniform utility that converts irregular indexes to uniform? and maybe irregular lags to use this utility internally?

sryza commented 8 years ago

I agree with @ahmed-mahran regarding uniform vs. irregular. lag for an IrregularDateTimeIndex should just use the previous element in the sequence of datetimes composing the index, i.e. it shouldn't implicitly overlay a uniform index. But there should be an easy way to get from an irregular index to a uniform index if that's the user's desire.

SimonOuellette35 commented 8 years ago

Regarding the uniform vs irregular -- fair enough. I suppose I'll call my function toUniform(), as this is the current use case I have. It's true that in a sense what I'm describing is more than just lagging, it's also converting to a uniform time series (explicitly, in fact, because when I instantiate the TimeSeries to be returned, I pass in a new UniformDatetimeIndex instance).

As for lagging irregular time series in an irregular way, I'm not clear what is the "meaning" of that. What does it really "mean" to lag by 0.4 seconds just because the previous tick was 0.4 seconds ago, and then by 1 second because the second previous tick was 1 second ago. I have a hard time imagining a context in which that makes sense, to be honest. Do you guys have any examples?

SimonOuellette35 commented 8 years ago

To answer @ahmed-mahran regarding choosing the first value (0.3), what I do is take the minimum (most anterior) time instant that exists in the series, and I add to it the maximum lag period that is specified by the user (so 0.1 + 0.2 in my example). Does that make sense? Do you have alternative approaches in mind that could work better?

SimonOuellette35 commented 8 years ago

BTW there's a mistake in my perColumn signature function above: the includeOriginals parameter should be removed.

SimonOuellette35 commented 8 years ago

Another important point I just realized: the interpolation function should be specific to each individual variable within the multivariate time series.

For example, if my 2 individual time series are for the price of a stock and the volume traded for that stock, then the interpolation for the price might be zero-order hold (i.e. last traded price), but the interpolation for the volume would probably be a summation of the values since the last regularized tick (this is especially relevant if we are downsampling the time series)...

ahmed-mahran commented 8 years ago

Regarding lagging irregular ts in irregular way, I think the nearest thing to @sryza's idea is to add the lag value to each instant with which to expand the time index. However, that would lead to a collection of sparse time series with a lot of NA's. Applying this on your example,

if we specify lag periods of 0.1 and 0.2:

Time Value
0.1 100
0.2 200
0.24 123
0.25 300
0.4 400
0.5 500

becomes:

Time Lag0.1(Value) Lag0.2(Value)
0.2 100 NA
0.3 200 100
0.34 123 NA
0.35 300 NA
0.4 NA 200
0.44 NA 123
0.45 NA 300
0.5 400 NA
0.6 500 400
0.7 NA 500
ahmed-mahran commented 8 years ago

After that the user is free to carry out NA imputation, maybe time series compression. Then toUniform if desired.

An alternative solution is call toUniform then lag.

What do you think guys; I think we should open another separate issue for converting irregular to uniform?

souellette-faimdata commented 7 years ago

I implemented a function for irregular lagging based on a time frequency/period that preserves indices rather than resampling. However it builds on top of the function referenced in #183 so I will need to have these changes incorporated before submitting my solution to this for approval.