scikit-hep / awkward-0.x

Manipulate arrays of complex data structures as easily as Numpy.

Processing Time Series Data #232

Closed matteogales closed 4 years ago

matteogales commented 4 years ago

I am looking at using Awkward Array for processing time series data.

The data analysed may contain several features/attributes, each feature may be measured at a different frequency, and there may be a different number of measurements for each, e.g. feature 'x' may be measured hourly but feature 'y' may be measured daily, and each time series contains the measurements for a given day or week.

There may not be a measurement at each time slot for each feature e.g. feature 'x' may correspond to the measurement from some sensor that provides a reading every 5 minutes over a given day, and perhaps the battery for the sensor was dead for an hour or so.

A set of time-series for all features may be grouped together as a "case", where each case may have (for classification problems) a corresponding classification e.g. x and y are features giving the position of different points on a person who is moving, and each set of measurements may be classified as representing a particular action or not. There may be several cases in a data set, and each case may not necessarily contain a time series for all features.

Given the above, I thought of using a scheme where each time series for a feature is represented as a SparseArray (so that I can differentiate between a lack of a data point and NaN), and the time series are stored in a Table where the values in a column represent the different time series for a feature (or the classification, if appropriate) and the values in a row represent the time series for a particular case. If a feature is not present for a case, then None is used.

Hopefully, the below code illustrates my initial thoughts.

import awkward
from awkward import SparseArray, Table
import numpy as np

# Define the time series for feature 'x'
# Each one corresponding to the measurement of x in a particular "case"

x_0 = SparseArray(4, [0, 1, 2, 3], [0.0, 1.0, 2.0, 3.0], default=np.NaN)
x_1 = SparseArray(3, [0, 1, 2], [2.0, 3.0, 4.0], default=np.NaN)
x_2 = SparseArray(4, [0, 1, 3], [3.0, 4.0, 6.0], default=np.NaN)
x_3 = None

# Define the time series for feature 'y'
# Each one corresponding to the measurement of y in a particular "case"

y_0 = SparseArray(3, [0, 1, 2], [4.0, 5.0, 6.0], default=np.NaN)
y_1 = SparseArray(3, [0, 1, 2], [5.0, 6.0, 7.0], default=np.NaN)
y_2 = SparseArray(3, [0, 1, 2], [6.0, 7.0, 9.0], default=np.NaN)
y_3 = SparseArray(3, [0, 1, 2], [6.0, 7.0, 9.0], default=np.NaN)

# Build the table to hold the time-series
# Columns hold the data for different instances of features, and rows hold cases

xs = awkward.fromiter([x_0, x_1, x_2, x_3])
ys = awkward.fromiter([y_0, y_1, y_2, y_3])
table = Table(x=xs, y=ys)

When preprocessing this data (i.e. before applying some classification or forecasting model), some transformations may be applied to each value, and I have found the NumPy ufuncs already supported by Awkward very useful for this, as were the examples provided for designing a custom ufunc using Numba.

However, there are some occasions when I want to calculate some value from the entire data for some time-series e.g. using a simple linear regression for calculating the slope for all the time-series for a specified feature (i.e. across cases), and/or for all features in a given case.

I know that I can use the properties of the Table for iterating across a given row or column, or all rows/columns, and apply such a function to each time series as I go. However, I wanted to know if there is functionality in Awkward that would allow me to more efficiently apply a specified function across a column/row (like the 'apply' function in Pandas), or if I could perhaps represent the data in a more efficient way than the above that would help me do so?

Any thoughts very much appreciated!!

jpivarski commented 4 years ago

If you're just trying to represent missing data, I would strongly recommend MaskedArray or IndexedMaskedArray over SparseArray. The idea of SparseArray is that the number of elements is so large that it couldn't be allocated in memory (e.g. trillions) but only a tiny minority are actually filled. SparseArray is the only array type that does not have constant-time lookup (it's O(log n)).

Maybe your actual application isn't all floating-point, but if it is, there's nothing more efficient than a NumPy array using np.nan for missing values.
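
(For illustration, a minimal sketch of the MaskedArray option in awkward 0.x, assuming the MaskedArray(mask, content, maskedwhen=True) constructor; the values are made up.)

import awkward
import numpy as np

# feature 'x' for one case, where the third reading is missing rather than NaN
content = np.array([0.0, 1.0, 999.0, 3.0])       # placeholder where data is missing
mask    = np.array([False, False, True, False])  # True marks a missing entry
x_0 = awkward.MaskedArray(mask, content, maskedwhen=True)
print(x_0)  # the masked entry shows up as None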

For "applying a specified function across a column/row," it sounds like you're interested in ufuncs, but you already mentioned them earlier, so I think you mean something different. Do you mean cumulative functions whose value at i could depend on everything from 0 up to i? If so, that's a "loop carried dependency" and Numba would be your best bet. If it's a function whose value at i only depends on array quantities at i (possibly from multiple arrays, or equivalently, multiple fields of a Table), then the best way to compute it is to build it up from NumPy ufuncs.

Keep in mind, though, that Pandas's apply(lambda) is not any more efficient than a for loop—it can't be, it still has to evaluate the Python function you give it at every row, and it's the Python evaluation that's the bottleneck. NumPy ufuncs and Numba JIT-functions get their efficiency from compilation. NumExpr is a very lightweight virtual machine (almost compilation) that can beat NumPy's precompiled functions because it makes fewer passes over the same data in memory. Pandas's apply(string) uses NumExpr. We've been talking about adding NumExpr support to Awkward 1.0; it would be similar to the ufunc implementation.
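
(For context, a NumExpr call looks roughly like this; the arrays here are just placeholders.)

import numexpr as ne
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# the whole expression is evaluated in one pass over the data,
# instead of one pass (and one temporary array) per operator as in plain NumPy
result = ne.evaluate("2*a + 3*b - 1")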

matteogales commented 4 years ago

Thanks for the quick response, and advice regarding usage of SparseArray. I suspect (this may see the light of day in a general purpose library, so can't confirm) that most of the data will be represented as floating-point values and that the values that are present will generally greatly outnumber those that are missing, so MaskedArray seems like a better option.

While the ufuncs and reducers are very useful, there are some functions that we would like to apply which require access to all the data points of a time series (i.e. one MaskedArray within a Table row) and could not be built up incrementally the way the reducers are. For example, we may want to apply a function that calculates the slope of a time series to the entries in a Table row and get back an array of floats, where each float represents the slope of a time series in that row.

I appreciate that a function still needs to be applied to each value in an array (where each value, as described above, would itself be an array representing a time series), and that the result of applying such a function could be either a float or an array of a different length (such as a moving average, which produces a shorter array because the first few positions do not have enough previous values to generate a result), so we would be generating a new array in any case.

However, as well as providing a convenience, some blog posts that I came across said that the Pandas apply function also takes advantage of a number of internal optimisations, such as using iterators in Cython, and timings indicated that it provides a noticeable speed-up over a plain for loop in Python, so I wondered if there is some equivalent that I could make use of.

The NumExpr support coming in Awkward 1.0 sounds interesting, but I guess that, like the ufunc support, these functions get applied to the individual values of the most deeply nested array, rather than to the values (which themselves could be arrays) in an array at a given level of nesting; I hope that makes sense.

jpivarski commented 4 years ago

I think I'm beginning to get what you're looking for: do you mean windowing functions? That is, the value at index i depends on indexes in i-window:i+window or perhaps just i-window:i (where i represents the present and you don't know the future). So for instance, when you're talking about "the slope of a time series," do you mean a local slope? Like, a slope that depends on i and its neighbors?

Awkward doesn't have any specific tools for functions based on sliding windows. Pandas sure does: there's a whole DataFrame.rolling subsystem for these sorts of functions. It's not that Awkward couldn't have sliding window capabilities, but Awkward's main focus is on nested data structures, and it's not clear to me how sliding windows and nested data should interact. (Should the sliding window ride the surface above all the data structures, or somehow descend into them in a sliding way?) I'd have to see a use-case.
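
(For reference, a rolling computation in Pandas looks something like the sketch below; the window length and data are arbitrary.)

import pandas as pd
import numpy as np

s = pd.Series(np.random.rand(1000))

# the value at index i is computed from the trailing window ending at i (NaN until the window fills)
rolling_mean = s.rolling(window=10).mean()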

It sounds to me like your use-case doesn't involve many (any?) nested data structures, either. Unless you explicitly foresee a need for anything beyond columns of possibly missing numbers, I would strongly recommend using Pandas. It was designed specifically with time series data and missing values in mind. The fact that it has a Series data type extractable from the R-like DataFrame, and that these (unlike SQL) are sensitive to order and come with rolling functions, is specifically for time series data. Every column is nullable (also unlike SQL) so that missing data can be handled ubiquitously. You would need a strong argument not to use Pandas for this case.


The point about the efficiency of apply is tangential to your main concern, but I have a "first-order approximation" of Python efficiency that has worked rather well: code is slow if O(n) Python instructions need to be called for a dataset of size n and fast if O(1) need to be called. Thus, NumPy has one Python call to enter a compiled loop, but a Python for loop has as many Python calls as the data you're iterating over.
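
(A toy illustration of that first-order rule:)

import numpy as np

values = np.random.rand(1_000_000)

# O(n) Python instructions: the interpreter executes one add per element
total = 0.0
for v in values:
    total += v

# O(1) Python instructions: one call that enters a single compiled loop
total = np.sum(values)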

Unless Pandas inspects the contents of your Python lambda function and replaces it with something else (which is what Numba does), it's in the O(n) category. They might use iterators in Cython to set up a row for iteration, but to first-order, it doesn't matter if the user-defined function is calling Python in every step. Pandas can only speed up the set-up, not the evaluation of a slow function.

Since you're trying to optimize your project and it sounds like Pandas is the right tool for your use-case, I would look for ways to build your functions out of predefined, precompiled pieces. For instance, if you have windowing functions to compute the sums sumw, sumwx, sumwxx, sumwy, and sumwxy used below (the windowed sums of the weights, x, x**2, y, and x*y, with unit weights in the simplest case), then you can compute linear fits entirely in ufuncs:

delta         = (sumw * sumwxx) - (sumwx * sumwx)
intercept     = ((sumwxx * sumwy) - (sumwx * sumwxy)) / delta
intercept_err = sqrt(sumwxx / delta)
slope         = ((sumw * sumwxy) - (sumwx * sumwy)) / delta
slope_err     = sqrt(sumw / delta)

(I had to dig up some 8-year old code to remember how I used to do that.) Stay within a collection of precompiled or NumExpr-optimized expressions, and you won't suffer the penalty of the Python VM.
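
(As a concrete, if hypothetical, sketch of that strategy in plain NumPy: the windowed sums can themselves be built from cumulative sums, after which the fit formulas above are just ufuncs. The helper name windowed_sum is made up.)

import numpy as np

def windowed_sum(a, w):
    # sum of a over each trailing window of length w, via a cumulative sum
    c = np.cumsum(np.concatenate([[0.0], a]))
    return c[w:] - c[:-w]

w = 20
x = np.arange(1000, dtype=np.float64)
y = 2.5 * x + np.random.normal(0, 1, 1000)

sumw   = float(w)
sumwx  = windowed_sum(x, w)
sumwxx = windowed_sum(x**2, w)
sumwy  = windowed_sum(y, w)
sumwxy = windowed_sum(x*y, w)

delta = (sumw * sumwxx) - (sumwx * sumwx)
slope = ((sumw * sumwxy) - (sumwx * sumwy)) / delta   # local slope in each window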

matteogales commented 4 years ago

I would generally agree with you on Pandas; it is a great library that provides a lot of functionality for storing and processing time series data. For a lot of our use cases, I'm sure that using a DataFrame would be sufficient for the needs of the problem. However,...

In a general setting the time series in a problem domain can be heterogeneous where time indices vary across cases and/or features (e.g. unequal length series where time indices may not align). In a classification problem, one feature for a case may also have time series of length 1 (a classification label) whereas all the other features may have time series of different lengths.

While for an individual case we could use a DataFrame where the columns represent the time series for that case, with a special column (or possibly a piece of metadata) for any classification value, we may need to pad the columns extensively depending on the time series in the case and signal when a value is missing as opposed to NaN (although the sparse data structures and the new value for distinguishing between NaN and missing values added to Pandas in more recent versions would certainly help in that respect), and we would possibly need to store several DataFrames when a data set has multiple cases.

You can find a more involved discussion of why a Pandas DataFrame (at least in its conventional form) was not chosen at the URLs below. Currently (as you can see, the data container representation has been an ongoing debate for some time!) we use a Pandas DataFrame where each entry in a column is a (nested) Pandas Series. As well as proving unintuitive to new users (although we provide helper functions to convert between traditional DataFrames/NumPy arrays and this nested representation), the nesting and un-nesting process can also be a performance bottleneck for large problems where a pipeline of data transformations is used.

https://github.com/alan-turing-institute/sktime/wiki/Data-representation-in-sktime
https://github.com/alan-turing-institute/sktime/issues/15

It is because of the above that we thought Awkward Arrays might help. You are right that the level of nesting is shallow (essentially a 2-D array represented as a Table, where each entry in the table is a MaskedArray, but where an array may have a different length from other table entries), but the ability to represent time series of different lengths natively, and to efficiently apply functions across them, could be very useful and help with some of the current issues.

The windowing functions could be useful, but more often (at least in the examples I have seen) so could functions that use the entire set of data points of a time series (MaskedArray), such as applying a simple linear regression to an X-Y time series and returning the slope (i.e. the X coefficient). In terms of the output (but not the processing), the return value in that case would be the same as applying a reducer such as mean() to the table, i.e. a 2-D array of floats, except that the floats would represent the slope of each time series instead of the mean value.

From what you are saying, it looks like even if there were an equivalent of the Pandas apply() function in Awkward it would not give us that much, and we should instead look at ways of optimising the evaluation of functions that are not among the supported ufuncs or reducers, possibly using some of the techniques that you kindly listed above. Thank you for the pointers; I'll have a look through the code that you linked to later on to see if we can make use of it.

jpivarski commented 4 years ago

Okay, I'll take your word for it that you need nested data structures. At least variable-length lists.

functions that use the entire set of data points for a time series

Isn't that what ufuncs and reducers are? Like, if you want the intercept of a whole dataset at once, isn't that

>>> import numpy as np
>>> x = np.linspace(3, 10)
>>> y = np.linspace(5, 15)
>>> sumw = len(x)
>>> sumwx = np.sum(x)
>>> sumwxx = np.sum(x**2)
>>> sumwy = np.sum(y)
>>> sumwxy = np.sum(x*y)
>>> delta = (sumw * sumwxx) - (sumwx * sumwx)
>>> intercept = ((sumwxx * sumwy) - (sumwx * sumwxy)) / delta
>>> intercept
0.7142857142856882

using only ufuncs? Since this uses only ufuncs, you can pass it through any Awkward Array (if the data hadn't been flat, all of the above would still hold; except that you'd have to call .sum().sum() on jagged arrays or otherwise determine your favorite value in each nested list). Since it's not a Python lambda, there's no Pythonic slow-down.

If this isn't what you're looking for and rolling windows aren't what you're looking for, I don't know what you're looking for.

matteogales commented 4 years ago

I’ll try and provide an example that hopefully helps (although I would also completely understand if you wanted to wash your hands of all of this!)...

Say we have a data set which has 2 features, x and y, and 2 cases (i.e. where x and y were measured together on 2 separate occasions). In this example, the length of the time series for each feature varies across cases, and between features within a case. We could represent this very nicely as a nested array of floats similar to the below using Awkward.

[ [x1, x2, x3]         , [y1, y2, y3, y4, y5],
  [x1, x2, x3, x4, x5] , [y1, y2, y3, y4, y5, y6, y7, y8, y9, y10] ]

I can define a ufunc using Numba, say f(float) -> float, that runs very efficiently with Awkward. However, looking at what I get back, it seems to be akin to the below (i.e. the function is applied to each scalar value independently).

[ [f(x1), f(x2), f(x3)]               , [f(y1), f(y2), f(y3), f(y4), f(y5)],
  [f(x1), f(x2), f(x3), f(x4), f(x5)] , [f(y1), f(y2), f(y3), f(y4), f(y5), f(y6), f(y7), f(y8), f(y9), f(y10)] ]

What I would like to do is define a generic function, say g([float])->float, that gets applied to each array nested at the first level of nesting.

[ g([x1, x2, x3])         , g([y1, y2, y3, y4, y5]),
  g([x1, x2, x3, x4, x5]) , g([y1, y2, y3, y4, y5, y6, y7, y8, y9, y10]) ]

Reducers at first seemed to do roughly the above. However, from what I can determine from blog posts, these functions are evaluated by applying a binary function across an array, using the previous result as the first argument after the first invocation, which would seem to limit the type of functions that can be defined (?).

More generically, such as in windowing functions, it would also be very useful if the function applied to each array could itself also return an array i.e. g([float]) -> [float]. In this case, I guess that a reducer would also not work (?).

If the above is correct (Python is not my main programming language, so please do correct me if any of the above/below is wrong), then ufuncs or reducers don't seem able to express more generic functions. If that is the case, then this would leave the option of creating a simple routine that:

  1. Creates a new 2-D array with the same number of rows/columns as the top-level array above.
  2. Inserts the results of the provided function (i.e. g([float]) -> float or g([float]) -> [float]) as elements into the array created in 1.
  3. Returns the new 2-D array created in 1 to the caller.

If there is some way that I can do the above using ufuncs or reducers then any pointers would be very much appreciated. This is purely to improve speed, as iterating through the top-level array is very easy, but I thought I should check, since the library provides some very nice/rich features and NumPy is certainly not my area of expertise, so I could well be missing something.

jpivarski commented 4 years ago

Right: g([x1, x2, x3]) is more general than a reducer, which is r(r(x1, x2), x3). It's even more general than a fold, which relaxes the data type by taking an identity r0 and computing r(r(r(r0, x1), x2), x3); now the type signature of r can be r : (X, Y) → X, rather than r : (X, X) → X.
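
(In plain Python terms, with functools.reduce standing in for both patterns:)

from functools import reduce

x = [3.0, 1.0, 2.0]

# a reducer r : (X, X) -> X, applied pairwise: r(r(x1, x2), x3)
print(reduce(max, x))                            # 3.0

# a fold r : (X, Y) -> X with an identity r0: r(r(r(r0, x1), x2), x3)
print(reduce(lambda count, _: count + 1, x, 0))  # 3, an int built from floats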

Your function g might do different things with x3 than it does with x1—in general, it can't be expressed in terms of ufuncs ("map") and reducers.

However, if the function g you're interested in happens to be something like mean, variance, covariance, or a linear fit, then it can be expressed as ufuncs and reducers. To do the same example with depth-1 jagged arrays (as in your example), you could compute intercepts like this:

>>> x = awkward.fromiter([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0, 5.0]])
>>> y = awkward.fromiter([[1.1, 2.3, 3.0], [0.9, 1.9, 3.2, 3.9, 5.3]])   # same lengths
>>> sumw = x.count()
>>> sumwx = x.sum()
>>> sumwxx = (x**2).sum()
>>> sumwy = y.sum()
>>> sumwxy = (x*y).sum()
>>> delta = (sumw * sumwxx) - (sumwx * sumwx)
>>> intercept = ((sumwxx * sumwy) - (sumwx * sumwxy)) / delta
>>> intercept
array([ 0.23333333, -0.2       ])

The .count() and .sum() reducers turn depth-N jagged arrays into depth-(N-1) jagged arrays (where depth-0 is a flat NumPy array). If you wanted to do something simpler like mean, it would be sumwx/sumw, a ufunc (/) of two arrays produced by reducing. Mean, variance, covariance, and linear fits are not reducers, but they can be expressed in terms of ufuncs and reducers. In fact, this is explicit in the Awkward Array code, which defines mean, var, and std in terms of ufuncs and reducers, but they are not themselves reducers.

To be clear, this would not be possible for any arbitrary function g. For instance, it's impossible for median. For a general function, I think you'd have to write Python or Numba. But you were also talking about linear fits and so I don't want you to miss this important special case.
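
(For a genuinely general g, a plain Python loop over the outer dimension is the straightforward fallback; here is a rough sketch using np.median, which can't be composed from ufuncs and reducers.)

import awkward
import numpy as np

a = awkward.fromiter([[1.0, 2.0, 3.0], [5.0, 1.0, 4.0, 2.0]])

# one Python call per sublist: fine for a modest number of time series,
# but this is the O(n)-Python-calls regime discussed above
medians = np.array([np.median(np.asarray(row)) for row in a])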

More generically, such as in windowing functions, it would also be very useful if the function applied to each array could itself also return an array i.e. g([float]) -> [float]. In this case, I guess that a reducer would also not work (?).

I think what you're asking for here can't be done in terms of ufuncs and reducers unless it's a ufunc itself. Reducers necessarily reduce the dimension, so g([float]) couldn't be turned into [float]. Ufuncs can satisfy this signature if you were interested in strict mapping, which you're not.

There isn't (right now) a good way to do window functions in Awkward.

matteogales commented 4 years ago

Thanks for the additional information; very useful, and educational.

I had already looked at the ufuncs and reducers defined in the library, and they do correspond to several of the functions that we would like to apply. Initial timings suggest that using Awkward would provide a significant performance increase over our existing implementations.

I also suspect that several of the functions we would like to apply can be expressed in terms of ufuncs and reducers (including linear fits), so having a reference to the relevant pieces of code will be very useful when we look at implementing those; thank you for that.

I am hopeful (TBC) that even for more generic functions being able to more easily slice and dice through nested arrays will prove more efficient than our current scheme. If that proves to be the case (or at least, if it is no worse), Awkward would seem a good fit for our use case.

Thank you for developing such a useful resource.