ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

Sequences with multiple features? #628

Open baasman opened 4 years ago

baasman commented 4 years ago

I do a lot of time series analysis using Keras and RNN layers. Generally, we have more than one feature per time step, and thus I structure my data as such:

[[[1,2], [2, 1]], [[3, 4], [2, 6]]]

where the dimensions are number_of_timesteps * number_of_features * sample_size, so in this case we have a sample of two, each with two time steps and two features at each time step. Is it possible to model this using Ludwig, and if so, how would I structure my pandas dataframe to be able to properly process this?

ifokeev commented 4 years ago

You have a very interesting case. Explain more, please.

baasman commented 4 years ago

I understand (and see examples of) how to structure a univariate time series in a pandas DataFrame. It looks like we simply create a space-delimited string representing the sequence in each cell. However, what if we have a multivariate case, where we are interested in the interaction between the features and thus want to pass multiple values at each timestep? I work in healthcare, so an example would be having both heart rate and blood pressure for a 24-hour period, split into one-hour chunks, and trying to predict whether or not an event happens following that 24-hour period (many to one). That is different, unless you tell me otherwise, from creating many univariate RNN encoders.

Being able to pass a cell like the following to an RNN encoder would be very valuable: 1 2, 6 1, instead of two separate cells with separate encoders: 1 6 and 2 1
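
For reference, here is a minimal sketch of the "separate cells" layout described above. It assumes the array is ordered samples x timesteps x features, as in the example array from the first post. The column names `heart_rate` and `blood_pressure` are made up, and `to_sequence_columns` is a hypothetical helper, not part of Ludwig.

```python
import numpy as np
import pandas as pd

# Assumed layout: (samples, timesteps, features), matching the example array.
data = np.array([[[1, 2], [2, 1]], [[3, 4], [2, 6]]])


def to_sequence_columns(array, feature_names):
    """Return a DataFrame with one space-delimited sequence column per feature.

    Hypothetical helper for illustration only.
    """
    n_samples, n_steps, n_features = array.shape
    if len(feature_names) != n_features:
        raise ValueError("need exactly one column name per feature")
    columns = {}
    for j, name in enumerate(feature_names):
        # array[:, :, j] has shape (samples, timesteps): one sequence per row.
        columns[name] = [" ".join(str(v) for v in row) for row in array[:, :, j]]
    return pd.DataFrame(columns)


df = to_sequence_columns(data, ["heart_rate", "blood_pressure"])
print(df)
```

Each row is one sample and each cell holds the whole sequence for one feature, so every cell in a row has the same number of values.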

ifokeev commented 4 years ago

@baasman why don't separate cells work?

w4nderlust commented 4 years ago

@baasman thanks for the detailed explanation. We are working on a multivariate timeseries input type that will handle exactly your case (although the order of dimensions will be batch_size x length of the sequence x features). It is pretty easy to implement. If someone wants to help out on this, it would be great; I can give precise and easy instructions on how to go about it.

At the same time, you can already model a multivariate time series in a slightly different way. Assuming your multivariate time series has 3 dimensions, you would need 3 input features in your Ludwig configuration file (make sure all rows have the same length!). Each of them can be encoded separately, but if you don't want that, you can use the PassthroughEncoder by specifying encoder: passthrough or encoder: null in your YAML file, and also specifying reduce: null. Then add a sequence combiner to the combiner section of your model definition (see details in the User Guide). It aligns the features along the second dimension (the length of the sequence), concatenates them at each step, and then provides the result to a sequence encoder (all of the usual ones are available: rnn, parallel_cnn, stacked_cnn, etc.). You have to specify:

combiner:
    type: sequence
    encoder: your_choice
    ... encoder parameters ...

This also works for other types of sequential input features. If, for instance, you have a class assigned to each step of your time series, this approach will allow you to use both.
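
As a rough sketch of how the pieces above could fit together, the snippet below builds the whole configuration as a Python dict and prints it. Several names are assumptions rather than confirmed Ludwig API: the feature names, the `event` target, the `timeseries` and `binary` feature types, and the `reduce_output` key (the comment above writes it as `reduce`). `build_config` is a hypothetical helper, and key names may differ between Ludwig versions, so check the User Guide for yours.

```python
import json

# Hypothetical column names; each column holds one space-delimited sequence per row.
feature_names = ["heart_rate", "blood_pressure"]


def build_config(feature_names, target, sequence_encoder="rnn"):
    """Build a config dict following the approach described in the comment above.

    Hypothetical helper; key names are assumptions and may vary by Ludwig version.
    """
    input_features = [
        {
            "name": name,
            "type": "timeseries",      # assumed feature type
            "encoder": "passthrough",  # pass raw values through, no per-feature encoding
            "reduce_output": None,     # assumed key name; keep the sequence dimension
        }
        for name in feature_names
    ]
    return {
        "input_features": input_features,
        # The sequence combiner joins the features step by step,
        # then runs one sequence encoder over the result.
        "combiner": {"type": "sequence", "encoder": sequence_encoder},
        "output_features": [{"name": target, "type": "binary"}],  # assumed target
    }


config = build_config(feature_names, target="event")
print(json.dumps(config, indent=2))
```

Swapping `sequence_encoder` for `parallel_cnn` or `stacked_cnn` changes only the combiner entry; the input features stay the same.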