Help defining the problem space

tomthetrainer commented 7 years ago

@turambar @bpark738

If you could answer these questions, which are the kind a student would ask, I can incorporate the answers into the class.

What is a Sequence?

Are my log files a sequence, are my financial transactions a sequence? Why and how do they differ?

RNN gets sequence in, what does it emit as output?

Univariate, multivariate, resampling: What are these and when do I need to resample?

What is the relationship of the time series to the input layer? In an MLP, if I have 4 features then I have 4 inputs; what is the relationship in an RNN? Is the input to an MLP a vector, a value, or a tensor, what is its shape, and how does that compare with the shape of an RNN's input?

If I have log files with events that occur randomly with millisecond granularity, how do I handle that? Split on the second? Summarize per second? Split on the millisecond? What is the effect of those decisions on the complexity of my NN?

Processing input seems a lot like "expert feature engineering" is it? Or can it be simplified in some way?

What about video content? Feels like an RNN, since it is a sequence of images, but perhaps a CNN because they are images?

If my sequence size is fixed, say temperature reading every minute from some lab experiment and I split that on day, would an MLP be as useful or do I need an RNN?

bpark738 commented 7 years ago

I tried my best to answer. Let me know if something is unclear.

@turambar Please butt in if something seems off.

What is a Sequence? An ordered list of numbers or objects. For example, a univariate sequence of numbers is 10, 8, 11, 12, ... and a multivariate sequence of numbers is (1, 3, 4), (3, 5, 10), (13, 15, 17), ...

Are my log files a sequence, are my financial transactions a sequence? Why and how do they differ? I would say yes to both. A log file can be thought of as a sequence of text, and financial transactions as a sequence of numbers. Both are ordered lists of objects. They differ in the kind of object they contain.

Univariate, multivariate, resampling: What are these and when do I need to resample?

Univariate data: data involving one variable.

Multivariate data: data involving two or more variables.

Resampling: In the PhysioNet preprocessing, resampling divides the irregular timesteps of a time series into regular intervals (buckets) and takes the mean value of the timesteps within each bucket (one mean per variable if the data is multivariate). If a bucket contains no timestep, the previous value is propagated forward (not sure if that is the previous bucket value or the previous timestep value, Dave?). I think resampling is optional. It may help classification since timesteps become regular. In resampled data, the Time and Elapsed columns are removed since they are no longer needed: the timesteps are now regularly spaced and sized. Elapsed originally gave the time since the previous timestep in the original data, which is meaningless once the data is resampled.
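The bucketing described above can be sketched in a few lines. This is only an illustration, not the PhysioNet preprocessing code: the function name and the readings are made up, the data is univariate, and an empty bucket repeats the previous bucket's mean, which is an assumption since the thread leaves that choice open.

```python
from collections import defaultdict

def resample_mean_ffill(observations, bucket_size, n_buckets):
    """Bucket irregular (time, value) pairs into regular intervals.

    Each bucket holds the mean of the values that fall inside it. An empty
    bucket repeats the previous bucket's value (an assumed choice).
    """
    buckets = defaultdict(list)
    for t, value in observations:
        buckets[int(t // bucket_size)].append(value)

    resampled = []
    previous = None  # stays None until the first non-empty bucket
    for i in range(n_buckets):
        if buckets[i]:
            previous = sum(buckets[i]) / len(buckets[i])
        resampled.append(previous)
    return resampled

# Hypothetical readings at irregular times: (minutes elapsed, value).
readings = [(0.5, 80.0), (0.9, 84.0), (3.2, 90.0)]
per_minute = resample_mean_ffill(readings, bucket_size=1.0, n_buckets=4)
print(per_minute)  # [82.0, 82.0, 82.0, 90.0]
```

For multivariate data the same steps would be applied to each variable separately.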

RNN gets sequence in, what does it emit as output? It depends on the task. For the PhysioNet data, the target was a single univariate value, mortality, which is not a time series. The RNN therefore emits one number at the end of the input sequence. If the target is univariate but is itself a time series, the RNN emits one number at each timestep, resulting in a sequence of outputs.
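A minimal sketch of the two cases, assuming a toy one-unit tanh RNN with hand-picked weights. The function name, weights, and data are invented for illustration and are not the model used in the workshop.

```python
import math

def rnn_outputs(sequence, w_in, w_rec, w_out):
    """Run a one-unit tanh RNN and return the output at every timestep."""
    hidden = 0.0
    outputs = []
    for features in sequence:
        # The new hidden state mixes the current features with the previous state.
        pre_activation = sum(w * x for w, x in zip(w_in, features)) + w_rec * hidden
        hidden = math.tanh(pre_activation)
        outputs.append(w_out * hidden)
    return outputs

# Hypothetical multivariate sequence: 3 timesteps, 2 features each.
sequence = [(1.0, 0.5), (0.2, 0.1), (0.0, 0.3)]
per_step = rnn_outputs(sequence, w_in=(0.4, -0.2), w_rec=0.5, w_out=1.0)

sequence_output = per_step   # one output per timestep (time series target)
final_output = per_step[-1]  # a single output at the end of the sequence
print(len(sequence_output))  # 3
print(final_output)
```

The network is the same in both cases; the task decides which of its outputs are compared against a target.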

What is the relationship of the time series to the input layer? In an MLP, if I have 4 features then I have 4 inputs; what is the relationship in an RNN? Is the input to an MLP a vector, a value, or a tensor, what is its shape, and how does that compare with the shape of an RNN's input? In an RNN, the input layer takes the same 4 features as an MLP would. The difference is that the hidden layer of an RNN also has connections to its own state at the previous timestep, so the RNN takes in features one timestep at a time. The input layer of an RNN is therefore like the input layer of an MLP applied at each timestep: 4 features per timestep means 4 inputs.

If I have log files with events that occur randomly with millisecond granularity, how do I handle that? Split on the second? Summarize per second? Split on the millisecond? What is the effect of those decisions on the complexity of my NN? I am unsure of the difference between splitting on the second and summarizing per second, and of the best way to handle this. But splitting on the millisecond rather than the second makes the data more granular, so I would expect the NN's task to become more complex, since it has to fit more detailed data.

Processing input seems a lot like "expert feature engineering" is it? Or can it be simplified in some way? I think it depends on what you're doing to the input. For example, I wouldn't consider extracting the hour of day from a date time variable to be expert feature engineering even though that counts as processing input.
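The hour-of-day case mentioned above takes only the standard library. A small sketch, where the function name and the timestamp are hypothetical:

```python
from datetime import datetime

def hour_of_day(timestamp_text):
    """Parse an ISO 8601 timestamp and return its hour (0-23)."""
    return datetime.fromisoformat(timestamp_text).hour

print(hour_of_day("2017-06-26T14:35:00"))  # 14
```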

What about video content? Feels like an RNN, since it is a sequence of images, but perhaps a CNN because they are images? I think you could combine the two ideas, as in a convolutional recurrent neural network.

If my sequence size is fixed, say temperature reading every minute from some lab experiment and I split that on day, would an MLP be as useful or do I need an RNN? Not sure what you mean by splitting on day. Do you mean you take the average temperature reading per day? If so, the data would still be a sequence, so I think an RNN would be useful (assuming there are temperature readings from different days).

eraly commented 7 years ago

What is a Sequence? SE - Anything with an inherent ordering is a sequence. Most often the ordering comes from time, but not necessarily: the words of a sentence such as "this is a sentence" are ordered by position rather than time.

Are my log files a sequence, are my financial transactions a sequence? Why and how do they differ? SE - I would argue they are both sequences since they both track some activity across time and the order of this activity matters. Obviously the right kind of preprocessing would have to be applied before they can be turned into a sequence.

RNN gets sequence in, what does it emit as output? SE - Technically the RNN is capable of emitting an output at every "step", so the output is also a sequence. But depending on how we have architected training, we might be interested in only the output at the last step, the last few steps, or all the steps. Karpathy's blog (and I think we have them in our slides too) has a good picture of the different architectures.

Univariate, multivariate, resampling: What are these and when do I need to resample? SE - In general, univariate means we model just one dependent (output) variable based on some input data. Multivariate is when we model a group of dependent variables from some input data. I am not sure what resample means in this context. Sampling or resampling usually refers to the frequency or rate at which we capture data from a sequence. There is a lot of theory around this (minimum sampling rates and so on), but I don't think any of it applies in this context. Or maybe it does?

What is the relationship of the time series to the input layer? In an MLP, if I have 4 features then I have 4 inputs; what is the relationship in an RNN? Is the input to an MLP a vector, a value, or a tensor, what is its shape, and how does that compare with the shape of an RNN's input?

SE - The MLP input is minibatch_size x feature_vector_size. RNNs have an additional dimension that corresponds to the sequence order (usually time). In dl4j this is the last dimension, so the RNN input is minibatch_size x feature_size x sequence_step.
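To make the two shapes concrete, here is a sketch that uses plain nested lists in place of real tensors. The sizes and the `shape` helper are made up for illustration; this is not dl4j code.

```python
def shape(nested):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

minibatch_size, feature_size, sequence_length = 2, 4, 3

# MLP input: one feature vector per example.
mlp_input = [[0.0] * feature_size for _ in range(minibatch_size)]

# RNN input in the layout described above: time is the last dimension.
rnn_input = [[[0.0] * sequence_length for _ in range(feature_size)]
             for _ in range(minibatch_size)]

print(shape(mlp_input))  # (2, 4)
print(shape(rnn_input))  # (2, 4, 3)

# The features of example 0 at timestep t form an MLP-sized vector.
t = 1
step_vector = [rnn_input[0][f][t] for f in range(feature_size)]
print(len(step_vector))  # 4
```

Slicing out a single timestep gives back the minibatch_size x feature_size layout of the MLP, which is the sense in which the RNN sees an MLP-style input at every step.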

If I have log files with events that occur randomly with millisecond granularity, how do I handle that? Split on the second? Summarize per second? Split on the millisecond? What is the effect of those decisions on the complexity of my NN? SE - It depends. Trite yes, but true :) This kind of goes back to sampling. You want to sample at a rate that allows you to capture the granularity of the event you are looking for. If whatever "event"/output you want to learn about happens on a split-second scale, you probably want to keep things at that granularity, or else you risk losing relevant information. Training on longer sequences takes longer. Your network size is unaffected by the number of time steps you train on...
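A rough sketch of that trade-off, assuming events are summarized as a count per bucket and the network is a single plain recurrent layer. The event times, function names, and layer sizes are all hypothetical.

```python
from collections import Counter

def count_events(timestamps_ms, bucket_ms, duration_ms):
    """Summarize event timestamps as a count per fixed-width bucket."""
    counts = Counter(t // bucket_ms for t in timestamps_ms)
    n_buckets = -(-duration_ms // bucket_ms)  # ceiling division
    return [counts.get(i, 0) for i in range(n_buckets)]

def simple_rnn_parameter_count(n_inputs, n_hidden):
    """Weights of one plain recurrent layer: input, recurrent, and bias."""
    return n_inputs * n_hidden + n_hidden * n_hidden + n_hidden

# Hypothetical log: event times in milliseconds over a 3-second window.
events = [12, 15, 980, 1500, 2999]

per_second = count_events(events, bucket_ms=1000, duration_ms=3000)
per_millisecond = count_events(events, bucket_ms=1, duration_ms=3000)

print(per_second)            # [3, 1, 1]
print(len(per_millisecond))  # 3000
# The layer size depends on features and hidden units, not on timesteps.
print(simple_rnn_parameter_count(n_inputs=1, n_hidden=8))  # 80
```

The finer bucket makes the sequence a thousand times longer, which slows training, while the parameter count stays the same.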

Processing input seems a lot like "expert feature engineering" is it? Or can it be simplified in some way? SE - Processing input in the context of deep learning should mean setting up inputs correctly, i.e. properly transformed (as categorical, ordinal, whatever) and normalized.
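As an illustration of that kind of setup, here is a sketch of a one-hot transform for a categorical input and a zero-mean, unit-variance normalization for a numeric one. The helper names and the log levels are hypothetical, and `standardize` assumes the values are not all identical (otherwise the standard deviation is zero).

```python
from statistics import mean, pstdev

def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over known categories."""
    return [1.0 if value == c else 0.0 for c in categories]

def standardize(values):
    """Rescale numeric values to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

levels = ["INFO", "WARN", "ERROR"]  # hypothetical log levels
print(one_hot("WARN", levels))      # [0.0, 1.0, 0.0]
print(standardize([2.0, 4.0, 6.0]))
```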

What about video content? Feels like an RNN, since it is a sequence of images, but perhaps a CNN because they are images? SE - We have preprocessors that will handle this kind of input, and we have an example that classifies each frame of a video by the shape it shows. https://github.com/deeplearning4j/dl4j-examples/tree/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/recurrent/video

If my sequence size is fixed, say temperature reading every minute from some lab experiment and I split that on day, would an MLP be as useful or do I need an RNN? SE - An MLP would still be useful, but you have to make some choices about how the data is fed into the MLP. For example, you have to decide how many time steps the MLP is going to see as input. The RNN will be able to learn longer dependencies, though. And I am inclined to say that setting up an RNN with what is already sequence data is easier than setting up a pipeline to convert sequence data to non-sequence, MLP-style data.
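One common way to make that conversion is a sliding window. The sketch below assumes the task is predicting the next reading from the previous few, which the question does not actually specify; the function name and the temperature values are made up.

```python
def sliding_windows(series, window_size):
    """Turn one sequence into fixed-size rows: (inputs, next value)."""
    rows = []
    for start in range(len(series) - window_size):
        inputs = series[start:start + window_size]
        target = series[start + window_size]
        rows.append((inputs, target))
    return rows

# Hypothetical minute-by-minute temperature readings.
temperatures = [20.0, 20.5, 21.0, 21.5, 22.0]
rows = sliding_windows(temperatures, window_size=3)
for inputs, target in rows:
    print(inputs, "->", target)
# [20.0, 20.5, 21.0] -> 21.5
# [20.5, 21.0, 21.5] -> 22.0
```

The window size is the "how many time steps" choice: the MLP cannot see anything older than the window, whereas an RNN carries state across the whole sequence.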

turambar commented 7 years ago

@bpark738 thanks for taking the initiative to answer @tomthetrainer's questions here.

@tomthetrainer let's definitely do a post-mortem after the workshop to discuss any lingering questions, etc.

tomthetrainer / June26Class

Help defining the problem space #7