sdobber / FluxArchitectures.jl

Complex neural network examples for Flux.jl
MIT License

question on how the examples work #7

Open lorrp1 opened 3 years ago

lorrp1 commented 3 years ago

Hello, I have a few questions regarding the examples:

1) From what I understood, the model (let's say LSTNet) gets as input a target and the data with features (the same size as the target, plus the target itself), then updates its parameters to fit the initial target, and the result is an array with the length of the initial target. I don't get why the model already gets most of the target as input (here m.ConvLayer(x)), aside from the last few values shifted by the horizon. Couldn't it just output its input as the target, with the last part of the horizon random, and still return an almost "perfect fit"?

Assuming it actually forecasts without already getting as input the part it should forecast:

3) Is the forecast value at time x only the last point of the horizon, forecast from time y = x - horizon, for any point of the time series?

How is the loss calculated if the last part in the plot (the horizon of the forecast) lies outside the range of the input target (at least in the plot)?

I'm trying to recreate something like https://github.com/jingw2/demand_forecast using these models. Maybe it is the same thing the examples.jl here does when plotting the last part of the horizon outside the target range, but I'm not sure.

sdobber commented 3 years ago

Hi @lorrp1 ,

sorry for the late reply.

The model gets as input a number of features from a time series, say from time t - poollength to time t. The task is to predict a future value at time t + horizon from these features. Following the Flux performance guide, all features get assembled in a matrix, where the last dimension corresponds to the different timesteps.

To give the model the past time series up to some point that is currently observable for a certain point in time, the poollength parameter provides a window of poollength steps back in time for all the features. All this happens in the second dimension of input: input[:,1,1,60] is a vector of all the feature values at the 60th timestep, and input[:,2,1,60] is a vector of the feature values one step back from the 60th timestep. (So, for example, input[:,2,1,61] == input[:,1,1,60] holds true.) Whether you include the time series from which the target is derived as a feature is up to you, but I see no reason not to supply the model with what is known up to the current point in time.
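To illustrate, here is a minimal sketch of that windowed layout. It's hypothetical code, not taken from the package: raw is an imaginary features × timesteps matrix, and all sizes are made up.

    # Hypothetical windowed layout: input[:, w, 1, t] holds the feature
    # vector from w-1 steps before timestep t. The first poollength-1
    # slices stay zero, since no full window exists there.
    features, timesteps, poollength = 4, 100, 10
    raw = rand(Float32, features, timesteps)   # imaginary raw feature series

    input = zeros(Float32, features, poollength, 1, timesteps)
    for t in poollength:timesteps, w in 1:poollength
        input[:, w, 1, t] = raw[:, t - w + 1]
    end

    # The shift property from above:
    @assert input[:, 2, 1, 61] == input[:, 1, 1, 60]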

m.ConvLayer operates only on the first two dimensions, so for a fixed timepoint it should only be able to access the features for the current timestep and poollength steps back in time. Its output is then a time series with convlayersize new "features". This way, the model should not have access to future points in time, though I admit that I never checked that thoroughly.
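Continuing that sketch, one can check that a Conv layer whose kernel spans all features and the full pooling window collapses each window to a single spatial position, so every timestep in the batch dimension only sees its own window (convlayersize is an assumed number of filters):

    using Flux

    # Kernel (features, poollength) covers one whole window, so the output
    # has spatial size 1x1: convlayersize new "features" per timestep.
    convlayersize = 8
    conv = Conv((features, poollength), 1 => convlayersize)
    a = conv(input)
    @assert size(a) == (1, 1, convlayersize, timesteps)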

The training now tries to optimize the model parameters so that the model output (seen as a time series) matches the target, which means that at time t, it should predict the target variable at time t + horizon.
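Continuing the hypothetical sketch once more, the input/target alignment could look like this (assuming the target is the first feature row of raw):

    # Align each window with the target `horizon` steps after it.
    horizon = 6
    valid = poollength:(timesteps - horizon)   # windows with a known target
    X = input[:, :, :, valid]
    Y = raw[1, valid .+ horizon]               # target, shifted by the horizon
    # Training then minimizes a loss such as Flux.mse between the model
    # output on X (seen as a time series) and Y.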

What you mention, that the model could basically just output (more or less) the current point in time to get an almost perfect fit, is a general problem in time-series forecasting and applies to all methods; it is part of what makes forecasting a difficult problem.

Concerning your question number 3, I am not an expert on jingw2's forecasting setup, but it looks very different from what I am doing in my code. I am only interested in forecasting a short amount of time ahead, and for each new forecast I have new data available that I can feed to the model. So I basically use everything available up to a certain time to come up with a prediction, until the data is exhausted. On https://github.com/jingw2/demand_forecast, it looks like they train the model on a certain dataset and then let it output forecasts for a longer period of time. I would inspect their code to see if one can get some hints about how they train, and how they forecast.

lorrp1 commented 3 years ago

Thank you for the explanation, I think I now understand how the model works.

But is there no easy way to make the model accept input data of a smaller size a×b×c×x for a model that was initially trained on a×d×c×y with y > x? (I mean using pred = model(input) on a trained model.)

When I try pred = model(input) with a smaller data length than the one used to initially train the model, I get: DimensionMismatch("arrays could not be broadcast to a common size").

The Conv((in, poolsize)) layer should be fine, assuming there is enough input data for the pool size, since the data length is not fixed at the model's initialization. The error is in m.RecurLayer(a) (I used a = m.ConvLayer(x) to check whether the error was in the convolutional or the recurrent layer).

The convlayersize is also the same, so I don't really understand the error in the recurrent layer.

Edit: another issue would be how to know whether it isn't just overfitting, without a test or validation sample.

sdobber commented 3 years ago

With the way Flux treats recurrent layers, their hidden state gets initialized to the correct size for the input data the first time you call a layer. When the size changes (e.g. by changing from training to test data), you get the DimensionMismatch("arrays could not be broadcast to a common size") error. The solution is to call Flux.reset!(model) before changing the size of the input. I normally include that in the loss function (that was mentioned in the documentation at one point, but now it seems to have been removed):

    # Compute the loss, then reset the recurrent hidden state so the next
    # call can accept input of a different size.
    loss(x, y) = begin
        l = Flux.mse(model(x), y)
        Flux.reset!(model)
        return l
    end
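
Analogously, before predicting on data of a different size, reset first (test_input is a hypothetical smaller dataset):

    Flux.reset!(model)         # clear the hidden state sized for the training data
    pred = model(test_input)   # test_input: hypothetical, now accepted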

And of course it is a good idea to have a training, validation and test data set. I just wanted to keep my code simple to focus on the networks, and not build a whole data handling structure around everything :grin:

lorrp1 commented 3 years ago

Thank you again. I'm going to add test/validation sets, and maybe even change the ADAM optimizer settings while training.

lorrp1 commented 3 years ago

It seems the models return NaN every time the poollength is higher than 2/3.

sdobber commented 3 years ago

I tried LSTNet with poollength = 1, 2, 5, 10, 15, 20, 50, 100, and that all worked fine. The variable defines a number of timesteps, so non-integer values don't make sense.

lorrp1 commented 3 years ago

By 2/3 I meant 2 or 3, but it's working now; I had changed the dataloader to read a CSV file, but made a mistake there. Do you think it would be enough to change the output size of the last dense layer (and the loss) to turn LSTNet into a classifier?

sdobber commented 3 years ago

Might be worth a try. For my use case classification never really worked out, so my experience with this is limited.
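
For what it's worth, a hedged sketch of that idea, with made-up names: model_body stands for everything up to the last layer, hidden for its output width, and the labels y are assumed to be one-hot encoded.

    using Flux

    # Hypothetical classification head: one score per class instead of a
    # single regression output, paired with a cross-entropy loss.
    nclasses = 3
    clf = Chain(model_body, Dense(hidden, nclasses))   # model_body, hidden: placeholders

    loss(x, y) = begin
        l = Flux.logitcrossentropy(clf(x), y)
        Flux.reset!(clf)   # reset recurrent state, as above
        return l
    end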