ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0
10.97k stars 1.18k forks

Forecasting time series clarifications needed #213

Open oren0e opened 5 years ago

oren0e commented 5 years ago

Hi, there are a couple of things I don't get. I made an input CSV file as described, along the lines of:

y,y1,y2,y3
1 2 3 32 4 56 66,37,18,23

Only with much more data for y (the time series, given as numbers separated by whitespace), and the outputs go from y1 to y24, since each one of them represents an hour: I want to forecast an hourly series a day ahead. Then I ran:

ludwig experiment --data_csv data_for_model.csv --model_definition_file model_definition.yaml

While it ran, only the training part was active; the validation and test measures were always 0. I thought the values I supplied for y1 to y24 would be used to test the predictions that are supposed to be made for these outputs (y1 to y24).

After that I issued:

ludwig visualize --visualization learning_curves --training_statistics ./results/experiment_run_2/training_statistics.json 

And I got plots for each output, but the validation line was flat at zero... why is that? And where can I see the actual numerical predictions? Something simple like: y24 value: 43, y24 pred: 40, etc.

Last, I have to say that the visualization part is very non-intuitive. It would be nice to see a long example for a use-case with all the visualization arguments used in the user guide.

Thanks!

w4nderlust commented 5 years ago

@oren0e can you provide your yaml file please? You should see 24 different results (a box with loss and measures for train, vali and test) at each epoch. Are they all 0s? Can you provide the prints? My suspicion is that the data didn't get split for some reason. Regarding predictions, as stated in the documentation, they are saved in the results directory, one csv for each output feature. They are aligned with the test datapoints. Finally, regarding the visualizations, it seems they were intuitive enough for you to be able to obtain your plots :)

oren0e commented 5 years ago

Yes of course, the yaml:

input_features:
    -
        name: y
        type: timeseries

output_features:
    -
        name: y1
        type: numerical
    -
        name: y2
        type: numerical
    -
        name: y3
        type: numerical
    -
        name: y4
        type: numerical
    -
        name: y5
        type: numerical
    -
        name: y6
        type: numerical
    -
        name: y7
        type: numerical
    -
        name: y8
        type: numerical
    -
        name: y9
        type: numerical
    -
        name: y10
        type: numerical
    -
        name: y11
        type: numerical
    -
        name: y12
        type: numerical
    -
        name: y13
        type: numerical
    -
        name: y14
        type: numerical
    -
        name: y15
        type: numerical
    -
        name: y16
        type: numerical
    -
        name: y17
        type: numerical
    -
        name: y18
        type: numerical
    -
        name: y19
        type: numerical
    -
        name: y20
        type: numerical
    -
        name: y21
        type: numerical
    -
        name: y22
        type: numerical
    -
        name: y23
        type: numerical
    -
        name: y24
        type: numerical

Part of the output from the terminal (when ludwig finishes):

===== y16 =====
error: []
loss: 0
mean_absolute_error: 0
mean_squared_error: 0
predictions: []
r2: 0

===== y17 =====
error: []
loss: 0
mean_absolute_error: 0
mean_squared_error: 0
predictions: []
r2: 0

===== y18 =====
error: []
loss: 0
mean_absolute_error: 0
mean_squared_error: 0
predictions: []
r2: 0

===== y19 =====
error: []
loss: 0
mean_absolute_error: 0
mean_squared_error: 0
predictions: []
r2: 0

Also, in the results folder, in the relevant experiment_run folder, I have no CSVs at all, just the model folder and description.json, prediction_statistics.json, and training_statistics.json.

Just to check that I understood correctly: if I were to leave the y1-y24 fields empty in the data csv file, would it then make predictions for "new unseen" data? (That also does not work, by the way.)

Part of one of the last epoch's output:

No datapoints to evaluate on.
No datapoints to evaluate on.
Took 0.3834s
╒═══════╤════════╤══════════════════════╤═══════════════════════╤═══════════╤═══════════════╕
│ y1    │   loss │   mean_squared_error │   mean_absolute_error │        r2 │ error         │
╞═══════╪════════╪══════════════════════╪═══════════════════════╪═══════════╪═══════════════╡
│ train │ 0.0000 │               0.0000 │                0.0014 │ -inf      │ [-0.00144851] │
├───────┼────────┼──────────────────────┼───────────────────────┼───────────┼───────────────┤
│ vali  │ 0.0000 │               0.0000 │                0.0000 │    0.0000 │ []            │
├───────┼────────┼──────────────────────┼───────────────────────┼───────────┼───────────────┤
│ test  │ 0.0000 │               0.0000 │                0.0000 │    0.0000 │ []            │
╘═══════╧════════╧══════════════════════╧═══════════════════════╧═══════════╧═══════════════╛
╒═══════╤════════╤══════════════════════╤═══════════════════════╤═══════════╤══════════════╕
│ y2    │   loss │   mean_squared_error │   mean_absolute_error │        r2 │ error        │
╞═══════╪════════╪══════════════════════╪═══════════════════════╪═══════════╪══════════════╡
│ train │ 0.0000 │               0.0000 │                0.0002 │ -inf      │ [0.00017822] │
├───────┼────────┼──────────────────────┼───────────────────────┼───────────┼──────────────┤
│ vali  │ 0.0000 │               0.0000 │                0.0000 │    0.0000 │ []           │
├───────┼────────┼──────────────────────┼───────────────────────┼───────────┼──────────────┤
│ test  │ 0.0000 │               0.0000 │                0.0000 │    0.0000 │ []           │
╘═══════╧════════╧══════════════════════╧═══════════════════════╧═══════════╧══════════════╛
╒═══════╤════════╤══════════════════════╤═══════════════════════╤═══════════╤═══════════════╕
│ y3    │   loss │   mean_squared_error │   mean_absolute_error │        r2 │ error         │
╞═══════╪════════╪══════════════════════╪═══════════════════════╪═══════════╪═══════════════╡
│ train │ 0.0000 │               0.0000 │                0.0002 │ -inf      │ [-0.00015551] │
├───────┼────────┼──────────────────────┼───────────────────────┼───────────┼───────────────┤
│ vali  │ 0.0000 │               0.0000 │                0.0000 │    0.0000 │ []            │
├───────┼────────┼──────────────────────┼───────────────────────┼───────────┼───────────────┤
│ test  │ 0.0000 │               0.0000 │                0.0000 │    0.0000 │ []            │
╘═══════╧════════╧══════════════════════╧═══════════════════════╧═══════════╧═══════════════╛

A screenshot of the data csv file as viewed in Excel is attached (I have only 1 row, just to be clear). [Image: Screen Shot 2019-03-14 at 11 38 06]

w4nderlust commented 5 years ago

The fact that it says "No datapoints to evaluate on" means that there are no datapoints in the validation and test sets. How many datapoints are in your original dataset? Please try to do the split manually and provide train, validation and test data separately. Let me know if that solves the issue.
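For readers following along, here is a minimal sketch of what such a manual split could look like. It is not taken from Ludwig: the helper names `chronological_split` and `rows_to_csv_text` and the toy rows are hypothetical, and the split is done in order (no shuffling) on the assumption that the rows come from a time series.

```python
import csv
import io


def chronological_split(rows, n_vali, n_test):
    """Split rows in their original order into train, validation, and test lists."""
    n_train = len(rows) - n_vali - n_test
    if n_train <= 0:
        raise ValueError("not enough rows to leave any training data")
    train = rows[:n_train]
    vali = rows[n_train:n_train + n_vali]
    test = rows[n_train + n_vali:]
    return train, vali, test


def rows_to_csv_text(header, rows):
    """Render a header plus rows as CSV text (write it to a file as needed)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Toy data (hypothetical): ten rows, each a window "i i+1 i+2" with the next value as target.
rows = [[f"{i} {i + 1} {i + 2}", i + 3] for i in range(1, 11)]
train, vali, test = chronological_split(rows, n_vali=2, n_test=2)
train_csv = rows_to_csv_text(["y", "y1"], train)
```

Each of the three lists can be rendered and saved to its own file, then supplied separately as suggested above. With a single-row dataset, as turns out to be the case later in this thread, there is nothing to split, which is consistent with the empty validation and test results.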

oren0e commented 5 years ago

In what way? Is my data structured in the wrong way (as can be seen in the screenshot from Excel)? The first column includes all the data I have from the history of this series, something like 3-4 months at an hourly resolution.

w4nderlust commented 5 years ago

Oh yeah, here's the problem then. You are basically providing one single datapoint containing your whole timeseries. So I guess you have only a single timeseries and you want to predict the next k samples. I'm not a timeseries expert, so there could be better ways, but what I would instinctively do in this case is slice the timeseries: for instance, if the series goes from 1 to 100, you can create a datapoint with x: [1-10] and y: 11, and continue until the end, and then at test time you provide your whole timeseries together. Again, this may be suboptimal; timeseries experts may know how to do this much better.
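A rough sketch of that slicing, for illustration only. The function names `make_windows` and `to_table` are hypothetical, and the column naming (y for the input, y1..yk for the targets) simply mirrors the CSV layout used earlier in this thread. The `horizon` parameter is an added assumption that generalizes "predict the next value" to the next k values.

```python
def make_windows(series, window, horizon=1):
    """Slide over the series: `window` past values -> the next `horizon` values."""
    examples = []
    for start in range(len(series) - window - horizon + 1):
        past = series[start:start + window]
        future = series[start + window:start + window + horizon]
        # Inputs are joined with spaces, matching the timeseries cell format above.
        examples.append((" ".join(str(v) for v in past), list(future)))
    return examples


def to_table(examples, horizon):
    """Build a header (y, y1..yk) and one row per example, ready for a CSV writer."""
    header = ["y"] + [f"y{i}" for i in range(1, horizon + 1)]
    rows = [[past] + future for past, future in examples]
    return header, rows


series = list(range(1, 101))  # the "1 to 100" series from the comment
examples = make_windows(series, window=10, horizon=1)
header, rows = to_table(examples, horizon=1)
```

Here the first example pairs "1 2 3 4 5 6 7 8 9 10" with 11 and the last pairs 90..99 with 100, so one long series becomes 90 datapoints instead of one. Setting `horizon=24` would give the y1..y24 layout of the original question.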

oren0e commented 5 years ago

Yes, but eventually I want to predict other values. So let's say my series is 1 to 10 and I want to predict 11, 12, 13; then what you suggest is:

input      output 
1 2 3 4        5
5 6 7 8 9 10   11,12,13

? Basically, that way I am predicting 5 as well? And that is what time series forecasting is (usually) about: taking a single series and predicting the next k values in the series. By the way, in my case I want to make an out-of-sample forecast, i.e. I actually don't have the 11,12,13 values at all.

w4nderlust commented 5 years ago

You can always feed the previously predicted values back in. So if your input is 1,2,3,4 and your output is 5, at the next time step you can feed 1 2 3 4 5 and hope to predict 6, and so on. For time series, as for text, what is supported at the moment is learning from several datapoints, each one a time series or a text, to predict something. That's the main use case for now. Your task is similar to language modeling (predicting the next word given a context of words), which can be done in the way I described in my previous post. I have been trying to find a clean way to do it automatically in Ludwig, so stay tuned: whatever I end up doing for language modeling can then be used for single time series forecasting.
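The feed-back loop described here can be sketched independently of any framework. In the snippet below, `recursive_forecast` is a hypothetical helper and `naive_drift` is only a stand-in for a trained one-step model (it is not a Ludwig call); any function that maps a list of past values to the next value could take its place.

```python
def recursive_forecast(history, predict_next, steps, window=None):
    """Forecast `steps` values by feeding each prediction back in as input."""
    context = list(history)
    forecasts = []
    for _ in range(steps):
        # Optionally keep only the most recent `window` values as model input.
        model_input = context if window is None else context[-window:]
        next_value = predict_next(model_input)
        forecasts.append(next_value)
        context.append(next_value)  # the prediction becomes part of the history
    return forecasts


def naive_drift(values):
    """Placeholder model: last value plus the last observed difference."""
    return values[-1] + (values[-1] - values[-2])


print(recursive_forecast([1, 2, 3, 4], naive_drift, steps=3))  # [5, 6, 7]
```

One caveat with this approach is that errors compound: each step is conditioned on earlier predictions rather than on observed values, so forecasts far ahead tend to be less reliable than the first step.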

oren0e commented 5 years ago

I don't understand. Again, let's take a simple case: a time series of, say, visits to the doctor each hour.

y = 12 3 14 11 7
t = 1  2  3  4  5

This is the information I have. Now I want to predict the value of y for t=6 (which is the future for me, so I don't know its value). Can your framework support that? Meaning my file will look like:

y, y1
12 3 14 11 7, 

with the yaml file specifying that y is an input and y1 is an output.

w4nderlust commented 5 years ago

I already explained how to recast the problem, in case you have a single time series, into several datapoints coming from that timeseries. I'll try to rephrase: take a contiguous subset of your timeseries and train to predict the following element. If you have from 1 to 10, take for instance 1,2,3 as input and 4 as output, then 2,3,4 as input and 5 as output, and so on. Then in your test set put the whole series from 1 to 9 and see if it can predict 10.