mlverse / tft

R implementation of Temporal Fusion Transformers
https://mlverse.github.io/tft/

Error in `verify_new_data()` #39

Open Sanaxen opened 2 years ago

Sanaxen commented 2 years ago
Error in `verify_new_data()`:
! `new_data` includes obs that we can't generate predictions.

What is the cause of this error? It may be a coincidence, but the error also occurs with other datasets, all of which appear to be monthly data.

For example, the famous `AirlinePassengers.csv` also gives this error.

cregouby commented 2 years ago

Hello @Sanaxen,

You can find the cause here: https://github.com/mlverse/tft/blob/226d21faba832a85c436f73335a6a917d892c9ba/R/predict.R#L148-L160

Or, in plain English, some observations in your past_data and your new_data share the same index and key.

This smells like data leakage, which you want to avoid at all costs.

Hope it helps!
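
One way to rule this cause in or out before calling `predict()` is to look for (index, key) pairs that appear in both data frames. The following is only a minimal sketch in base R: `find_overlap()` is a hypothetical helper, not part of tft, and it does not reproduce the package's internal check. The column names `date` and `key` are taken from the spec shown in this thread.

```r
# Hypothetical helper (not part of tft): return the (index, key) pairs that
# occur in both past_data and new_data.
find_overlap <- function(past_data, new_data, index = "date", key = "key") {
  by_cols <- c(index, key)
  # as.data.frame() drops tibble/grouping classes so base merge() applies.
  past_pairs <- unique(as.data.frame(past_data)[, by_cols, drop = FALSE])
  new_pairs  <- unique(as.data.frame(new_data)[, by_cols, drop = FALSE])
  merge(past_pairs, new_pairs, by = by_cols)
}

# Toy data: three months of history for a single key.
past <- data.frame(
  date = seq(as.Date("1959-10-01"), by = "month", length.out = 3),
  key  = factor("0")
)
new_ok  <- data.frame(date = as.Date("1960-01-01"), key = factor("0"))
new_bad <- data.frame(date = as.Date("1959-12-01"), key = factor("0"))

nrow(find_overlap(past, new_ok))   # 0: no shared (date, key) pair
nrow(find_overlap(past, new_bad))  # 1: 1959-12-01 is in both data frames
```

If this returns zero rows and the error still appears, shared (index, key) pairs are probably not the explanation.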

Sanaxen commented 2 years ago

Hi, @cregouby,

The data is divided into train, valid, and test:

> print(train,n=100)
# A tibble: 84 x 4
# Groups:   key [1]
   date       target key   self_adding_date_time_since_bg
   <date>      <dbl> <fct>                          <dbl>
 1 1949-01-01    112 0                                 0 
 2 1949-02-01    118 0                               268.
 3 1949-03-01    132 0                               510.
 4 1949-04-01    129 0                               778.
 5 1949-05-01    121 0                              1037.
 6 1949-06-01    135 0                              1305.
 7 1949-07-01    148 0                              1564.
 8 1949-08-01    148 0                              1832.
 9 1949-09-01    136 0                              2100.
10 1949-10-01    119 0                              2359.
11 1949-11-01    104 0                              2627.
12 1949-12-01    118 0                              2886.
13 1950-01-01    115 0                              3154.
14 1950-02-01    126 0                              3421.
15 1950-03-01    141 0                              3663.
16 1950-04-01    135 0                              3931.
17 1950-05-01    125 0                              4190.
18 1950-06-01    149 0                              4458.
19 1950-07-01    170 0                              4717.
20 1950-08-01    170 0                              4985.
21 1950-09-01    158 0                              5253.
22 1950-10-01    133 0                              5512.
23 1950-11-01    114 0                              5780.
24 1950-12-01    140 0                              6039.
25 1951-01-01    145 0                              6307.
26 1951-02-01    150 0                              6575.
27 1951-03-01    178 0                              6817.
28 1951-04-01    163 0                              7085.
29 1951-05-01    172 0                              7344 
30 1951-06-01    178 0                              7612.
31 1951-07-01    199 0                              7871.
32 1951-08-01    199 0                              8139.
33 1951-09-01    184 0                              8407.
34 1951-10-01    162 0                              8666.
35 1951-11-01    146 0                              8934.
36 1951-12-01    166 0                              9193.
37 1952-01-01    171 0                              9461.
38 1952-02-01    180 0                              9729.
39 1952-03-01    193 0                              9979.
40 1952-04-01    181 0                             10247.
41 1952-05-01    183 0                             10506.
42 1952-06-01    218 0                             10774.
43 1952-07-01    230 0                             11033.
44 1952-08-01    242 0                             11301.
45 1952-09-01    209 0                             11569.
46 1952-10-01    191 0                             11828.
47 1952-11-01    172 0                             12096 
48 1952-12-01    194 0                             12355.
49 1953-01-01    196 0                             12623.
50 1953-02-01    196 0                             12891.
51 1953-03-01    236 0                             13133.
52 1953-04-01    235 0                             13401.
53 1953-05-01    229 0                             13660.
54 1953-06-01    243 0                             13928.
55 1953-07-01    264 0                             14187.
56 1953-08-01    272 0                             14455.
57 1953-09-01    237 0                             14723.
58 1953-10-01    211 0                             14982.
59 1953-11-01    180 0                             15250.
60 1953-12-01    201 0                             15509.
61 1954-01-01    204 0                             15777.
62 1954-02-01    188 0                             16044.
63 1954-03-01    235 0                             16286.
64 1954-04-01    227 0                             16554.
65 1954-05-01    234 0                             16813.
66 1954-06-01    264 0                             17081.
67 1954-07-01    302 0                             17340.
68 1954-08-01    293 0                             17608.
69 1954-09-01    259 0                             17876.
70 1954-10-01    229 0                             18135.
71 1954-11-01    203 0                             18403.
72 1954-12-01    229 0                             18662.
73 1955-01-01    242 0                             18930.
74 1955-02-01    233 0                             19198.
75 1955-03-01    267 0                             19440 
76 1955-04-01    269 0                             19708.
77 1955-05-01    270 0                             19967.
78 1955-06-01    315 0                             20235.
79 1955-07-01    364 0                             20494.
80 1955-08-01    347 0                             20762.
81 1955-09-01    312 0                             21030.
82 1955-10-01    274 0                             21289.
83 1955-11-01    237 0                             21557.
84 1955-12-01    278 0                             21816 
> print(valid,n=100)
# A tibble: 48 x 4
# Groups:   key [1]
   date       target key   self_adding_date_time_since_bg
   <date>      <dbl> <fct>                          <dbl>
 1 1956-01-01    284 0                             22084.
 2 1956-02-01    277 0                             22352.
 3 1956-03-01    317 0                             22602.
 4 1956-04-01    313 0                             22870.
 5 1956-05-01    318 0                             23129.
 6 1956-06-01    374 0                             23397.
 7 1956-07-01    413 0                             23656.
 8 1956-08-01    405 0                             23924.
 9 1956-09-01    355 0                             24192 
10 1956-10-01    306 0                             24451.
11 1956-11-01    271 0                             24719.
12 1956-12-01    306 0                             24978.
13 1957-01-01    315 0                             25246.
14 1957-02-01    301 0                             25514.
15 1957-03-01    356 0                             25756.
16 1957-04-01    348 0                             26024.
17 1957-05-01    355 0                             26283.
18 1957-06-01    422 0                             26551.
19 1957-07-01    465 0                             26810.
20 1957-08-01    467 0                             27078.
21 1957-09-01    404 0                             27346.
22 1957-10-01    347 0                             27605.
23 1957-11-01    305 0                             27873.
24 1957-12-01    336 0                             28132.
25 1958-01-01    340 0                             28400.
26 1958-02-01    318 0                             28668.
27 1958-03-01    362 0                             28909.
28 1958-04-01    348 0                             29177.
29 1958-05-01    363 0                             29436.
30 1958-06-01    435 0                             29704.
31 1958-07-01    491 0                             29964.
32 1958-08-01    505 0                             30231.
33 1958-09-01    404 0                             30499.
34 1958-10-01    359 0                             30758.
35 1958-11-01    310 0                             31026.
36 1958-12-01    337 0                             31285.
37 1959-01-01    360 0                             31553.
38 1959-02-01    342 0                             31821.
39 1959-03-01    406 0                             32063.
40 1959-04-01    396 0                             32331.
41 1959-05-01    420 0                             32590.
42 1959-06-01    472 0                             32858.
43 1959-07-01    548 0                             33117.
44 1959-08-01    559 0                             33385.
45 1959-09-01    463 0                             33653.
46 1959-10-01    407 0                             33912 
47 1959-11-01    362 0                             34180.
48 1959-12-01    405 0                             34439.
> print(test,n=100)
# A tibble: 12 x 4
# Groups:   key [1]
   date       target key   self_adding_date_time_since_bg
   <date>      <dbl> <fct>                          <dbl>
 1 1960-01-01    417 0                             34707.
 2 1960-02-01    391 0                             34975.
 3 1960-03-01    419 0                             35225.
 4 1960-04-01    461 0                             35493.
 5 1960-05-01    472 0                             35752.
 6 1960-06-01    535 0                             36020.
 7 1960-07-01    622 0                             36279.
 8 1960-08-01    606 0                             36547.
 9 1960-09-01    508 0                             36815.
10 1960-10-01    461 0                             37074.
11 1960-11-01    390 0                             37342.
12 1960-12-01    432 0                             37601.
> spec
A <prepared_tft_dataset_spec> with:

v lookback = 48 and horizon = 12.
v The number of possible slices is 73

-- Covariates: 
v `index`: date
v `keys`: key
v `static`: 
v `known`: self_adding_date_time_since_bg
v `unknown`: 
i Variables that are not specified in other types are considered `unknown`.

past_data = bind_rows(train, valid)
new_data = test

But I don't think past_data and new_data share any (index, key) pair.

> pred <- predict(object = fitted, new_data = test, past_data = bind_rows(train, valid))
Error in `verify_new_data()`:
! `new_data` includes obs that we can't generate predictions.
x Found `12` observations.

Ujjwal4CULS commented 1 year ago

I get the same error even though the dates in my test data come after those in my training data.

dfalbel commented 1 year ago

This is a bug. Would be nice if you could paste a reproducible example so it's quicker for us to debug.

Sanaxen commented 1 year ago

https://github.com/jbrownlee/Datasets/blob/master/airline-passengers.csv

This data is very simple: the first 84 rows are train, the next 48 rows are valid, and the remaining 12 rows are the ones to predict.
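
The split described above can be sketched in base R as follows. As an assumption, the built-in `AirPassengers` series (144 monthly values from 1949 to 1960) stands in for the linked CSV so that the snippet runs without a download. The column names `date`, `target`, and `key` follow the tibbles printed earlier in this thread. The `self_adding_date_time_since_bg` column and the model fitting step are left out.

```r
# Assumption: the built-in AirPassengers series stands in for the linked CSV.
passengers <- data.frame(
  date   = seq(as.Date("1949-01-01"), by = "month",
               length.out = length(AirPassengers)),
  target = as.numeric(AirPassengers),
  key    = factor("0")
)

# 84 rows for training, 48 for validation, the last 12 to predict.
train <- passengers[1:84, ]
valid <- passengers[85:132, ]
test  <- passengers[133:144, ]

# past_data as passed to predict() in this thread (rbind instead of bind_rows).
past_data <- rbind(train, valid)

range(past_data$date)  # history ends in December 1959
range(test$date)       # new data covers January to December 1960
```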

dfalbel commented 1 year ago

Thanks