Open Sanaxen opened 2 years ago
Hello @Sanaxen,
You can find the cause here: https://github.com/mlverse/tft/blob/226d21faba832a85c436f73335a6a917d892c9ba/R/predict.R#L148-L160
In plain English, some observations share the same index and key between your past_data
and your new_data.
That smells like data leakage, which you want to avoid at all costs.
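A quick way to check for such shared observations on your side is sketched below. It assumes dplyr is available and that the index and key columns are named date and key as in this thread; the two small tibbles are made up for illustration, with one deliberately overlapping row. It is only a diagnostic and is not claimed to be the check that tft runs internally.

```r
library(dplyr)

# Toy data, made up for illustration: `date` is the index, `key` is the key.
past_data <- tibble(
  date = as.Date(c("1959-10-01", "1959-11-01", "1959-12-01")),
  key = factor("0"),
  target = c(407, 362, 405)
)
# The first row of new_data deliberately repeats the last row of past_data.
new_data <- tibble(
  date = as.Date(c("1959-12-01", "1960-01-01")),
  key = factor("0"),
  target = c(405, 417)
)

# Keep the rows of new_data whose (date, key) pair also exists in past_data.
overlap <- semi_join(new_data, past_data, by = c("date", "key"))
nrow(overlap)
```

If nrow(overlap) is 0, the two tables share no (index, key) pair.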
Hope it helps!
Hi, @cregouby,
The data is divided into train, valid, and test:
> print(train,n=100)
# A tibble: 84 x 4
# Groups: key [1]
date target key self_adding_date_time_since_bg
<date> <dbl> <fct> <dbl>
1 1949-01-01 112 0 0
2 1949-02-01 118 0 268.
3 1949-03-01 132 0 510.
4 1949-04-01 129 0 778.
5 1949-05-01 121 0 1037.
6 1949-06-01 135 0 1305.
7 1949-07-01 148 0 1564.
8 1949-08-01 148 0 1832.
9 1949-09-01 136 0 2100.
10 1949-10-01 119 0 2359.
11 1949-11-01 104 0 2627.
12 1949-12-01 118 0 2886.
13 1950-01-01 115 0 3154.
14 1950-02-01 126 0 3421.
15 1950-03-01 141 0 3663.
16 1950-04-01 135 0 3931.
17 1950-05-01 125 0 4190.
18 1950-06-01 149 0 4458.
19 1950-07-01 170 0 4717.
20 1950-08-01 170 0 4985.
21 1950-09-01 158 0 5253.
22 1950-10-01 133 0 5512.
23 1950-11-01 114 0 5780.
24 1950-12-01 140 0 6039.
25 1951-01-01 145 0 6307.
26 1951-02-01 150 0 6575.
27 1951-03-01 178 0 6817.
28 1951-04-01 163 0 7085.
29 1951-05-01 172 0 7344
30 1951-06-01 178 0 7612.
31 1951-07-01 199 0 7871.
32 1951-08-01 199 0 8139.
33 1951-09-01 184 0 8407.
34 1951-10-01 162 0 8666.
35 1951-11-01 146 0 8934.
36 1951-12-01 166 0 9193.
37 1952-01-01 171 0 9461.
38 1952-02-01 180 0 9729.
39 1952-03-01 193 0 9979.
40 1952-04-01 181 0 10247.
41 1952-05-01 183 0 10506.
42 1952-06-01 218 0 10774.
43 1952-07-01 230 0 11033.
44 1952-08-01 242 0 11301.
45 1952-09-01 209 0 11569.
46 1952-10-01 191 0 11828.
47 1952-11-01 172 0 12096
48 1952-12-01 194 0 12355.
49 1953-01-01 196 0 12623.
50 1953-02-01 196 0 12891.
51 1953-03-01 236 0 13133.
52 1953-04-01 235 0 13401.
53 1953-05-01 229 0 13660.
54 1953-06-01 243 0 13928.
55 1953-07-01 264 0 14187.
56 1953-08-01 272 0 14455.
57 1953-09-01 237 0 14723.
58 1953-10-01 211 0 14982.
59 1953-11-01 180 0 15250.
60 1953-12-01 201 0 15509.
61 1954-01-01 204 0 15777.
62 1954-02-01 188 0 16044.
63 1954-03-01 235 0 16286.
64 1954-04-01 227 0 16554.
65 1954-05-01 234 0 16813.
66 1954-06-01 264 0 17081.
67 1954-07-01 302 0 17340.
68 1954-08-01 293 0 17608.
69 1954-09-01 259 0 17876.
70 1954-10-01 229 0 18135.
71 1954-11-01 203 0 18403.
72 1954-12-01 229 0 18662.
73 1955-01-01 242 0 18930.
74 1955-02-01 233 0 19198.
75 1955-03-01 267 0 19440
76 1955-04-01 269 0 19708.
77 1955-05-01 270 0 19967.
78 1955-06-01 315 0 20235.
79 1955-07-01 364 0 20494.
80 1955-08-01 347 0 20762.
81 1955-09-01 312 0 21030.
82 1955-10-01 274 0 21289.
83 1955-11-01 237 0 21557.
84 1955-12-01 278 0 21816
> print(valid,n=100)
# A tibble: 48 x 4
# Groups: key [1]
date target key self_adding_date_time_since_bg
<date> <dbl> <fct> <dbl>
1 1956-01-01 284 0 22084.
2 1956-02-01 277 0 22352.
3 1956-03-01 317 0 22602.
4 1956-04-01 313 0 22870.
5 1956-05-01 318 0 23129.
6 1956-06-01 374 0 23397.
7 1956-07-01 413 0 23656.
8 1956-08-01 405 0 23924.
9 1956-09-01 355 0 24192
10 1956-10-01 306 0 24451.
11 1956-11-01 271 0 24719.
12 1956-12-01 306 0 24978.
13 1957-01-01 315 0 25246.
14 1957-02-01 301 0 25514.
15 1957-03-01 356 0 25756.
16 1957-04-01 348 0 26024.
17 1957-05-01 355 0 26283.
18 1957-06-01 422 0 26551.
19 1957-07-01 465 0 26810.
20 1957-08-01 467 0 27078.
21 1957-09-01 404 0 27346.
22 1957-10-01 347 0 27605.
23 1957-11-01 305 0 27873.
24 1957-12-01 336 0 28132.
25 1958-01-01 340 0 28400.
26 1958-02-01 318 0 28668.
27 1958-03-01 362 0 28909.
28 1958-04-01 348 0 29177.
29 1958-05-01 363 0 29436.
30 1958-06-01 435 0 29704.
31 1958-07-01 491 0 29964.
32 1958-08-01 505 0 30231.
33 1958-09-01 404 0 30499.
34 1958-10-01 359 0 30758.
35 1958-11-01 310 0 31026.
36 1958-12-01 337 0 31285.
37 1959-01-01 360 0 31553.
38 1959-02-01 342 0 31821.
39 1959-03-01 406 0 32063.
40 1959-04-01 396 0 32331.
41 1959-05-01 420 0 32590.
42 1959-06-01 472 0 32858.
43 1959-07-01 548 0 33117.
44 1959-08-01 559 0 33385.
45 1959-09-01 463 0 33653.
46 1959-10-01 407 0 33912
47 1959-11-01 362 0 34180.
48 1959-12-01 405 0 34439.
> print(test,n=100)
# A tibble: 12 x 4
# Groups: key [1]
date target key self_adding_date_time_since_bg
<date> <dbl> <fct> <dbl>
1 1960-01-01 417 0 34707.
2 1960-02-01 391 0 34975.
3 1960-03-01 419 0 35225.
4 1960-04-01 461 0 35493.
5 1960-05-01 472 0 35752.
6 1960-06-01 535 0 36020.
7 1960-07-01 622 0 36279.
8 1960-08-01 606 0 36547.
9 1960-09-01 508 0 36815.
10 1960-10-01 461 0 37074.
11 1960-11-01 390 0 37342.
12 1960-12-01 432 0 37601.
> spec
A <prepared_tft_dataset_spec> with:
v lookback = 48 and horizon = 12.
v The number of possible slices is 73
-- Covariates:
v `index`: date
v `keys`: key
v `static`:
v `known`: self_adding_date_time_since_bg
v `unknown`:
i Variables that are not specified in other types are considered `unknown`.
I use past_data = bind_rows(train, valid) and new_data = test, so I don't think past_data and new_data share any (index, key) pairs.
> pred <- predict(object = fitted, new_data = test, past_data = bind_rows(train, valid))
Error in `verify_new_data()`:
! `new_data` includes obs that we can't generate predictions.
x Found `12` observations.
I also hit the same error even though the dates in my test data are later than those in the train data.
This is a bug. It would be nice if you could paste a reproducible example so that it's quicker for us to debug.
https://github.com/jbrownlee/Datasets/blob/master/airline-passengers.csv This dataset is very simple: 84 rows for train, 48 rows for valid, and the remaining 12 rows are the ones to predict.
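The 84/48/12 split described above can be sketched without downloading anything, assuming that R's built-in AirPassengers series holds the same monthly values as the linked CSV (its first values, 112, 118, 132, match the train table in this thread). The column names date, target and key follow the printed tibbles; the self_adding_date_time_since_bg column and the model fitting are left out because the thread does not show how they were produced.

```r
library(dplyr)

# AirPassengers ships with base R (datasets package): monthly totals, 1949 to 1960.
air <- tibble(
  date = seq(as.Date("1949-01-01"), by = "month", length.out = length(AirPassengers)),
  target = as.numeric(AirPassengers),
  key = factor("0")
)

# Chronological split: 84 rows for train, 48 for valid, the last 12 for test.
train <- air[1:84, ]
valid <- air[85:132, ]
test <- air[133:144, ]

# Every date in test should be strictly later than every date in train + valid.
past_data <- bind_rows(train, valid)
no_overlap <- max(past_data$date) < min(test$date)
c(train = nrow(train), valid = nrow(valid), test = nrow(test))
```

With this split, no_overlap is TRUE, which supports the point that the dates in test do not collide with past_data.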
Thanks
What is the cause of this error? It may be a coincidence, but it also seems to occur with other data, apparently monthly data.
For example, the well-known `AirlinePassengers.csv` also gives this error.