mdancho84 / modeltime-iterative-forecasting

Changes in Data Preparation Function #1

Closed AlbertoAlmuinha closed 3 years ago

AlbertoAlmuinha commented 3 years ago

Hi @mdancho84,

The first step of data preparation is carried out by three functions. The first one, extend_timeseries, checks for missing values and throws an error if any exist, because that field is later used to filter the data. I think it would be worth replacing this error with a warning that suggests imputing the missing values later in the workflow with recipes.

I have modified nest_timeseries so that the split between actual and future data is no longer based on missing values. This way the user is not required to impute them beforehand and can, if desired, impute them later with recipes.

This is the modified function:
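
To make the suggestion concrete, here is a minimal sketch of a check that warns instead of aborting. The helper name `check_missing_values` and its `.value_col` argument are hypothetical and not part of the package; only `rlang::warn()` is an existing call.

```r
library(rlang)

# Hypothetical helper (not in the package): warn, rather than abort,
# when the value column contains missing values.
check_missing_values <- function(.data, .value_col) {
    n_missing <- sum(is.na(.data[[.value_col]]))
    if (n_missing > 0) {
        rlang::warn(paste0(
            n_missing, " missing value(s) found in `", .value_col, "`. ",
            "Consider imputing them later in the workflow with recipes."
        ))
    }
    # Return the data unchanged so the check can sit inside a pipeline
    invisible(.data)
}
```

Because the data is returned invisibly, a check like this could be dropped into an existing pipe without changing the result.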

nest_timeseries <- function(.data, .id_var, .length_out) {

    id_var_expr    <- enquo(.id_var)

    # SPLIT FUTURE AND ACTUAL DATA

    future_data_tbl <- .data %>%
        panel_tail(id = !!id_var_expr, n = .length_out)

    # Count the groups via the supplied id column instead of a hard-coded `id`
    groups <- future_data_tbl %>% pull(!!id_var_expr) %>% n_distinct()

    n_group <- .data %>% group_by(!!id_var_expr) %>% summarise(n = n() - (dim(future_data_tbl)[1]/groups))

    actual_data_tbl <- .data %>%
        inner_join(n_group, by = rlang::quo_name(id_var_expr)) %>%
        group_by(!!id_var_expr) %>%
        slice(seq(first(n))) %>%
        ungroup()

    # CHECKS
    if (nrow(future_data_tbl) == 0) {
        rlang::warn("Future Data is `NULL`. Try using `extend_timeseries()` to add future data.")
    }

    # NEST

    ret_1 <- actual_data_tbl %>%
        nest(.actual_data = - (!! id_var_expr))

    ret_2 <- future_data_tbl %>%
        nest(.future_data = - (!! id_var_expr))

    # JOIN

    id_col_text <- names(ret_1)[[1]]

    ret <- left_join(ret_1, ret_2, by = id_col_text)

    return(ret)

}
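
The idea behind the function above, splitting by row position rather than by missing values, can be illustrated in isolation. This is a simplified sketch on toy data using only dplyr (`slice_tail()` and `slice_head()` with a negative `n`), not the package's `panel_tail()`; the names `panel_tbl` and `length_out` are made up for the example.

```r
library(dplyr)

# Toy panel: two series; the last 2 rows of each are the "future" rows.
# Series "a" also has a missing value inside its actual data.
panel_tbl <- tibble(
    id    = rep(c("a", "b"), each = 5),
    date  = rep(as.Date("2021-01-01") + 0:4, times = 2),
    value = c(1, NA, 3, NA, NA, 4, 5, 6, NA, NA)
)

length_out <- 2

# Future data: the last `length_out` rows of each group
future_tbl <- panel_tbl %>%
    group_by(id) %>%
    slice_tail(n = length_out) %>%
    ungroup()

# Actual data: everything except the last `length_out` rows of each group
actual_tbl <- panel_tbl %>%
    group_by(id) %>%
    slice_head(n = -length_out) %>%
    ungroup()
```

The missing value inside series "a" stays in `actual_tbl`, so it can still be imputed later in a recipe.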

mdancho84 commented 3 years ago

Ok, I will test it out and report back. Thanks for this!

mdancho84 commented 3 years ago

This one was easy. Done.

mdancho84 commented 3 years ago

@AlbertoAlmuinha I had to revert to the previous nest_timeseries() function because the future forecast was resulting in an error. I still need to do some debugging, but I wanted to have working code for the review.

AlbertoAlmuinha commented 3 years ago

Ok, I can take a look at this to see what is happening!

mdancho84 commented 3 years ago

I'll look into it too this weekend. Keep me posted if you find anything. I was thinking of making yours a nest_timeseries2() and comparing differences in the resulting objects with the waldo package. https://www.tidyverse.org/blog/2020/10/waldo/

AlbertoAlmuinha commented 3 years ago

I didn't know that package. I think it's a good idea; it can make my task much easier. If I find out anything I will let you know.

AlbertoAlmuinha commented 3 years ago

That's it, I've fixed the problem. That package is a real JEWEL! Thanks for recommending it to me so I can add it to my arsenal. The problem was that the join left an additional "n" column in the actual data, holding a fixed number: the row count for that group. After simply adding a select(-n), both nested data frames are exactly the same, and I have run all the code without any problems. This is the new version:

nest_timeseries <- function(.data, .id_var, .length_out) {

    id_var_expr    <- enquo(.id_var)

    # SPLIT FUTURE AND ACTUAL DATA

    future_data_tbl <- .data %>%
        panel_tail(id = !!id_var_expr, n = .length_out)

    # Count the groups via the supplied id column instead of a hard-coded `id`
    groups <- future_data_tbl %>% pull(!!id_var_expr) %>% n_distinct()

    n_group <- .data %>% group_by(!!id_var_expr) %>% summarise(n = n() - (dim(future_data_tbl)[1]/groups))

    actual_data_tbl <- .data %>%
        inner_join(n_group, by = rlang::quo_name(id_var_expr)) %>%
        group_by(!!id_var_expr) %>%
        slice(seq(first(n))) %>%
        ungroup() %>%
        select(-n)

    # CHECKS
    if (nrow(future_data_tbl) == 0) {
        rlang::warn("Future Data is `NULL`. Try using `extend_timeseries()` to add future data.")
    }

    # NEST

    ret_1 <- actual_data_tbl %>%
        nest(.actual_data = - (!! id_var_expr))

    ret_2 <- future_data_tbl %>%
        nest(.future_data = - (!! id_var_expr))

    # JOIN

    id_col_text <- names(ret_1)[[1]]

    ret <- left_join(ret_1, ret_2, by = id_col_text)

    return(ret)

}
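
The comparison that exposed the stray column can be sketched as follows. The two data frames are made-up stand-ins for the actual data produced by the old and new versions of the function; `waldo::compare()` is the only package call used.

```r
library(waldo)

# Made-up stand-ins for the actual data from the two function versions
old_actual <- data.frame(id = c("a", "a"), value = c(1, 2))
new_actual <- data.frame(id = c("a", "a"), value = c(1, 2), n = c(2L, 2L))

# Reports the differences, including the extra `n` column
diff_before <- waldo::compare(old_actual, new_actual)
print(diff_before)

# After dropping the column, the objects match and no differences remain
diff_after <- waldo::compare(old_actual, new_actual[, c("id", "value")])
print(diff_after)
```

An empty comparison result (length zero) means the two objects are identical, which is the condition the fix was aiming for.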

mdancho84 commented 3 years ago

Oh, wow. That was quick. I will give it a try shortly.

mdancho84 commented 3 years ago

It works now. Thanks!