tidyverts / fabletools

General fable features useful for extension packages
http://fabletools.tidyverts.org/
89 stars 31 forks source link

Should model training methods be given a tsibble or a vector? #74

Closed mitchelloharawild closed 5 years ago

mitchelloharawild commented 5 years ago

@earowang

https://github.com/tidyverts/fablelite/blob/29b75c1c7a6acb834a52ef0dd36e51bdc18a2241/R/estimate.R#L25

This step seems surprisingly expensive. On a simple LM on 888 keys, it makes up 27% of the computational time. I'm not exactly sure why this is, but it may be faster if we drop the tabular structure.

Do you think that the data argument in model training methods should be a vector or a tsibble? In most (maybe all?) cases we just extract this vector anyway. https://github.com/tidyverts/fable/blob/ea606bd729b41b47e3c08e16ccaf47445c6faaa4/R/arima.R#L18 The tsibble attributes may be useful for some special case models, however the raw data can be accessed from the R6 object using self.

earowang commented 5 years ago

I think this roots in eval_tidy() call that is very expensive. I assume you say the interface remains the same, but internally handles a vector?

mitchelloharawild commented 5 years ago

Yes, that is correct.

On Wed., 5 Jun. 2019, 2:18 pm Earo Wang, notifications@github.com wrote:

I think this roots in eval_tidy() call that is very expensive. I assume you say the interface remains the same, but internally handles a vector?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/tidyverts/fablelite/issues/74?email_source=notifications&email_token=AD3BJF3E66AVTAE5DYO5PCDPY45ABA5CNFSM4HTNBNYKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODW6RWSY#issuecomment-498932555, or mute the thread https://github.com/notifications/unsubscribe-auth/AD3BJF5NX3KQQSUAWGCO4Q3PY45ABANCNFSM4HTNBNYA .

earowang commented 5 years ago

Sounds reasonable.

mitchelloharawild commented 5 years ago

I wonder why the overhead of transmute is so high here. As suspected:

library(rlang)
library(tidyverse)
.data <- tsibble::as_tsibble(USAccDeaths)
parsed <- list()
parsed$expressions <- exprs(log(value))

.env <- env_bury(rlang::global_env(), data = .data, transmute = transmute)
bench::mark(
  eval_tidy(expr(transmute(!!sym("data"), !!!parsed$expressions)), env = .env),
  eval_tidy(parsed$expressions[[1]], data = .data, env = rlang::global_env()),
  check = FALSE
)
#> # A tibble: 2 x 10
#>   expression     min    mean  median     max `itr/sec` mem_alloc  n_gc
#>   <chr>      <bch:t> <bch:t> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl>
#> 1 "eval_tid…  7.09ms  7.75ms  7.37ms  13.6ms      129.        0B    11
#> 2 eval_tidy… 12.18µs 14.13µs 13.36µs 161.9µs    70793.        0B     6
#> # … with 2 more variables: n_itr <int>, total_time <bch:tm>

edit: Seems like select is fairly expensive? Created on 2019-06-05 by the reprex package (v0.2.1)

mitchelloharawild commented 5 years ago

Rebuilding the tsibble is expensive here, even when using validate=FALSE.

Instead because the attributes shouldn't change in this calculation, I'm best off using this:

library(rlang)
library(tidyverse)
library(tsibble)
#> 
#> Attaching package: 'tsibble'
#> The following object is masked from 'package:dplyr':
#> 
#>     id
.data <- as_tsibble(USAccDeaths)
parsed <- list()
parsed$expressions <- exprs(log(value))

.env <- env_bury(rlang::global_env(), data = .data, transmute = transmute)
bench::mark(
  eval_tidy(expr(transmute(!!sym("data"), !!!parsed$expressions)), env = .env),
  {
    .dt_attr <- attributes(.data)
    out <- unclass(.data)[expr_text(index(.data))]
    out[[expr_text(parsed$expressions[[1]])]] <- eval_tidy(parsed$expressions[[1]], data = .data, env = rlang::global_env())
    attributes(out) <- .dt_attr
    out
  },
  check = FALSE
)
#> # A tibble: 2 x 10
#>   expression    min    mean  median    max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch:> <bch:t> <bch:t> <bch:>     <dbl> <bch:byt> <dbl> <int>
#> 1 "eval_tid… 17.1ms  26.8ms  26.8ms 33.2ms      37.3        0B     4    13
#> 2 {...       96.9µs   139µs 107.3µs  2.1ms    7194.         0B     9  3352
#> # … with 1 more variable: total_time <bch:tm>

Created on 2019-06-05 by the reprex package (v0.2.1)

mitchelloharawild commented 5 years ago

With performance issues fixed I don't see this being an issue now. As for whether a tsibble or a vector is more appropriate (beyond performance), I'm not sure.