tidyverts / fabletools

General fable features useful for extension packages
http://fabletools.tidyverts.org/

add_features family? #92

Closed njtierney closed 5 years ago

njtierney commented 5 years ago

Hello!

I find the following pattern common in brolgar:

library(brolgar)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
wages_ts %>%
  features(ln_wages, feat_monotonic) %>%
  left_join(wages_ts, by = tsibble::key_vars(wages_ts))
#> # A tibble: 6,402 x 13
#>       id increase decrease unvary monotonic ln_wages    xp   ged postexp
#>    <int> <lgl>    <lgl>    <lgl>  <lgl>        <dbl> <dbl> <int>   <dbl>
#>  1    31 FALSE    FALSE    FALSE  FALSE         1.49 0.015     1   0.015
#>  2    31 FALSE    FALSE    FALSE  FALSE         1.43 0.715     1   0.715
#>  3    31 FALSE    FALSE    FALSE  FALSE         1.47 1.73      1   1.73 
#>  4    31 FALSE    FALSE    FALSE  FALSE         1.75 2.77      1   2.77 
#>  5    31 FALSE    FALSE    FALSE  FALSE         1.93 3.93      1   3.93 
#>  6    31 FALSE    FALSE    FALSE  FALSE         1.71 4.95      1   4.95 
#>  7    31 FALSE    FALSE    FALSE  FALSE         2.09 5.96      1   5.96 
#>  8    31 FALSE    FALSE    FALSE  FALSE         2.13 6.98      1   6.98 
#>  9    36 FALSE    FALSE    FALSE  FALSE         1.98 0.315     1   0.315
#> 10    36 FALSE    FALSE    FALSE  FALSE         1.80 0.983     1   0.983
#> # … with 6,392 more rows, and 4 more variables: black <int>,
#> #   hispanic <int>, high_grade <int>, unemploy_rate <dbl>

Created on 2019-07-16 by the reprex package (v0.3.0)

And so I wonder if it might be useful to consider an add_features family of functions, which perform this left_join based on the key of the data - perhaps implemented like so:

add_features <- function(.data, .var, features, ...){

    .var <- rlang::enquo(.var)
    if (rlang::quo_is_null(.var)) {
      rlang::inform(sprintf("Feature variable not specified, automatically selected `.var = %s`", 
                     tsibble::measured_vars(.data)[1]))
      .var <- rlang::as_quosure(rlang::sym(tsibble::measured_vars(.data)[[1]]),
                                env = rlang::empty_env())
    }
    else if (purrr::possibly(purrr::compose(rlang::is_quosures, 
                                            rlang::eval_tidy), 
                             FALSE)(.var)) {
      rlang::abort("`add_features()` only supports a single variable. To compute features across multiple variables consider scoped variants like `features_at()`")
    }
    # features_impl() is an internal (unexported) helper, hence the `:::`
    fablelite:::features_impl(.data, list(.var), features, ...) %>%
      dplyr::left_join(.data, by = tsibble::key_vars(.data))
}

library(brolgar)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
wages_ts %>% add_features(ln_wages, feat_monotonic)
#> # A tibble: 6,402 x 13
#>       id increase decrease unvary monotonic ln_wages    xp   ged postexp
#>    <int> <lgl>    <lgl>    <lgl>  <lgl>        <dbl> <dbl> <int>   <dbl>
#>  1    31 FALSE    FALSE    FALSE  FALSE         1.49 0.015     1   0.015
#>  2    31 FALSE    FALSE    FALSE  FALSE         1.43 0.715     1   0.715
#>  3    31 FALSE    FALSE    FALSE  FALSE         1.47 1.73      1   1.73 
#>  4    31 FALSE    FALSE    FALSE  FALSE         1.75 2.77      1   2.77 
#>  5    31 FALSE    FALSE    FALSE  FALSE         1.93 3.93      1   3.93 
#>  6    31 FALSE    FALSE    FALSE  FALSE         1.71 4.95      1   4.95 
#>  7    31 FALSE    FALSE    FALSE  FALSE         2.09 5.96      1   5.96 
#>  8    31 FALSE    FALSE    FALSE  FALSE         2.13 6.98      1   6.98 
#>  9    36 FALSE    FALSE    FALSE  FALSE         1.98 0.315     1   0.315
#> 10    36 FALSE    FALSE    FALSE  FALSE         1.80 0.983     1   0.983
#> # … with 6,392 more rows, and 4 more variables: black <int>,
#> #   hispanic <int>, high_grade <int>, unemploy_rate <dbl>

Created on 2019-07-16 by the reprex package (v0.3.0)

mitchelloharawild commented 5 years ago

I think I'd prefer to keep the join workflow at least for the first release.

earowang commented 5 years ago

add_features() returns something different from (or inconsistent with) features(). I might also want a right_join()/inner_join()/semi_join() on a subset of keyed series, not always a left_join().
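
To illustrate the point about choosing the join, here is a minimal sketch of the explicit workflow. It uses a small made-up tsibble (`toy`, with hypothetical columns `id`, `year` and `value`) rather than `wages_ts`, and an `avg` feature defined just for this example, so treat it as an assumption-laden illustration and not as the recommended API:

```r
library(dplyr)
library(tsibble)
library(fabletools)

# Hypothetical toy data: two keyed series, three yearly observations each
toy <- tsibble(
  id    = rep(c("a", "b"), each = 3),
  year  = rep(2001:2003, times = 2),
  value = c(1, 2, 3, 4, 6, 8),
  key = id, index = year
)

# One row of features per key; `avg` is a name chosen for this example
feats <- toy %>% features(value, list(avg = mean))

# Attach the features to every observation (what add_features() would do)
joined_all <- toy %>% left_join(feats, by = key_vars(toy))

# Or keep only the series whose feature meets a condition
high <- feats %>% filter(avg > 3)
joined_high <- toy %>% semi_join(high, by = key_vars(toy))
```

Keeping `features()` and the join as separate steps leaves the choice of join, and of which keyed series to keep, with the user.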