best method for downsampling? #2

Open jeff-goldsmith opened 2 years ago

jeff-goldsmith commented 2 years ago

if you have functions measured over a rich grid, you might want to downsample (e.g. go from minute-level wearable device data to 5 minute or one-hour increments). if that's your goal, you might prefer to average over bins -- but i don't think there's a good way to do that right now, is there?

tf_evaluate lets you evaluate on a new domain, but uses interpolation rather than averaging. and tf_integrate could work, sort of, in that you get average value between lower and upper -- but it produces a scalar, and you'd have to do some kind of loop.
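
To make the interpolation-versus-averaging distinction concrete, here is a minimal base R sketch on a simulated minute-level curve. It does not use tidyfun at all: `approx()` is only a stand-in for "evaluate on a coarser grid by interpolation", `tapply()` does the bin averaging, and the data and the names `minute`, `activity`, `interp_hourly` and `binned_hourly` are made up for illustration.

```r
set.seed(1)
minute   <- 1:1440                              # one day on a minute-level grid
activity <- rpois(length(minute), lambda = 3)   # made-up activity counts

# Interpolation onto an hourly grid: each hourly value depends only on the
# observations at (or right next to) that grid point.
hourly_grid   <- seq(60, 1440, by = 60)
interp_hourly <- approx(minute, activity, xout = hourly_grid)$y

# Bin averaging: each hourly value is the mean of all 60 minutes in that hour.
hour          <- floor((minute - 1) / 60)       # 0, 1, ..., 23
binned_hourly <- as.numeric(tapply(activity, hour, mean))

# The hourly grid points are themselves observed minutes, so interpolation
# simply returns the single observed value at each of them.
all.equal(interp_hourly, as.numeric(activity[hourly_grid]))
length(binned_hourly)
```

The interpolated curve throws away 59 of every 60 observations, while the binned curve uses all of them, which is usually what you want when downsampling noisy wearable data.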

for what it's worth, my current workaround is to unnest, aggregate, then nest and re-merge. something like:

hour_data = 
  activity_df %>% 
  select(id, activity) %>% 
  tf_unnest(activity) %>% 
  mutate(hour = floor((activity_arg -1) / 60)) %>% 
  group_by(id, hour) %>% 
  summarize(act_hour = mean(activity_value)) %>% 
  tf_nest(.id = id, .arg = hour)

fabian-s commented 2 years ago

you can get running medians or averages from tf_smooth, and then only keep the arg-vals you care about. EDIT: but in that version, we do need a function to put the tf-object on a different domain measured in different units --> tidyfun/tf#6....

#> Registered S3 method overwritten by 'GGally':
#>   method from   
#>   +.gg   ggplot2
#> Attaching package: 'tidyfun'
#> The following objects are masked from 'package:stats':
#>     sd, var

hour_data <- chf_df %>% filter(day == "Mon") |> 
  select(id, activity) |> 
  tf_unnest(activity) %>% 
  mutate(hour = floor((activity_arg -1) / 60)) %>% 
  group_by(id, hour) %>% 
  summarize(act_hour = mean(activity_value)) %>% 
  tf_nest(.id = id, .arg = hour)
#> `summarise()` has grouped output by 'id'. You can override using the `.groups`
#> argument.

hour_data2 <- chf_df %>% filter(day == "Mon") |> 
  select(id, activity) |> 
  mutate(act_hour2 = 
           tf_smooth(activity, method = "rollmean", k = 60, align = "right") |> 
           tfd(arg = seq(60, 1440, by = 60)))
#> setting fill = 'extend' for start/end values.

ggplot(hour_data) + geom_spaghetti(aes(y = act_hour))

ggplot(hour_data2) + geom_spaghetti(aes(y = act_hour2))

Created on 2022-04-26 by the reprex package (v2.0.1)
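
As a sanity check on why the two versions above should agree, here is a small base R sketch under simplifying assumptions: a single simulated curve on a complete grid of minutes 1 to 1440, with `stats::filter()` standing in for the right-aligned rolling mean (it is not what `tf_smooth` calls internally as far as this thread shows). The names `roll60`, `roll_hourly` and `bin_hourly` are illustrative only.

```r
set.seed(1)
minute   <- 1:1440
activity <- rpois(length(minute), lambda = 3)   # made-up activity counts

# Right-aligned rolling mean over 60 minutes: the value at minute t averages
# minutes t-59, ..., t (NA for the first 59 minutes).
roll60 <- as.numeric(stats::filter(activity, rep(1 / 60, 60), sides = 1))

# Keep only the end-of-hour grid points 60, 120, ..., 1440.
roll_hourly <- roll60[seq(60, 1440, by = 60)]

# Bin means as in the unnest / aggregate workaround.
bin_hourly <- as.numeric(tapply(activity, floor((minute - 1) / 60), mean))

# Same values; only the labels differ (60, ..., 1440 versus 0, ..., 23).
all.equal(roll_hourly, bin_hourly)
```

On a regular grid with no missing minutes, a right-aligned window of width 60 evaluated at multiples of 60 covers exactly one bin, so the two approaches differ only in how the resulting grid is labelled.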

jeff-goldsmith commented 2 years ago

makes sense! if this turns out to be a common operation, we could introduce something to handle this directly -- or at least update a vignette somewhere.

jeff-goldsmith commented 2 years ago

strange lil update on this: if you unnest and re-nest the result of this binning process, you can plot the created tf object but can't print it.

hour_data2 <- 
  chf_df %>% 
  filter(day == "Mon") |> 
  select(id, activity) |> 
  mutate(act_hour = 
           tf_smooth(activity, method = "rollmean", k = 60, align = "right") |> 
           tfd(arg = seq(60, 1440, by = 60)))  |> 
  select(-activity) %>% 
  tf_unnest(act_hour) %>% 
  tf_nest(act_hour_value, .id = id, .arg = act_hour_arg)
#> setting fill = 'extend' for start/end values.

ggplot(hour_data2) + geom_spaghetti(aes(y = act_hour_value))

#> Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, : invalid value 0 for 'digits' argument

Created on 2022-05-31 by the reprex package (v2.0.1)

if you use tfd(arg = seq(1, 1440, by = 1)), everything seems to work ...

guessing the issue is inside print-format.R but don't have a handle on what's going on.

fabian-s commented 2 years ago

can't reproduce this with tidyfun@42cb5d8 (latest commit in dev) -- what package versions are you using?

hour_data2 <- 
  chf_df %>% 
  filter(day == "Mon") |> 
  select(id, activity) |> 
  mutate(act_hour = 
           tf_smooth(activity, method = "rollmean", k = 60, align = "right") |> 
           tfd(arg = seq(60, 1440, by = 60)))  |> 
  select(-activity) %>% 
  tf_unnest(act_hour) %>% 
  tf_nest(act_hour_value, .id = id, .arg = act_hour_arg)
#> setting fill = 'extend' for start/end values.
ggplot(hour_data2) + geom_spaghetti(aes(y = act_hour_value))

#> # A tibble: 47 × 2
#>       id                     act_hour_value
#>    <dbl>                          <tfd_reg>
#>  1     1  [1]: ( 60,0);(120,1);(180,0); ...
#>  2     3  [2]: ( 60,6);(120,5);(180,4); ...
#>  3     4  [3]: ( 60,2);(120,2);(180,1); ...
#>  4     5  [4]: ( 60,6);(120,6);(180,5); ...
#>  5     6  [5]: ( 60,1);(120,0);(180,1); ...
#>  6     7  [6]: ( 60,1);(120,1);(180,1); ...
#>  7     8  [7]: ( 60,0);(120,0);(180,0); ...
#>  8     9  [8]: ( 60,0);(120,1);(180,0); ...
#>  9    10  [9]: ( 60,1);(120,1);(180,1); ...
#> 10    11 [10]: ( 60,1);(120,1);(180,1); ...
#> # … with 37 more rows
> sessioninfo::session_info()
─ Session info ─────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.1.0 (2021-05-18)
 os       Linux Mint 20
 system   x86_64, linux-gnu
 ui       RStudio
 language en_US
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Berlin
 date     2022-06-01
 rstudio  1.4.1717 Juliet Rose (desktop)
 pandoc   2.11.4 @ /usr/lib/rstudio/bin/pandoc/ (via rmarkdown)

─ Packages ─────────────────────────────────────────────────────────────────────────

jeff-goldsmith commented 2 years ago

definitely a package version issue -- i upgraded to R 4.2.0 on the machine where it breaks, and also can't reproduce the issue elsewhere.


hour_data2 <- 
  chf_df %>% 
  filter(day == "Mon") |> 
  select(id, activity) |> 
  mutate(act_hour = 
           tf_smooth(activity, method = "rollmean", k = 60, align = "right") |> 
           tfd(arg = seq(60, 1440, by = 60)))  |> 
  select(-activity) %>% 
  tf_unnest(act_hour) %>% 
  tf_nest(act_hour_value, .id = id, .arg = act_hour_arg)
#> Error in tf_nest(., act_hour_value, .id = id, .arg = act_hour_arg): could not find function "tf_nest"

ggplot(hour_data2) + geom_spaghetti(aes(y = act_hour_value))
#> Error in ggplot(hour_data2): object 'hour_data2' not found

#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22)
#>  os       macOS Big Sur/Monterey 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2022-06-01
#>  pandoc @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/ (via rmarkdown)
#> ─ Packages ───────────────────────────────────────────────────────────────────
Created on 2022-06-01 by the reprex package (v2.0.1)

fabian-s commented 2 years ago

> definitely a package version issue

those are my FAVORITES :-1:

fabian-s commented 2 years ago

> makes sense! if this turns out to be a common operation, we could introduce something to handle this directly -- or at least update a vignette somewhere.

do this

fabian-s commented 6 months ago

i've added getter/setter functions that let you do this now -- @jeff-goldsmith is this roughly what you would want here or do we need a more convenient way to do this?

> library(tidyverse)
> act <- tidyfun::chf_df |> pull(activity)
> act_hour <- tf_smooth(act, method = "rollmean", k = 60, align = "right") |> 
+            tfd(arg = seq(60, 1440, by = 60))
setting fill = 'extend' for start/end values.
> tf_domain(act_hour) <- tf_domain(act_hour)/60
Warning message:
In `tf_domain<-`(`*tmp*`, value = c(1, 24)) :
  This changes the functions' domain but not the argument values!
To restrict functions to a part of their domain, use tf_zoom.
> tf_arg(act_hour) <- tf_arg(act_hour)/60
Warning message:
In `tf_arg<-`(`*tmp*`, value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,  :
  This changes arguments (and resolution) without changing the corresponding function values!
In order to re-evaluate functions on a new grid, use tf_interpolate.
> act_hour
tfd[329] on (1,24) based on 24 evaluations each
interpolation by tf_approx_linear 
[1]: (1,0.4);(2,1.2);(3,0.3); ...

jeff-goldsmith commented 6 months ago

I think this helps! The downsampling is primarily done through tf_smooth() but having finer control over domain and arg through these functions will help here and in a lot of other settings.
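
For readers who want the whole operation (average over bins, then relabel the grid in the coarser unit) in one place, here is a rough sketch on plain vectors. `downsample_mean()` is a hypothetical helper written for this example; it is not part of tidyfun or tf, and it assumes a single curve on a regular grid with no missing values.

```r
# Hypothetical helper, NOT a tidyfun function: average `value` over consecutive
# bins of `width` original grid units and label each bin by its right end
# point, expressed in units of `width` (e.g. minutes -> hours for width = 60).
downsample_mean <- function(arg, value, width) {
  stopifnot(length(arg) == length(value), width > 0)
  bin <- ceiling(arg / width)        # minutes 1..60 -> 1, 61..120 -> 2, ...
  data.frame(
    arg   = sort(unique(bin)),       # same order as tapply()'s sorted levels
    value = as.numeric(tapply(value, bin, mean))
  )
}

set.seed(1)
minute   <- 1:1440
activity <- rpois(length(minute), lambda = 3)   # made-up activity counts

hourly <- downsample_mean(minute, activity, width = 60)
range(hourly$arg)   # new grid runs from hour 1 to hour 24
nrow(hourly)        # 24 evaluations
```

The resulting grid of 24 points on (1, 24) has the same shape as the `act_hour` object printed above; with tidyfun objects, the `tf_smooth()` plus `tf_arg<-` / `tf_domain<-` route shown in the thread does the equivalent on the whole vector of curves.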