hendersontrent / theft

R package for Tools for Handling Extraction of Features from Time series (theft)
https://hendersontrent.github.io/theft/

Dealing with very large sample sizes and time series #251

Open hummuscience opened 1 year ago

hummuscience commented 1 year ago

Thinking of ways to revive this, I was trying out theft with a small subset of my data to see how well it runs.

Now that I am trying to scale this up to a larger dataset (3k samples with 4k time points), things are starting to take a lot of time. For example, I am currently only calculating catch22 and tsfeatures; feasts is taking so long that I am skipping it, and the Python packages are having trouble running (I will dig a bit deeper there and make a separate issue).

Even though tsfeatures is running, it takes about 30 minutes on my machine (Ryzen 7 4800H with 64 GB of RAM). I saw on the tsfeatures GitHub repo that it supports parallel computing by default (https://github.com/robjhyndman/tsfeatures/blob/master/R/featurematrix.R), but it doesn't seem to utilize the cores correctly. Maybe I need to switch to future::multicore instead, since I am on Unix.
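For reference, this is roughly what I am thinking of trying. The plan/worker setup is the standard future API; whether the installed tsfeatures version still exposes the parallel and multiprocess arguments is an assumption on my part (check ?tsfeatures), and tslist is just a placeholder for a list of the individual series.

```r
library(future)
library(tsfeatures)

# multicore forks workers (Unix only); on Windows, multisession is the fallback
plan(multicore, workers = 8)

# tslist is a placeholder for a list of the individual time series;
# the parallel/multiprocess arguments are my assumption about the current
# tsfeatures API, so verify against ?tsfeatures
feats <- tsfeatures(tslist, parallel = TRUE, multiprocess = future::multicore)

# back to sequential execution afterwards
plan(sequential)
```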

I have now tried scaling to the full size of my data (12k samples, 4k time points) and things are getting tricky. I suspect it would be possible without the dplyr steps/pipes in between, as those blow up; the long format does not handle such sizes that well.

I think I will batch process for now, but maybe it makes sense to have an internal solution for this.
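Something like this is what I mean by batch processing. The column names (id, timepoint, values), the batch size, and the calculate_features argument names are assumptions on my end, so adjust to the real data and check ?calculate_features.

```r
library(theft)
library(dplyr)

ids     <- unique(my_long_data$id)                    # hypothetical long-format data
batches <- split(ids, ceiling(seq_along(ids) / 500))  # ~500 series per batch

results <- lapply(batches, function(batch_ids) {
  my_long_data %>%
    filter(id %in% batch_ids) %>%
    calculate_features(
      id_var      = "id",
      time_var    = "timepoint",
      values_var  = "values",
      feature_set = c("catch22", "tsfeatures")
    )
})

# combine the per-batch outputs; depending on the theft version the return
# value may carry an extra class, but it is a data frame underneath
all_features <- bind_rows(results)
```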

hendersontrent commented 1 year ago

Yeah, long computation time is an issue I have been grappling with; tsfeatures is by far the slowest (see my paper for more). I have gotten a long way with recoding the slowest tsfeatures features into C++ in this repo, but it is unlikely that work would ever be incorporated into the broader package, so it's currently just a personal package. One of the core issues cannot be solved by parallelisation: some features are just extremely slow, in that they either fit complex statistical models or search over the space of T timepoints to calculate a quantity, and the latter is sometimes slow in R.

I definitely agree that theft's internals could be optimised for performance. Much of the calculate_features code is "legacy" from when I first started the package, only had the R feature sets in there, and was working with smaller datasets.

hummuscience commented 1 year ago

Probably need to do some benchmarking to find out which parts of the code would benefit the most from a performance boost.
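A quick way to get that picture is to profile a single calculate_features call on a small subset; profvis is the usual tool for this (the argument names are again my assumption, so check ?calculate_features).

```r
library(profvis)

# profile one call on a small subset to see where the time goes
# (feature computation vs. the reshaping around it)
profvis::profvis({
  theft::calculate_features(
    data        = small_subset,   # hypothetical: a few hundred series
    id_var      = "id",
    time_var    = "timepoint",
    values_var  = "values",
    feature_set = "tsfeatures"
  )
})
```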

As for the dplyr pipes, using dtplyr, which provides a dplyr front-end to data.table, might already lead to a performance boost (and it is also part of the tidyverse).
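As a minimal example of what that swap would look like (the column names are placeholders for theft's internal long format):

```r
library(dtplyr)
library(dplyr)

# wrap the long-format data in a lazy data.table; the same dplyr verbs then
# compile down to data.table operations
long_lazy <- lazy_dt(my_long_data)

summarised <- long_lazy %>%
  group_by(id) %>%
  summarise(mean_value = mean(values, na.rm = TRUE)) %>%
  as_tibble()   # collect the result back into a regular tibble
```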

hendersontrent commented 1 year ago

Definitely the internal calculator functions that compute each set, I believe. They are mostly just dplyr::group_by %>% dplyr::summarise calls, where the summarise call contains the call to each package's respective feature calculation function. I'm sure changing to dtplyr, or even a crafty purrr::map_dfr or something, might provide improvements. There is definitely a ceiling we can reach before the slow individual features in R prevent further gains (except for catch22, as it's coded in C within Rcatch22).

I unfortunately don't have time to work on this at the moment due to PhD work (which is just applying theft), but I hope to get to improvements like this sometime soon, and I would love to help in any way I can if you decide to give it a go.
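To make that concrete, here is a rough sketch of the shape of those internals and one possible map-based alternative. It is schematic only: the column names are assumptions about the long format, and Rcatch22::catch22_all() stands in for whichever feature calculation function each set uses.

```r
library(dplyr)
library(purrr)
library(Rcatch22)

# Roughly the current shape: one group_by + summarise per feature set, with
# the package's calculation function inside summarise (needs dplyr >= 1.0,
# since summarise returns a multi-row data frame per group here)
catch22_via_dplyr <- function(data) {
  data %>%
    group_by(id) %>%
    summarise(catch22_all(values), .groups = "drop")
}

# One alternative: split by id and map over the pieces, which sidesteps the
# large grouped tibble and would be easy to parallelise later (e.g. furrr)
catch22_via_map <- function(data) {
  data %>%
    split(.$id) %>%
    map_dfr(~ catch22_all(.x$values), .id = "id")
}
```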