hummuscience opened this issue 1 year ago
Yeah, long computation time is an issue I have been grappling with. `tsfeatures` is by far the slowest (see my paper for more). I have gotten a long way by recoding the slowest features in `tsfeatures` into C++ in this repo, but it is unlikely that it would ever be incorporated into the broader package, so it's currently just a personal package. One of the core issues cannot be solved by parallelisation: some features are just extremely slow, in that they either fit complex statistical models or search over the space of T timepoints to calculate a quantity. The latter is sometimes slow in R.
I definitely agree that `theft`'s internals could be optimised for performance. Much of the `calculate_features` code is "legacy" from when I first started the package, when it only included the R feature sets and I was working with smaller datasets. We probably need to do some benchmarking to find out which parts of the code would benefit the most from a performance boost.
As for the dplyr pipes, using `dtplyr`, which provides a dplyr front-end to data.table, might already lead to a performance boost (and it is also part of the tidyverse).
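As an illustrative sketch (not `theft`'s actual internals; the data and column names are made up), the change can be as small as wrapping the data in `dtplyr::lazy_dt()` before the usual dplyr verbs:

```r
library(dplyr)
library(dtplyr)

# Hypothetical long-format data: one row per (id, timepoint)
ts_long <- data.frame(
  id     = rep(1:100, each = 50),
  values = rnorm(5000)
)

# Plain dplyr version
res_dplyr <- ts_long %>%
  group_by(id) %>%
  summarise(mean_val = mean(values), sd_val = sd(values))

# dtplyr version: wrap in lazy_dt(); the same verbs are
# translated to data.table operations under the hood
res_dt <- ts_long %>%
  lazy_dt() %>%
  group_by(id) %>%
  summarise(mean_val = mean(values), sd_val = sd(values)) %>%
  as_tibble()  # materialise the lazy result
```

Because `lazy_dt()` defers evaluation until `as_tibble()` (or similar) is called, the surrounding pipeline code stays almost unchanged.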
Definitely the internal calculator functions that compute each set, I believe. They are mostly just `dplyr::group_by() %>% dplyr::summarise()` calls, where the `summarise()` call contains the call to each package's respective feature calculation function. I'm sure changing to `dtplyr`, or even a crafty `purrr::map_dfr()` or something, might provide improvements. There is definitely a ceiling we can reach before the slow individual features in R prevent further gains (except for `catch22`, as it's coded in C within `Rcatch22`). I unfortunately don't have time to work on this at the moment due to PhD stuff, as I'm just applying `theft` (but I hope to get to improvements like this sometime soon!). I would love to help in any way I can if you decide to give it a go.
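To make the pattern concrete, here is a minimal sketch of the kind of internal call structure described above, using `Rcatch22::catch22_all()` as the feature backend (the data is a toy example, not `theft`'s actual code):

```r
library(dplyr)
library(purrr)
library(Rcatch22)

# Toy long-format data: one row per (id, timepoint)
ts_long <- data.frame(
  id     = rep(c("a", "b"), each = 100),
  values = rnorm(200)
)

# Pattern 1: group_by() %>% summarise() wrapping the feature function
# (catch22_all() returns a names/values data frame per series)
feats_grouped <- ts_long %>%
  group_by(id) %>%
  summarise(catch22_all(values), .groups = "drop")

# Pattern 2: split + purrr::map_dfr() as a possible alternative
feats_mapped <- ts_long %>%
  split(.$id) %>%
  map_dfr(~ catch22_all(.x$values), .id = "id")
```

Benchmarking both patterns (e.g. with `bench::mark()`) on realistic data sizes would show whether the grouping machinery, rather than the feature functions themselves, is the bottleneck.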
Thinking of ways to revive this: I was trying out `theft` with a small subset of my data to see how well it runs. Now that I am trying to scale up to a larger dataset (3k samples with 4k time points), things are starting to take a lot of time. For example, I am currently only calculating `catch22` and `tsfeatures`; `feasts` is taking so long that I am skipping it. The Python packages are having trouble running (I will dig a bit deeper there and open a separate issue).
Even though `tsfeatures` runs, it takes about 30 minutes on my machine (Ryzen 7 4800H with 64 GB of RAM). I saw in the `tsfeatures` GitHub repo that it supports parallel computation by default (https://github.com/robjhyndman/tsfeatures/blob/master/R/featurematrix.R), but it doesn't seem to utilize the cores correctly. Maybe I need to switch to `future::multicore`, since I am on Unix.
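For reference, a sketch of what I mean, assuming `tsfeatures`' documented `parallel` argument and that the externally set future plan is respected (whether it is may depend on the package version; the worker count here is just an example):

```r
library(tsfeatures)
library(future)

# On Unix, forked (multicore) workers avoid the cost of
# serialising the data to each worker process
plan(multicore, workers = 8)

# Toy list of series standing in for the real data
ts_list <- replicate(50, ts(rnorm(500)), simplify = FALSE)

feats <- tsfeatures(ts_list, parallel = TRUE)
```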
I tried scaling up to the full size of my data (12k samples, 4k time points) and things are getting tricky. I suspect it would be possible without the intermediate dplyr steps/pipes, but they blow up in memory, and the long format does not handle data of this size that well. I will batch-process for now, but maybe it makes sense to have an internal solution for this.
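A rough sketch of the batching workaround I have in mind, assuming `theft::calculate_features()` takes `id_var`/`time_var`/`values_var` arguments (check the names against your installed version) and that `ts_long` holds the long-format data:

```r
library(theft)

# ts_long is assumed to have columns: id, timepoint, values
ids     <- unique(ts_long$id)
batches <- split(ids, ceiling(seq_along(ids) / 500))  # ~500 series per batch

results <- lapply(batches, function(batch_ids) {
  calculate_features(
    data        = ts_long[ts_long$id %in% batch_ids, ],
    id_var      = "id",
    time_var    = "timepoint",
    values_var  = "values",
    feature_set = "catch22"
  )
})

all_feats <- do.call(rbind, results)
```

The final `rbind` assumes the return value is data-frame-like; if your version of `theft` returns a wrapper object instead, the per-batch results would need to be unwrapped before binding.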