Closed: paxcema closed this pull request 1 year ago
Greetings,
I can confirm that this PR speeds up learning by at least 15% on all test datasets.
The speed-up is less noticeable on larger datasets (>10K rows) and seems to decrease non-linearly with the number of rows: the gain drops from more than 20% to barely above 10% when scaling from 10,000 to 100,000 rows. I interpret this as an indication that the performance issues with large datasets are not entirely solved by this PR.
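For context, this is a minimal sketch of how such a relative speed-up could be measured. The `benchmark` and `speedup` helpers are hypothetical and not part of this repository:

```python
import time

def benchmark(fn, *args, repeats=3):
    """Run fn several times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

def speedup(baseline_s, candidate_s):
    """Relative speed-up of candidate vs. baseline, e.g. 0.15 == 15% faster."""
    return 1 - candidate_s / baseline_s
```

Timing `learn()` on the same dataset before and after the branch, and feeding both numbers to `speedup`, gives the percentages quoted above.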
Finally, judging by the results of the unit tests (nicely done!), the implemented changes seem to break some of the functionality used for time series analysis/models.
Great! I'll get the tests passing and merge.
Changelog
- `RandomForestMixer`: reduce the number of hyperparameter search trials from 20 to 5.
- Rename `filter_ds` to `filter_ts`. Change its signature and usage so that it directly interacts with `pd.DataFrame`s instead.
- Add `Predictor` code that maintains three `EncodedDs` objects through the `learn()` pipeline, deleting them at the end to avoid bloated binaries. This helps reduce the number of instantiations, as all invocations go through `featurize()`, which checks (and potentially uses) the cache first.
- Remove the `EncodedDs` object inside `analysis`, reverting to use the predictor's cache instead.

Effect: this reduces `learn()` runtime anywhere from 15% to 50%, depending on the dataset and predictor properties.
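As an illustration of the caching pattern described in the changelog, here is a minimal sketch: `featurize` encodes each data split only once and reuses the cached result on later calls, and the cache can be cleared before serialization so the saved predictor binary stays small. All names here (`SketchPredictor`, `_encode`, `clear_cache`) are illustrative stand-ins, not the actual lightwood API:

```python
class SketchPredictor:
    """Illustrative featurize-and-cache pattern; not the real implementation."""

    def __init__(self):
        self._cache = {}  # split name -> encoded rows

    def featurize(self, splits):
        """Encode each split once; later calls reuse the cached result."""
        out = {}
        for name, rows in splits.items():
            if name not in self._cache:  # cache miss: encode and store
                self._cache[name] = self._encode(rows)
            out[name] = self._cache[name]
        return out

    def _encode(self, rows):
        # Stand-in for the real per-column encoders.
        return [tuple(float(v) for v in row) for row in rows]

    def clear_cache(self):
        # Drop cached encoded data before saving, keeping binaries small.
        self._cache.clear()
```

Routing every encoding request through one cache-aware entry point is what avoids the repeated instantiations the changelog mentions.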