Closed: paxcema closed this pull request 1 year ago
Greetings,
I can confirm that this PR speeds up learning by at least 15% on all test datasets.
The speed-up is less noticeable on larger datasets (>10K rows) and seems to decrease non-linearly with the number of rows: the gain drops from more than 20% to barely above 10% when scaling from 10,000 to 100,000 rows. I interpret this as an indication that the performance issues with large datasets are not entirely solved by this PR.
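For context, this is a minimal sketch of how such a relative speed-up could be measured. The `benchmark` and `speedup` helpers are hypothetical and not part of this repository:

```python
import time

def benchmark(fn, *args, repeats=3):
    """Run fn several times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

def speedup(baseline_s, candidate_s):
    """Relative speed-up of candidate vs. baseline, e.g. 0.15 == 15% faster."""
    return 1 - candidate_s / baseline_s
```

Timing `learn()` on the same dataset before and after the branch, and feeding both numbers to `speedup`, gives the percentages quoted above.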
Finally, judging by the results of the unit tests (nicely done!), the implemented changes seem to break some of the functionality used for time series analysis/models.
Great! I'll get the tests passing and merge.
Changelog
- `RandomForestMixer`: reduce the number of hyperparameter search trials from 20 to 5.
- Rename `filter_ds` to `filter_ts`. Change its signature and usage so that it directly interacts with `pd.DataFrame`s instead.
- Add `Predictor` code that maintains three `EncodedDs` objects through the `learn()` pipeline, deleting them at the end to avoid bloated binaries. This helps reduce the number of instantiations, as all invocations go through `featurize()`, which checks (and potentially uses) the cache first.
- Remove the `EncodedDs` object inside `analysis`, reverting to use the predictor's cache instead.

Effect: this reduces `learn()` runtime anywhere from 15% to 50%, depending on the dataset and predictor properties.
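As an illustration of the caching pattern described in the changelog, here is a minimal sketch: `featurize` encodes each data split only once and reuses the cached result on later calls, and the cache can be cleared before serialization so the saved predictor binary stays small. All names here (`SketchPredictor`, `_encode`, `clear_cache`) are illustrative stand-ins, not the actual lightwood API:

```python
class SketchPredictor:
    """Illustrative featurize-and-cache pattern; not the real implementation."""

    def __init__(self):
        self._cache = {}  # split name -> encoded rows

    def featurize(self, splits):
        """Encode each split once; later calls reuse the cached result."""
        out = {}
        for name, rows in splits.items():
            if name not in self._cache:  # cache miss: encode and store
                self._cache[name] = self._encode(rows)
            out[name] = self._cache[name]
        return out

    def _encode(self, rows):
        # Stand-in for the real per-column encoders.
        return [tuple(float(v) for v in row) for row in rows]

    def clear_cache(self):
        # Drop cached encoded data before saving, keeping binaries small.
        self._cache.clear()
```

Routing every encoding request through one cache-aware entry point is what avoids the repeated instantiations the changelog mentions.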