koaning / scikit-lego

Extra blocks for scikit-learn pipelines.
https://koaning.github.io/scikit-lego/
MIT License

[FEATURE] Wrapping any estimator for caching at fit, predict, and transform time. #706

Open antngh opened 1 week ago

antngh commented 1 week ago

Please let me know if this is not the correct place or way to start this discussion.

I have some code for a wrapper around an estimator (transformer or predictor) that quickly saves the fitted object and the data to disk. If the wrapped estimator is called again with an identical instance (same parameters etc.) and the same input data, it fetches the result from disk rather than rerunning the corresponding fit/predict/transform/(etc.) code. The wrapped estimator behaves exactly as the underlying estimator would in all the cases I've tested.
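
Roughly, the idea is something like this (a heavily simplified sketch with illustrative names, not the actual API; it only shows caching at fit time, whereas the real code also covers predict/transform and more edge cases):

```python
from pathlib import Path

import joblib
from sklearn.base import BaseEstimator, MetaEstimatorMixin, clone


class CachedEstimator(BaseEstimator, MetaEstimatorMixin):
    """Illustrative wrapper: caches fit results on disk, keyed on the
    estimator's parameters and a hash of the training data."""

    def __init__(self, estimator, cache_dir="estimator_cache"):
        self.estimator = estimator
        self.cache_dir = cache_dir

    def _cache_path(self, X, y=None):
        # Key combines the estimator's params with hashes of the inputs.
        key = joblib.hash(
            (self.estimator.get_params(), joblib.hash(X), joblib.hash(y))
        )
        return Path(self.cache_dir) / f"{key}.joblib"

    def fit(self, X, y=None):
        path = self._cache_path(X, y)
        if path.exists():
            # Cache hit: load the previously fitted estimator instead of refitting.
            self.estimator_ = joblib.load(path)
        else:
            path.parent.mkdir(parents=True, exist_ok=True)
            self.estimator_ = clone(self.estimator).fit(X, y)
            joblib.dump(self.estimator_, path)
        return self

    def predict(self, X):
        return self.estimator_.predict(X)

    def transform(self, X):
        return self.estimator_.transform(X)
```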

Sklearn provides something similar with the `memory` argument of the `Pipeline` class, but it only covers fitting, not inference, and even then it doesn't apply to the last step in the pipeline.
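
For reference, the built-in version looks like this, and this is where it stops:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=1_000, n_features=10, random_state=0)

# Built-in caching: fitted *intermediate* transformers are memoized on disk,
# but the final step is always refit and predict/transform is never cached.
pipe = Pipeline(
    steps=[("scale", StandardScaler()), ("model", Ridge())],
    memory="pipeline_cache",  # directory used by joblib.Memory
)
pipe.fit(X, y)   # a second identical fit reuses the cached StandardScaler
pipe.predict(X)  # inference is recomputed every time
```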

This is especially useful when a pipeline contains an estimator with a slow predict/transform step and you want to iterate on that pipeline quickly. The step is rerun if needed (if either the estimator or the input data has changed), but otherwise it just loads from file. The cache also persists across runs: if you restart the kernel or rerun the script, you can pick up where you left off. It isn't intended as a data store, but it can really speed up the development of pipelines with slow steps.

Please let me know if this functionality, in full or in part - say, the code that checks whether the data is the same - could be useful, and I will look into adding it to this repo. I can't commit to fully maintaining it going forward, but as of now it seems to work well.

antngh commented 1 week ago

You can see the code here: https://github.com/antngh/sklearn-estimator-caching

FBruzzesi commented 1 week ago

Hey @antngh, thanks for the issue and for already putting the effort into this. I took a sneak peek at the repo, and just by its volume it could deserve to be its own project/repo.

I can imagine people having multiple use cases for such a caching mechanism, and therefore different feature requests for it.

I will wait for @koaning to weigh in on this as well.

koaning commented 1 week ago

I have also observed pipelines becoming slower with caching on the sklearn side. If the numpy array going in is huge, the hashing can actually be slower than the pipeline itself. It doesn't happen all the time, but it's worth keeping in the back of your mind.
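
If you want a feel for that hashing cost in isolation, something like this gives a rough number:

```python
import time

import numpy as np
from joblib import hash as joblib_hash

# Rough check of the hashing cost that caching pays on every call.
X = np.random.default_rng(0).standard_normal((10_000_000, 10))

start = time.perf_counter()
joblib_hash(X)
print(f"hashing 10M x 10 floats took {time.perf_counter() - start:.2f}s")
```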

I wonder: if the final element of a pipeline is skipped, why not add a FunctionTransformer at the end? It behaves like an identity function if you pass no arguments, but it will act as the "final" transformer. Does that not work with the memory flag in a normal pipeline?
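
Something along these lines:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Appending a no-op FunctionTransformer makes the previously-last transformer
# an intermediate step, so Pipeline's memory-based caching applies to it too.
pipe = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("identity", FunctionTransformer()),  # identity when no func is given
    ],
    memory="pipeline_cache",
)
```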

Another sensible way to cache an intermediate output is to manually keep a transformed array in memory or to write it to disk from there. This only works if you know exactly what needs to be remembered and if it does not change, but it might be easier to reason about than the hashing involved in a caching mechanism. I am personally a little hesitant to support it here because I am a bit wary of all the edge cases.
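
Roughly:

```python
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).standard_normal((1_000, 10))

# Manual alternative: persist the transformed array once and reload it later,
# provided you know the transformer and its inputs will not change.
X_trans = StandardScaler().fit_transform(X)
joblib.dump(X_trans, "scaled.joblib")

X_trans = joblib.load("scaled.joblib")  # in a later run, skip the slow step
```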

But if you have a compelling benchmark, I'd sure be all ears!

antngh commented 1 week ago

Thanks both.

Some timings here: https://github.com/antngh/sklearn-estimator-caching/blob/main/notebooks/caching_timing.ipynb. For 10 million rows and 10 columns, the overhead added by the wrapper is around 4 seconds (on par with or slightly less than using a pipeline with memory). The second call is similar, whereas for the pipeline with memory it is about 1.5 seconds. (Edit: for 100M rows my code takes around 2 minutes on both the first and second call, whereas sklearn's memory takes about 1 minute on the first call and 20 seconds on the second.)

The overhead for the inference calls is about the same as for the fit calls (the pipeline has no equivalent caching here).

I first created this code specifically because of some very slow custom transformers I was working with. In my case it wasn't a normal transformer with a huge dataset, but rather a large dataset combined with a very slow transformation step. In that case I see a huge improvement when using this wrapper. You're right that we could manually save/load the data, but that quickly becomes hard to track and manage.

> I am a bit wary of all the edge cases.

I fully understand.