online-ml / river

🌊 Online machine learning in Python
https://riverml.xyz
BSD 3-Clause "New" or "Revised" License

Add decomposition methods OnlineSVD, OnlinePCA, OnlineDMD/wC + Hankelizer #1509

Open MarekWadinger opened 4 months ago

MarekWadinger commented 4 months ago

Hello @MaxHalford, @hoanganhngo610, and everyone 👋,

In https://github.com/online-ml/river/issues/1366, @MaxHalford showed interest in an implementation of the OnlinePCA and OnlineSVD methods in river.

Since my current project involves online decomposition methods, I believe the community could benefit from having access to these methods and from their maintenance over time. Additionally, I am particularly interested in DMD, which combines the advantages of PCA and FFT. Hence, I propose three new methods as part of a new decomposition module:

decomposition.OnlineSVD, implemented based on Brand, M. (2006) (proposed by @MaxHalford in the issue), with some considerations regarding re-orthogonalization. Since re-orthogonalization is required quite often, which compromises computation speed, it could be interesting to align with Zhang, Y. (2022) (I made some effort to implement it, but I have yet to explore its validity and the possibility of implementing revert in a similar vein).
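For readers who don't have Brand, M. (2006) at hand, here is a minimal sketch (mine, not the PR's code) of the column-append step such an OnlineSVD performs; truncation, revert, and re-orthogonalization are exactly the parts this sketch leaves out:

```python
import numpy as np


def append_column(U, s, Vt, c, tol=1e-10):
    """Brand (2006)-style update: append column c to X ≈ U @ diag(s) @ Vt.

    Bare-bones illustration only; no truncation, revert, or periodic
    re-orthogonalization.
    """
    m = U.T @ c                 # projection onto the current left subspace
    p = c - U @ m               # residual orthogonal to that subspace
    p_norm = np.linalg.norm(p)
    P = p / p_norm if p_norm > tol else np.zeros_like(p)

    r = s.size
    # Small (r+1) x (r+1) core matrix whose SVD rotates the old factors.
    K = np.zeros((r + 1, r + 1))
    K[:r, :r] = np.diag(s)
    K[:r, -1] = m
    K[-1, -1] = p_norm
    Uk, sk, Vtk = np.linalg.svd(K)

    U_new = np.hstack([U, P[:, None]]) @ Uk
    V_block = np.zeros((r + 1, Vt.shape[1] + 1))
    V_block[:r, :-1] = Vt
    V_block[-1, -1] = 1.0
    return U_new, sk, Vtk @ V_block


# Sanity check on a toy matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
c = rng.normal(size=6)
U2, s2, Vt2 = append_column(U, s, Vt, c)
assert np.allclose(U2 @ np.diag(s2) @ Vt2, np.column_stack([X, c]))
```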

decomposition.OnlinePCA, implemented based on Eftekhari, A. (2019) (proposed by @MaxHalford in the issue), as it is currently the state of the art, with all the proofs and guarantees. I would be happy to validate together whether all considerations are handled in the proposed OnlinePCA.
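For anyone who wants a feel for what a per-sample PCA update looks like, here is the classical Oja subspace rule with QR re-orthonormalization — deliberately not Eftekhari's algorithm, just a toy illustration with made-up names:

```python
import numpy as np


def oja_step(W, x, lr=0.01):
    """One Oja-style subspace update (a classic online-PCA heuristic).

    Not Eftekhari (2019); only meant to show the shape of a per-sample
    update. W has one column per tracked component.
    """
    y = W.T @ x                           # project the sample
    W = W + lr * np.outer(x - W @ y, y)   # Oja-style correction
    Q, _ = np.linalg.qr(W)                # keep the basis orthonormal
    return Q


# Toy usage on synthetic data with a dominant 2-D subspace.
rng = np.random.default_rng(42)
d, k = 5, 2
W = np.linalg.qr(rng.normal(size=(d, k)))[0]
for _ in range(1000):
    x = rng.normal(size=d)
    x[:2] *= 5.0                          # inflate variance along two axes
    W = oja_step(W, x)
```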

decomposition.OnlineDMD, implemented based on Zhang, H. (2019). It can operate as a MiniBatchTransformer or a MiniBatchRegressor (sort of), and it works with Rolling, so I would need some help figuring out how we would like to classify it (maybe a new base class Decomposer?).
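To make the classification question concrete, here is a purely hypothetical strawman of what a Decomposer base class could look like if it built on river's existing base.Transformer; the names and signatures below are mine and not part of the PR:

```python
import abc

from river import base


class Decomposer(base.Transformer):
    """Strawman for the base-class discussion above; not part of the PR.

    A transformer that maintains a factorisation of the stream seen so far
    and exposes its components, while keeping river's learn_one/transform_one
    contract (a MiniBatchTransformer variant could add learn_many on top).
    """

    @property
    @abc.abstractmethod
    def n_components(self) -> int:
        """Number of components currently tracked."""

    @abc.abstractmethod
    def learn_one(self, x: dict) -> None:
        """Update the factorisation with one observation."""

    @abc.abstractmethod
    def transform_one(self, x: dict) -> dict:
        """Project one observation onto the current components."""
```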

Additionally, I propose preprocessing.Hankelizer, which could be beneficial for various regressors and is particularly useful for enriching the feature space with a time-delayed embedding.
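To illustrate the idea, a minimal time-delay embedding transformer could look roughly like the sketch below; this is not the PR's Hankelizer, and the class name and window_size parameter are made up for illustration:

```python
from collections import deque

from river import base


class HankelizerSketch(base.Transformer):
    """Minimal time-delay embedding sketch (not the PR's Hankelizer).

    Keeps the last `window_size` observations and flattens them into one
    feature dict: after seeing x=1, 2, 3 with window_size=3 it emits
    {"x_0": 1, "x_1": 2, "x_2": 3}.
    """

    def __init__(self, window_size: int = 3):
        self.window_size = window_size
        self._buffer = deque(maxlen=window_size)

    def learn_one(self, x: dict):
        self._buffer.append(x)

    def transform_one(self, x: dict) -> dict:
        window = list(self._buffer)
        # Pad with the current sample until the window is full.
        while len(window) < self.window_size:
            window.append(x)
        return {
            f"{key}_{lag}": sample[key]
            for lag, sample in enumerate(window)
            for key in sample
        }
```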

I've tried to include all necessary tests. However, I need to investigate why re-orthogonalization in OnlineSVD yields significantly different values when tested on various operating systems (locally, all tests pass).

Looking forward to your comments and revisions. 😌

MaxHalford commented 4 months ago

Wow incredible work! I have limited time at the moment (baby + work), but I will find some to review your PR.

In the meantime, could you please provide some benchmarks? I'm curious as to the throughput of each method. Generally speaking, we don't like to accept methods that heavily leverage numpy, but we could make an exception here. It really depends on how many rows per second can be processed. Sounds good?
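(For concreteness, a rows-per-second harness could look roughly like the sketch below; the names are made up, the dataset is just an example, and the authoritative benchmark is the notebook linked later in this thread.)

```python
import time

from river import datasets, preprocessing


def rows_per_second(transformer, stream, n_samples=10_000):
    """Rough throughput estimate: samples processed per second by learn_one."""
    start = time.perf_counter()
    n = 0
    for x, _ in stream:
        transformer.learn_one(x)
        n += 1
        if n >= n_samples:
            break
    return n / (time.perf_counter() - start)


# Example with an existing river transformer; the PR's OnlineSVD / OnlinePCA /
# OnlineDMD would be benchmarked the same way.
print(rows_per_second(preprocessing.StandardScaler(), datasets.TrumpApproval()))
```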

MarekWadinger commented 4 months ago

@MaxHalford, thank you for the recognition! Your review would be priceless.

I will gladly provide benchmarks, as I am aware that the reliance on numpy is much higher than in river's usual methods. Nevertheless, I will also try to reduce it over time as I resolve some higher-priority issues.

hoanganhngo610 commented 4 months ago

First of all, thank you so much @MarekWadinger for taking the initiative to create this PR; I would say this is both a huge and an incredible piece of work.

From my side, I will also try my best to dedicate some time to reviewing the PR, at least making sure that the code quality and layout align with what is currently available within River.

review-notebook-app[bot] commented 1 month ago

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

MarekWadinger commented 1 month ago

Hello @MaxHalford and @hoanganhngo610, 👋

I believe the methods are ready for benchmarking. The results are published in this notebook.

In the plot I combine two checks: performance w.r.t. the number of features, and the delay imposed by the conversion from pd.DataFrame (dict) to the np.array used in the core.

[plot: perf-pd_np-n_features]

The mean number of processed samples per second is provided here (averaged over n features in range(3, 20), as it remains pretty stable):

The results in the notebook indicate that using pd.DataFrame slows down OnlinePCA, the fastest of the decomposition implementations, by up to 14%. However, I believe your concerns are more likely about the fact that the core of the decomposition methods works with np.arrays, correct?
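(For context on where that overhead comes from, roughly this kind of conversion happens on every call when the core is numpy-based; this is only an illustration, not the PR's actual code.)

```python
import numpy as np
import pandas as pd

x = {"f0": 0.1, "f1": 0.2, "f2": 0.3}

# Per-sample cost: learn_one/transform_one first flatten the incoming dict
# into an ndarray before handing it to the numpy core.
x_arr = np.fromiter(x.values(), dtype=float)

# With learn_many/transform_many the conversion is amortised over a batch.
X = pd.DataFrame([x] * 100)
X_arr = X.to_numpy(dtype=float)
```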

What are your thoughts on the performance and adequacy of the evaluation?

Thanks for your time 🙏