haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io

Unsupported Operation for SVR update with a mini-batch of new samples #777

Closed. Nicol273 closed this issue 2 months ago.

Nicol273 commented 3 months ago

First, I want to thank you for creating and maintaining this library. It has been quite useful to me.

However, I'm currently facing an issue with a missing method. I understand that the update method may not be multi-thread safe, but I'm still struggling with this limitation, particularly because I'm working with a large dataset and new samples arrive weekly:

public interface Regression<T> ... {
    /**
     * Online update the regression model with a new training instance.
     * In general, this method may be NOT multi-thread safe.
     */
    default void update(T x, double y) {
        throw new UnsupportedOperationException();
    }
}
Could you please provide some guidance on how to handle this? Thank you in advance for your help and support.
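
For concreteness, here is a minimal sketch of the call that fails. The SVR.fit signature and the hyperparameter values are assumptions based on my reading of the Smile 2.x docs, not verified against any particular version:

```java
import smile.math.kernel.GaussianKernel;
import smile.regression.Regression;
import smile.regression.SVR;

public class SvrUpdateRepro {
    public static void main(String[] args) {
        double[][] x = {{0.0}, {1.0}, {2.0}, {3.0}};
        double[] y = {0.1, 1.1, 1.9, 3.2};

        // Fit a batch SVR model (kernel width, eps, C, tol are placeholder values).
        Regression<double[]> model = SVR.fit(x, y, new GaussianKernel(1.0), 0.1, 1.0, 1E-3);

        // SVR does not override Regression.update, so the default implementation throws.
        model.update(new double[]{4.0}, 4.1);  // -> UnsupportedOperationException
    }
}
```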

haifengl commented 3 months ago

SVR is a batch algorithm, which doesn't support online learning.
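
Since online updates aren't supported, one pragmatic pattern is to keep the raw samples and refit from scratch on a schedule, so a weekly mini-batch becomes "append, then retrain". A sketch of that idea, again assuming the SVR.fit signature above, with placeholder hyperparameters:

```java
import java.util.ArrayList;
import java.util.List;
import smile.math.kernel.GaussianKernel;
import smile.regression.Regression;
import smile.regression.SVR;

// Accumulates samples and refits the batch SVR model periodically.
class WeeklyRetrainer {
    private final List<double[]> xs = new ArrayList<>();
    private final List<Double> ys = new ArrayList<>();
    private Regression<double[]> model;

    // Append a week's mini-batch to the training set.
    void append(double[][] batchX, double[] batchY) {
        for (int i = 0; i < batchX.length; i++) {
            xs.add(batchX[i]);
            ys.add(batchY[i]);
        }
    }

    // Retrain on everything accumulated so far.
    void retrain() {
        double[][] x = xs.toArray(new double[0][]);
        double[] y = ys.stream().mapToDouble(Double::doubleValue).toArray();
        model = SVR.fit(x, y, new GaussianKernel(1.0), 0.1, 1.0, 1E-3);
    }

    double predict(double[] point) {
        return model.predict(point);
    }
}
```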

Nicol273 commented 3 months ago

I was asking about it because of this research paper: Accurate On-line Support Vector Regression. Is it possible to combine the approach from the paper with the SVR algorithm that the SMILE library provides?

haifengl commented 3 months ago

Thanks for the reference. SVM/SVR works best for small datasets with a large number of features. Otherwise, it is better to use other algorithms. If your data keeps growing, SVR won't be able to handle it for long, online learning or not, because the number of support vectors tends to grow linearly with the training set size, so inference speed degrades accordingly.
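
To make the inference cost concrete: a trained kernel machine predicts with f(x) = b + Σᵢ αᵢ K(xᵢ, x), summing over every support vector, so each prediction costs one kernel evaluation per support vector. A generic sketch of that expansion (not Smile's internal code):

```java
// Generic kernel-machine prediction: one kernel evaluation per support vector,
// so inference time grows linearly with the number of support vectors.
final class KernelExpansion {
    private final double[][] supportVectors;  // the retained training points x_i
    private final double[] alpha;             // learned weights a_i
    private final double b;                   // intercept
    private final double sigma;               // RBF kernel width

    KernelExpansion(double[][] supportVectors, double[] alpha, double b, double sigma) {
        this.supportVectors = supportVectors;
        this.alpha = alpha;
        this.b = b;
        this.sigma = sigma;
    }

    // Gaussian (RBF) kernel as an example: K(u, v) = exp(-|u - v|^2 / (2 sigma^2)).
    private double k(double[] u, double[] v) {
        double d2 = 0.0;
        for (int i = 0; i < u.length; i++) {
            double d = u[i] - v[i];
            d2 += d * d;
        }
        return Math.exp(-d2 / (2 * sigma * sigma));
    }

    double predict(double[] x) {
        double f = b;
        for (int i = 0; i < supportVectors.length; i++) {
            f += alpha[i] * k(supportVectors[i], x);  // cost: O(#SV) kernel calls
        }
        return f;
    }
}
```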

Nicol273 commented 3 months ago

Thanks, I found a way to solve my problem, but I'm curious how many samples approximately are considered a small dataset?

haifengl commented 3 months ago

"small" or "big" is subjective and the preception changes over time. To me, SVM/SVR is suitable for sample size at order of tens of thousands, sometimes may be 100,000. It struggles beyond that.

How did you solve your problem?

Nicol273 commented 3 months ago

The data exhibits seasonal trends, so I divided it into 3-month segments. Instead of training one model on data from the entire year, I trained a separate model for each season, using data from the same season in the past two years. In the worst-case scenario, a single model sees around 50,000 samples.
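
For reference, that per-season setup can be as simple as bucketing samples by calendar quarter and fitting one model per bucket. A sketch under the same assumed SVR.fit signature (class and method names here are illustrative, not the actual code):

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import smile.math.kernel.GaussianKernel;
import smile.regression.Regression;
import smile.regression.SVR;

// One SVR model per season: samples are routed by calendar quarter.
class SeasonalModels {
    private final List<List<double[]>> xs = new ArrayList<>();
    private final List<List<Double>> ys = new ArrayList<>();
    private final List<Regression<double[]>> models = new ArrayList<>();

    SeasonalModels() {
        for (int q = 0; q < 4; q++) {
            xs.add(new ArrayList<>());
            ys.add(new ArrayList<>());
            models.add(null);
        }
    }

    // Jan-Mar -> 0, Apr-Jun -> 1, Jul-Sep -> 2, Oct-Dec -> 3.
    private static int quarter(LocalDate date) {
        return (date.getMonthValue() - 1) / 3;
    }

    // Collect a sample into its season's bucket (e.g. the past two years of data).
    void add(LocalDate date, double[] features, double target) {
        int q = quarter(date);
        xs.get(q).add(features);
        ys.get(q).add(target);
    }

    // Train one SVR per season on that season's samples only.
    void fitAll() {
        for (int q = 0; q < 4; q++) {
            double[][] x = xs.get(q).toArray(new double[0][]);
            double[] y = ys.get(q).stream().mapToDouble(Double::doubleValue).toArray();
            models.set(q, SVR.fit(x, y, new GaussianKernel(1.0), 0.1, 1.0, 1E-3));
        }
    }

    double predict(LocalDate date, double[] features) {
        return models.get(quarter(date)).predict(features);
    }
}
```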

haifengl commented 2 months ago

Thanks for sharing!