online-ml / river

🌊 Online machine learning in Python
https://riverml.xyz
BSD 3-Clause "New" or "Revised" License

How to save a model to disk? #1011

Closed shelovemee closed 2 years ago

shelovemee commented 2 years ago

Hi,

Thanks for this great library. Is there any way to save a model to disk? And is there any way to use a river model in C++?

smastelini commented 2 years ago

Hi @shelovemee, please check Discussion #464 for an answer about saving models.

Unfortunately, we don't have a wrapper from Python to C++.
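
For reference, since River models are regular Python objects, one common approach (a minimal sketch, not the only option) is to pickle them:

import pickle

# Persist the trained model to disk.
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Later, load it back and keep learning or predicting.
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)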

shelovemee commented 2 years ago

Hi @smastelini

Thanks for your quick reply. I have one more question. I trained a model with river on 150,000 rows of data, and after training I predict another 10 rows (not in the training set). But the predictions are slightly different every time: maybe 1 or 2 of them change from run to run. How can I solve this problem?

kulbachcedric commented 2 years ago

Hi @shelovemee,

Do you have a minimal, reproducible example of the pipeline you are using?

Best Cedric

MaxHalford commented 2 years ago

Hey there.

You're probably getting bitten by the fact that unsupervised parts of your model are updated when calling predict_one. This isn't a bug, it's a feature. There's a context manager that allows you to override this behavior if that's what you want, see here.
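
For illustration, here is a minimal sketch of the context manager (assuming a recent River version and a simple scaler + logistic regression pipeline):

from river import compose, linear_model, preprocessing

model = compose.Pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression(),
)

# Inside this block, predict_one does not update the unsupervised
# parts of the pipeline (here, the StandardScaler).
with compose.pure_inference_mode():
    y_pred = model.predict_one({'feature': 1.0})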

smastelini commented 2 years ago

Thank you guys! I was heading here to ask for a minimal example :)

Good catch!

shelovemee commented 2 years ago

1. Thanks for your reply. I tried pure_inference_mode, but the AUC dropped from 0.94 to 0.50. When predicting another 10 rows, the results are all wrong. Am I doing something wrong? Here is my code snippet:
model = compose.Pipeline(preprocessing.StandardScaler(), linear_model.LogisticRegression())
metric = metrics.ROCAUC()
class_condition = lambda x: x.__class__.__name__ in ('StandardScaler', 'LogisticRegression')

with utils.log_method_calls(class_condition), compose.pipeline.pure_inference_mode():
    for index, row in df.iterrows():
        x = dict(zip(ex, row.tolist()[:-1]))
        y_pred = model.predict_one(x)
        cy = row['TYPE']
        metric = metric.update(cy, y_pred)
        model = model.learn_one(x, cy)

pprint(metric)

There is a mistake in https://riverml.xyz/dev/api/compose/pure-inference-mode/, you should change "compose.pure_inference_mode" to "compose.pipeline.pure_inference_mode()".

2. The purpose of using river is to let the model learn from new data as soon as possible, to improve accuracy. I have tried many classifiers in river (ensemble, linear_model), and the AUC is always around 0.94, but when I use XGBoost the AUC can easily reach above 0.99. I know that river's advantage is learning from new data, but do I have to trade off accuracy on old data against accuracy on new data? My company is currently using XGBoost, and incremental training in XGBoost is cumbersome and time-consuming, which is why I'm looking for other libraries that could replace it. Do you know of any examples of using river in a production environment? I didn't find a usage section on the official website (riverml.xyz). Thank you.

MaxHalford commented 2 years ago

Thanks for your reply. I tried pure_inference_mode, but the AUC dropped from 0.94 to 0.50. When predicting another 10 rows, the results are all wrong. Am I doing something wrong?

Yes, you are. You should use pure_inference_mode just for the prediction part, not the learning part. It's just there for making sure ad-hoc predictions don't update your model. But you shouldn't use it in the training loop. Indeed you still want to update your unsupervised estimators in the training loop.

There is a mistake in https://riverml.xyz/dev/api/compose/pure-inference-mode/, you should change "compose.pure_inference_mode" to "compose.pipeline.pure_inference_mode()".

What makes you say that? The following works fine for me:

from river import compose

compose.pure_inference_mode

I have tried many classifiers in river (ensemble, linear_model), and the AUC is always around 0.94, but when I use XGBoost the AUC can easily reach above 0.99.

How are you measuring and comparing both setups? Are you doing progressive validation in both cases? It's important for me to understand if you're comparing apples to apples. Also, have you tried the tree-based and nearest neighbors models in River?
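
For what it's worth, progressive validation can be done with the evaluate module; here is a minimal sketch on the built-in Phishing dataset (not your data):

from river import datasets, evaluate, linear_model, metrics, preprocessing

model = preprocessing.StandardScaler() | linear_model.LogisticRegression()
metric = metrics.ROCAUC()

# Each sample is used to make a prediction before the model learns from
# it, so the score reflects out-of-sample performance on the stream.
evaluate.progressive_val_score(datasets.Phishing(), model, metric)
print(metric)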

do I have to trade off accuracy on old data against accuracy on new data?

Indeed an online model will usually be better at predicting recent data rather than old data. Is that not ok in your use-case?

Do you know of any examples of using river in a production environment?

This is a different question, correct? Are you asking how to deploy an online model? Or is your question still around performance?

shelovemee commented 2 years ago

Hi @MaxHalford

1. I installed river via "pip install river". The version was 0.11, and it told me "module 'river.compose' has no attribute 'pure_inference_mode'". Today I updated river manually to 0.12.1, and now the error is gone.

2. I still have the prediction problem; here is my code:

import pandas as pd

ex = [.........]
df = pd.read_csv('...csv', index_col=False, engine='python', encoding='utf-8')
model = compose.Pipeline(preprocessing.StandardScaler(), linear_model.LogisticRegression())
metric = metrics.ROCAUC()
class_condition = lambda x: x.__class__.__name__ in ('StandardScaler', 'LogisticRegression')

for index, row in df.iterrows():
    x = dict(zip(ex, row.tolist()[:-1]))
    y_pred = model.predict_one(x)
    cy = row['TYPE']
    metric = metric.update(cy, y_pred)
    model = model.learn_one(x, cy)

pprint(metric)

Now I predict another 10 rows:

list1 = [......]
list2 = [......]
list3 = [......]
...

with utils.log_method_calls(class_condition), compose.pure_inference_mode():
    print('Predict list1.....')
    x = dict(zip(ex, list1))
    print(model.predict_one(x))
    print('Predict list2.....')
    x = dict(zip(ex, list2))
    print(model.predict_one(x))
    print('Predict list3.....')
    x = dict(zip(ex, list3))
    print(model.predict_one(x))
    ...................

The predictions are different after every training run. For example, the first time I got "true true true true true ....", and running the script again I got "true true true false true ....". Every time there are 1 or 2 differences.

3. I tried four classifiers in river:

model = ensemble.ADWINBaggingClassifier(model=(preprocessing.StandardScaler() | linear_model.LogisticRegression()), n_models=3, seed=42)
model = ensemble.AdaBoostClassifier(model=(tree.HoeffdingTreeClassifier(split_criterion='gini', split_confidence=1e-5, grace_period=2000)), n_models=5, seed=42)
model = compose.Pipeline(preprocessing.StandardScaler(), linear_model.LogisticRegression())
model = compose.Pipeline(preprocessing.StandardScaler(), neighbors.KNNClassifier(window_size=50))

LogisticRegression gave me the best AUC (0.94), and it also performs best of these four classifiers when predicting the real data.

I tried XGBoost with the same data,

param = {'booster': 'gbtree', 'max_depth': 20, 'eta': 0.03, 'silent': 1, 'tree_method': 'gpu_hist', 'objective': 'binary:logistic'}
param['nthread'] = 4
param['eval_metric'] = 'auc'
num_round = 1000
bst = xgb.train(param, xgtrain, num_round, watchlist, early_stopping_rounds=3)

It gave me an AUC of 0.993 when it stopped, and when predicting the real data XGBoost also performs better than river. Maybe I need to tune the hyperparameters for river?

4. Sorry, I didn't make myself clear. I wasn't asking how to deploy river in a production environment; I was wondering whether anyone is actually using river to replace XGBoost, TensorFlow, LightGBM, etc. I would like to know if it is possible to take advantage of river's convenience while keeping the accuracy of my previous ML models. Thanks for your patience.

MaxHalford commented 2 years ago

I installed river via "pip install river". The version was 0.11, and it told me "module 'river.compose' has no attribute 'pure_inference_mode'". Today I updated river manually to 0.12.1, and now the error is gone.

Ok weird, but glad to know the error has gone.

I still have the prediction problem; here is my code:

We're drifting away from the initial problem. Can you please provide a reproducible example? Right now I can't copy/paste your code to reproduce your error.

I tried XGBoost with the same data,

What I don't get is exactly how you're comparing the models, because some are online and the other is batch. Are you doing cross-validation?

Sorry, I didn't make myself clear. I wasn't asking how to deploy river in a production environment; I was wondering whether anyone is actually using river to replace XGBoost, TensorFlow, LightGBM, etc. I would like to know if it is possible to take advantage of river's convenience while keeping the accuracy of my previous ML models. Thanks for your patience.

That's a bit of a large question. There are certainly people doing online learning (with River or not) because they want models that adapt faster. Batch models are almost always more accurate than online models, but that's only true if you assume both models have been trained on a static dataset. If the dataset is streaming, then the online model can keep learning, and thus may outperform the batch model.

smastelini commented 2 years ago

Hi @shelovemee, allow me to throw my two cents in.

Firstly, you don't need to keep the log_method_calls, this is just for debugging purposes:

from river import compose
from river import linear_model
from river import metrics
from river import preprocessing

model = preprocessing.StandardScaler() | linear_model.LogisticRegression()
training_data = ...  # training dataset goes here

with compose.warm_up_mode():
    for x, y in training_data:
        model.learn_one(x, y)

# Testing
testing_data = ...  # testing data goes here

metric = ...  # metric goes here
with compose.pure_inference_mode():
    for x, y in testing_data:
        y_pred = model.predict_one(x)
        metric.update(y, y_pred)

Regarding your models, I noticed that you set the grace_period to a high value when using trees. Depending on the size of your training data, that is not a good idea. The trees will take too long to split, and the accuracy will be subpar.
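
For instance, something along these lines (just a sketch; 200 is the library's default grace_period, as opposed to the 2000 used above):

from river import tree

# A smaller grace period lets the tree attempt splits more often, which
# usually helps when the training stream is not very large.
model = tree.HoeffdingTreeClassifier(
    split_criterion='gini',
    grace_period=200,
)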

On the other hand, if you are using batch data and comparing incremental algorithms against batch machine learning models, the batch algorithm will probably be the most accurate one. It's just how things are: XGBoost can do multiple passes over the data, whereas the incremental models only process each datum once.

shelovemee commented 2 years ago

Thanks @MaxHalford and @smastelini

I found that if I shuffle the Pandas dataframe rows, the predictions are different every time.

from sklearn.utils import shuffle
import pandas as pd
dt = pd.read_csv(........)
dt = shuffle(dt)

model = preprocessing.StandardScaler() | linear_model.LogisticRegression()

with compose.warm_up_mode():
    for index, row in df.iterrows():
        x = dict(zip(column, row.tolist()[:-1]))
        model = model.learn_one(x, cy)
.......
metric = metrics.ROCAUC()
with compose.pure_inference_mode():
    x = dict(zip(column, list1))
    y_pred = model.predict_one(x)
    metric = metric.update(correcty, y_pred)
    print(y_pred)

pprint(metric)

If I comment out "dt = shuffle(dt)", the predictions are the same every time, but I get a lower AUC. Shuffling gives me a better AUC but results in inconsistent predictions. :(

I also tried the builtin dataset. It is strange that no matter how I change the order of the training data, or even the amount of training data, the AUC is always 58.33%.

from river import datasets
from river import compose
from river import linear_model
from river import metrics
from river import preprocessing
from pprint import pprint
from itertools import islice

dataset = datasets.Phishing()
model = preprocessing.StandardScaler() | linear_model.LogisticRegression()

with compose.warm_up_mode():
    for x, y in dataset:
        model = model.learn_one(x, y)

metric = metrics.ROCAUC()
with compose.pure_inference_mode():
    #predict last 10 rows
    for x, y in islice(dataset, 1240, None):
        y_pred = model.predict_one(x)
        metric.update(y, y_pred)
        print(y_pred)
pprint(metric)

If I change

for x, y in dataset:
    model = model.learn_one(x, y)

to

for x, y in islice(dataset, 400, 1240):
    model = model.learn_one(x, y)
for x, y in islice(dataset, 200):
    model = model.learn_one(x, y)
for x, y in islice(dataset, 200, 400):
    model = model.learn_one(x, y)

the prediction is always the same. Even if I only train the model with the first 200 rows, the results are still consistent. Am I still wrong somewhere? Thanks.

smastelini commented 2 years ago

Hi @shelovemee, I can only give my best guess since you provided only partial snippets.

When dealing with data streams, we almost always think big. We expect that data might be unbounded. So, after processing tons of data, the order does not matter much. But if you're using small training data, it is expected that changing the order in which the instances arrive might slightly impact the performance.

Besides, it is not clear if the test data is always fixed in your example. If that is not the case, we expect the performance to change.

Also, River is primarily meant to update one instance at a time. In your examples, you train your model on a fixed dataset and then use the model to predict a testing batch. Is the model going to be updated after that? If that is not the case, then a traditional batch model is your best bet.

MaxHalford commented 2 years ago

I agree with what @smastelini says. Shuffling the data defeats the purpose: you shouldn't be doing it in the first place. The order of your dataset matters: it's supposed to reflect the order in which the data arrived. Also, it seems you're not doing progressive validation, which is the standard way of evaluating online models.
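
To make that concrete, here is a sketch of your Phishing example rewritten as progressive validation (predict first, then learn, for every sample, instead of a separate warm-up and test phase):

from river import datasets, linear_model, metrics, preprocessing

model = preprocessing.StandardScaler() | linear_model.LogisticRegression()
metric = metrics.ROCAUC()

# Progressive validation: each sample is scored before it is learned.
for x, y in datasets.Phishing():
    y_pred = model.predict_proba_one(x)
    metric.update(y, y_pred)
    model.learn_one(x, y)

print(metric)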

shelovemee commented 2 years ago

@smastelini @MaxHalford Thanks for your explanations. I think I need to read https://maxhalford.github.io/blog/online-learning-evaluation/ first. :)

MaxHalford commented 2 years ago

If you don't mind we'll close this issue because it's drifting in different directions. Feel free to open a discussion about a particular topic if necessary.