vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io

State transfer issue after applying ensemble models #594

Open jahoy opened 4 years ago

jahoy commented 4 years ago
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from vaex.ml.sklearn import Predictor

features = list(set(df_train.get_column_names()) - {'target'})

lgbm_params = {'colsample_bytree': 0.725,
               'learning_rate': 0.013,
               'num_leaves': 56,
               'reg_alpha': 0.754,
               'reg_lambda': 0.071,
               'subsample': 0.523,
               'n_estimators': 100,
               'n_jobs': 8}
lgbm = LGBMRegressor(**lgbm_params)
vaex_lgbm = Predictor(model=lgbm, features=features, target='target', prediction_name='prediction_lgbm')

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
vaex_rf = Predictor(model=rf, features=features, target='target', prediction_name='prediction_rf')

elasticnet = ElasticNet(alpha=0.01, l1_ratio=0.8, selection='random', max_iter=500)
vaex_elasticnet = Predictor(model=elasticnet, features=features, target='target', prediction_name='prediction_elasticnet')

for model in [vaex_lgbm, vaex_rf, vaex_elasticnet]:
    model.fit(df_train)
    df_train = model.transform(df_train)

After running the code above, I tried this:

df_test.state_set(df_train.state_get())

It always crashes the running session.

I think there is a problem somewhere in transferring the state.

JovanVeljanoski commented 4 years ago

Hi @jahoy

Thanks for the report. Unfortunately this is not very helpful in this form. Can you please provide a small reproducible example, or maybe a subset of the data that you are using? Maybe it has to do with the way you are pre-processing the features. What is the error you are getting?

When I run your code on test data that I have at hand (like the breast cancer dataset from sklearn, for example) I see no problem.
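For reference, the kind of check I mean looks roughly like this (a sketch, not your exact setup; the dataset and names are just an illustration, and the pipeline is shortened to a single model):

import vaex
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestRegressor
from vaex.ml.sklearn import Predictor

data = load_breast_cancer()
pdf = pd.DataFrame(data.data, columns=[c.replace(' ', '_') for c in data.feature_names])
pdf['target'] = data.target
df_train = vaex.from_pandas(pdf, copy_index=False)
df_test = df_train.copy()

features = list(set(df_train.get_column_names()) - {'target'})
rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
vaex_rf = Predictor(model=rf, features=features, target='target', prediction_name='prediction_rf')
vaex_rf.fit(df_train)
df_train = vaex_rf.transform(df_train)

# On a small dataset like this, the state transfer goes through without issues.
df_test.state_set(df_train.state_get())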

jahoy commented 4 years ago

In particular, the random forest does not work.

When I train the models individually, LightGBM and ElasticNet work, but the random forest does not. After training the random forest model, I tried this code: df_test.state_set(df_train.state_get())

Then the session crashes and the kernel dies.

jahoy commented 4 years ago

I think the problem is this line:

df_test.state_set(df_train.state_get())

When I run this code, RAM usage increases a lot and then the session dies.

All of this happens in Colab and Paperspace (Gradient), i.e. cloud Jupyter notebook environments.

JovanVeljanoski commented 4 years ago

Hi,

I can't open the notebook you posted, it asks for an account. Can you please share it elsewhere, like GitHub or Colab?

What could be happening with the random forests: if you build a very big forest (thousands of trees), the model needs to be stored, and the more trees you build, the bigger the model gets. When you do the state transfer, you need to copy the model (well, the whole state, but most of it is small; the model can be large). So maybe you are running out of memory and the kernel crashes.

You could test this out by building a small forest, say 10 trees, and see if that works.
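For example, something along these lines (a minimal sketch reusing the names from your snippet; 'prediction_rf_small' is just an illustrative name):

small_rf = RandomForestRegressor(n_estimators=10, n_jobs=-1)  # only 10 trees
vaex_small_rf = Predictor(model=small_rf, features=features, target='target',
                          prediction_name='prediction_rf_small')
vaex_small_rf.fit(df_train)
df_train = vaex_small_rf.transform(df_train)

# If this transfer works with 10 trees but not with 100, the model size is the likely culprit.
df_test.state_set(df_train.state_get())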

Keep in mind that when using the predictors you wrote above, the data is kept in memory before it is passed to them.

What is the error you are getting?

jahoy commented 4 years ago

It is not an error message. I think when I use df_test.state_set(df_train.state_get()), RAM usage increases sharply and then the session crashes.

I tried RandomForest with n_estimators=10 and that works, but with n_estimators=50 it does not.

I tried this on a machine with 30 GiB of RAM, and it crashed there too, which is weird. I tried many times and the result is always that the RAM fills up and the session crashes. Fitting on the train data is OK, but transferring the state to the test data exhausts the RAM.

The data has just 1 million rows. You can see my work here, but you cannot run the project because the data is from a private project. Anyway, I think you can test with dummy data of 1 million rows and a RandomForest model.

JovanVeljanoski commented 4 years ago

Can you please share a very small subset of the data, or at least the number of features and their dtypes so I can try to make a synthetic dataset?

I have 16 GB of RAM on my local machine and I've never reached anywhere close to memory issues, but maybe the test data I am trying things on is very different.

JovanVeljanoski commented 4 years ago

@jahoy

Ahh, I can see some example data in your Colab notebook. I will see if I can reproduce your issue locally with RF and get back to you.

Cheers!

jahoy commented 4 years ago

Thank you for your support! What I found now is that after fitting on the train data, transferring the state from the train data to the test data causes the RAM issue, but just using the transform function does not have a memory issue.

vaex_rf.fit(df_train)
df_test = vaex_rf.transform(df_test)

The code above does not have a memory issue,

but the code below has a memory issue when I use the Random Forest model.

df_test.state_set(df_train.state_get())

I think the state_set() and state_get() functions use a lot of memory.

Thank you for your reply and support

JovanVeljanoski commented 4 years ago

Hi, can you tell me how much RAM your machine is using just prior to running this line?

df_test.state_set(df_train.state_get())

JovanVeljanoski commented 4 years ago

Also, if you have access to the disk of your machine, can you please try df_train.state_write(f='train_state.json')

This will save the state to disk in a JSON format. If you can do that, can you please tell me how much disk space it consumes?
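For example, something like this (a small sketch; checking the file size with os.path.getsize is just one option):

import os

df_train.state_write(f='train_state.json')   # writes the full state (including the model) as JSON
size_mib = os.path.getsize('train_state.json') / 1024**2
print(f'state file size: {size_mib:.1f} MiB')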

Thanks

jahoy commented 4 years ago

I cannot tell exactly how much RAM was in use before df_test.state_set(df_train.state_get()), but I know I had used only about 1/10 of the total RAM, so plenty of RAM was available. After that code, RAM usage shoots up and the session crashes.

Also, I tried df_train.state_write(f='train_state.json') too; it also exhausted the RAM. So I think the state transfer and state_write functions use a lot of memory. Thanks for your support.

JovanVeljanoski commented 4 years ago

Ok, I will try to reproduce this issue based on the data you provided and let you know.

jahoy commented 4 years ago

Thank you very much! I hope this issue can be solved quickly. Thanks

JovanVeljanoski commented 4 years ago

@jahoy

I am able to reproduce the issue you are facing using mock data based on your example. I think I understand where the problem is coming from: it has to do with the size of the model. I will discuss with @maartenbreddels how, or if, we can improve this. I hope to get back to you soon.

If this is an urgent issue, please contact us at contact@vaex.io. Jovan.

jahoy commented 4 years ago

Yes, I think your team could try to use joblib or something similar; as far as I know, the JSON gets big. I hope your team can solve this problem nicely. Thank you for your support!

maartenbreddels commented 4 years ago

Shortly discussed this with Jovan, and it seems the JSON indeed becomes too big. I'll wait for him to make a reproducible example for me. I had anticipated this, and we now have a framework inside vaex for dealing with JSON + binary blobs that the remote dataframe/server works with. This should be really easy to adapt to either:

Having the .vaex file uncompressed (or the directory solution) should make it possible to memory map the data.

JovanVeljanoski commented 4 years ago

Here is a reproducible example, based on the example provided by @jahoy :

import vaex
import vaex.ml
from vaex.ml.sklearn import Predictor
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Create mock data
n_samples = 1_000_000
n_features = 35
n_informative = 10

X, y = make_regression(n_samples=n_samples, 
                       n_features=n_features, 
                       n_informative=n_informative, 
                       bias=0.01, 
                       noise=1.5, 
                       random_state=42)

pandas_df = pd.DataFrame(data=X, columns=['feat'+ str(i) for i in range(X.shape[1])])
pandas_df['target'] = y
df = vaex.from_pandas(pandas_df, copy_index=False)

# Train test split
df_train, df_test = df.ml.train_test_split(test_size=0.2, verbose=False)

# Selecting features and target
features = list(set(df_train.get_column_names()) - {'target'})
target = 'target'

# Instantiate and fit the model
rf = RandomForestRegressor(n_estimators=55, n_jobs=-1)
vaex_rf = Predictor(model=rf, features=features, target=target, prediction_name='prediction_rf')
vaex_rf.fit(df_train)
df_train = vaex_rf.transform(df_train)

# Do the state transfer
state = df_train.state_get()   # ---> This is where the memory usage spikes <---
df_test.state_set(state)

Note that in this example, much like in the example provided by @jahoy, the random forest model does not limit the depth of the trees, which can result in very deep trees for some tasks. In reality you probably do want to set the max_depth parameter to handle over-fitting, and I suspect that will greatly reduce this problem. However, sometimes deep trees are necessary, so it would be nice if we could handle these cases.
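As a rough illustration of the effect (a sketch reusing the variables from the example above; max_depth=7 is just an arbitrary cap):

import pickle

# Same setup as above, but with a capped tree depth.
rf_shallow = RandomForestRegressor(n_estimators=55, n_jobs=-1, max_depth=7)
vaex_rf_shallow = Predictor(model=rf_shallow, features=features, target=target,
                            prediction_name='prediction_rf_shallow')
vaex_rf_shallow.fit(df_train)

# Compare serialized model sizes; the unconstrained forest is far larger.
print(len(pickle.dumps(rf)) // 1024**2, 'MiB (no max_depth)')
print(len(pickle.dumps(rf_shallow)) // 1024**2, 'MiB (max_depth=7)')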

JovanVeljanoski commented 4 years ago

Indeed, I just tested the above code again with rf = RandomForestRegressor(n_estimators=55, n_jobs=-1, max_depth=7) and the state transfer is instant, without any problems whatsoever.

So indeed, the depth of the trees is taking up a lot of space.

maartenbreddels commented 4 years ago

I can confirm it's an issue, the pickled model is >3 GB:

>>> import pickle
>>> s = pickle.dumps(rf)
>>> len(s)//1024**2
3395

Thanks, I'll think about a solution.

maartenbreddels commented 4 years ago

joblib does not help btw:

from joblib import dump
import io
f = io.BytesIO()
dump(rf, f)
f.tell() // 1024**2
3395

melgazar9 commented 2 years ago

I think this is somewhat related, but I believe this is a very interesting example that the vaex team can fix. The following code example hangs indefinitely and crashes the working session. This happens whether I'm in Jupyter or running a Python script from the terminal. I don't think this is 100% due to the size of the dataframe, because I reduced the size significantly to 10k rows for each of the train, val, and test sets.

# numeric_features, lc_features, hc_features, final_features, df_test_orig and df_preds
# are defined elsewhere in my script.
import numpy as np
import pandas as pd
import vaex
import vaex.ml
from vaex.ml.xgboost import XGBoostModel
from vaex.ml.lightgbm import LightGBMModel

df = <big dataframe with 250 columns X 6M rows>
df_train, df_val = df.ml.train_test_split(test_size=0.2)
df_val, df_test = df_val.ml.train_test_split(test_size=0.5)

df_train = df_train.head(10000)
df_val = df_val.head(10000)
df_test = df_test.head(10000)

scaler = vaex.ml.MinMaxScaler(features=numeric_features, prefix='')
oh_encoder = vaex.ml.OneHotEncoder(features=lc_features, prefix='')
mean_encoder = vaex.ml.BayesianTargetEncoder(features=hc_features, prefix='', target='target')

df_train = scaler.fit_transform(df_train)
df_train = oh_encoder.fit_transform(df_train)
df_train = mean_encoder.fit_transform(df_train)

state = df_train.state_get()
df_val = df_val.state_set(state)
df_test = df_test.state_set(state)

xgb_params = {'learning_rate': 0.05,
              'max_depth': 6,
              'objective': 'binary:logistic',
              'subsample': 0.8,
              'random_state': 0,
              'n_jobs': -1}

xgb = XGBoostModel(features=final_features,
                   target='target',
                   num_boost_round=250,
                   params=xgb_params)

### Xgboost ###

xgb.fit(df=df_train, evals=[(df_train, 'train'), (df_val, 'val')], early_stopping_rounds=5)

lgbmc_params = dict(
    max_depth=6,
    num_leaves=25,
    subsample=0.8,
    colsample_bytree=0.9,
    learning_rate=0.03)

### LightGBM ###
lgbmc = LightGBMModel(features=final_features,
                      target='target',
                      num_boost_round=350,
                      params=lgbmc_params)

lgbmc.fit(df=df_train, valid_sets=[df_train, df_val], valid_names=['train', 'val'], early_stopping_rounds=5)

models = [xgb, lgbmc]

for model in models:
    df_train = model.transform(df_train)
    df_val = model.transform(df_val)
    df_test = model.transform(df_test)
    df_test_orig = model.transform(df_test_orig)

    new_prediction = df_train.get_column_names()[-1]
    if isinstance(df_train[new_prediction].values[0], np.ndarray):
        df_train[new_prediction] = df_train[new_prediction][:, 1]
        df_val[new_prediction] = df_val[new_prediction][:, 1]
        df_test[new_prediction] = df_test[new_prediction][:, 1]
        df_test_orig[new_prediction] = df_test_orig[new_prediction][:, 1]

pred_cols = ['xgboost_prediction', 'lightgbm_prediction']

dummy_pred = df_train['target'].mean()

df_pred_train = pd.DataFrame(df_train['target'].to_pandas_series(), columns=['target'])
df_pred_train.loc[:, 'dummy'] = dummy_pred
df_pred_val = pd.DataFrame(df_val['target'].to_pandas_series(), columns=['target'])
df_pred_val.loc[:, 'dummy'] = dummy_pred
df_pred_test = pd.DataFrame(df_test['target'].to_pandas_series(), columns=['target'])
df_pred_test.loc[:, 'dummy'] = dummy_pred

for col in pred_cols:

    df_pred_train.loc[:, col] = df_train[col].to_pandas_series()
    df_pred_val.loc[:, col] = df_val[col].to_pandas_series()
    df_pred_test.loc[:, col] = df_test[col].to_pandas_series()

df_pred_train['dataset_split'] = 'train'
df_pred_val['dataset_split'] = 'val'
df_pred_test['dataset_split'] = 'test'

df_pred = pd.concat([df_pred_train, df_pred_val, df_pred_test])

keep_cols = ['id'] + [i for i in df_test.get_column_names() if i.endswith('prediction')]

df_to_save = df_test[keep_cols]

# Note: df_to_save is a valid vaex dataframe. When I print the df I can see all the rows and columns with values; 0 of them are NA

######################################################
### All of the below code snippets fail except one ###
######################################################

# fails / hangs indefinitely
df_to_save.export('output.hdf5')

# fails / hangs indefinitely
df_to_save.export_hdf5('output.hdf5')

# fails / hangs indefinitely
df_to_save.export_csv('output.csv')

# fails / hangs indefinitely
df_to_save = df_preds.to_pandas_df()

# passes
df_to_save['id'] = df_preds['id'].to_pandas_series()

# fails / hangs indefinitely
df_to_save['xgboost_prediction'] = df_preds['xgboost_prediction'].to_pandas_series()

JovanVeljanoski commented 2 years ago

@melgazar9 can you provide some sample data (it can be fake / generated, etc.)? We can't run the code like this.
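Even something synthetic would help, e.g. along these lines (a sketch; the column names, dtypes and cardinalities are placeholders, not your actual schema):

import numpy as np
import pandas as pd
import vaex

n = 10_000
rng = np.random.default_rng(42)
pdf = pd.DataFrame({
    'num_0': rng.normal(size=n),                            # numeric feature
    'num_1': rng.normal(size=n),
    'cat_low': rng.choice(['a', 'b', 'c'], size=n),         # low-cardinality categorical
    'cat_high': rng.integers(0, 500, size=n).astype(str),   # high-cardinality categorical
    'target': rng.integers(0, 2, size=n),                   # binary target
})
df = vaex.from_pandas(pdf, copy_index=False)
df.export_hdf5('sample.hdf5')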