microsoft / FLAML

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.
https://microsoft.github.io/FLAML/

Larger dimensionality sparse array inputs cause many FLAML use cases to fail #798

Closed glmcdona closed 1 year ago

glmcdona commented 2 years ago

Classification and regression models often leverage sparse arrays. For example, when composing a scikit-learn pipeline, sparse arrays are produced as the output of categorical one-hot transforms and n-gram transforms such as CountVectorizer and TfidfVectorizer.

These sparse arrays are passed into the learners (e.g. LightGBM, XGBoost, LogisticRegression), which are designed to handle sparse input efficiently, allowing high-dimensionality classifiers to be trained.

The fix, I think, is to ensure we keep arrays in their sparse formats (e.g. scipy.sparse.csr_matrix), at least when passing them into the underlying learner, and preferably throughout FLAML's data handling as well.
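As a rough sketch of what I mean (prepare_learner_input is a hypothetical helper, not FLAML's actual API), the data handling could guard against densification like this:

```python
import numpy as np
from scipy.sparse import issparse, csr_matrix

def prepare_learner_input(X):
    # Hypothetical guard: keep sparse inputs in CSR form instead of densifying them.
    # LightGBM, XGBoost, and scikit-learn's LogisticRegression all accept CSR directly.
    if issparse(X):
        return csr_matrix(X)
    return np.asarray(X)
```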

Here are a few repro examples that illustrate the issue. Note that, in general, any example using OneHotEncoder, CountVectorizer, or TfidfVectorizer will reproduce it similarly, and these transforms are common:

Example 1 - LightGBM regression repro. A couple of notes:
* The FLAML execution appears to densify the sparse matrix output from the ColumnTransformer (taking up 5 GB of RAM).
* It then gets stuck, failing to fit the learner.

```python
import numpy as np
import pandas as pd

from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score

from flaml import AutoML

from lightgbm import LGBMRegressor

# Create an example dataframe with random integers and a target column
df = pd.DataFrame(np.random.randint(0, 1000000, size=(500000, 2)), columns=['A', 'B'])
df['target'] = ((df['A'] + df['B']) > 1000000).astype(int)
print(df.head(10))

# Print dataset statistics
print("Dataset statistics:")
print(f"Number of rows: {len(df)}")
print(f"Number of unique categorical features: {df['A'].nunique() + df['B'].nunique()}")

# Build a sklearn pipeline that one-hot encodes the integers

# Create the pipeline with a LightGBM regression model
pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['A', 'B'])
    ])),
    ('classifier', LGBMRegressor())
])

# Fit the model while timing the process
import time
start = time.time()
pipeline.fit(df[['A', 'B']], df['target'])
end = time.time()

# Print the metrics
print("LightGBM")
print("MAE:", np.mean(np.abs(pipeline.predict(df[['A', 'B']]) - df['target'])))
print("Number of parameters:", pipeline['classifier'].n_features_)
print("Fitting took {} seconds".format(end - start))

# Now fit a FLAML model instead with LightGBM and the same data
pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['A', 'B'])
    ])),
    ('classifier', AutoML(
        task='regression',
        estimator_list=['lgbm'],
    ))
])
start = time.time()
pipeline.fit(df[['A', 'B']], df['target'])
end = time.time()

# Print the metrics
print("FLAML")
print("MAE:", np.mean(np.abs(pipeline.predict(df[['A', 'B']]) - df['target'])))
print("Fitting took {} seconds".format(end - start))
```


Output:
    A       B  target
0  392214  547064       0
1  733104   70527       0
2  435315  879075       1
3    3221  115858       0
4  507700  113589       0
5  746200  312285       1
6  671553  164952       0
7  486703  540223       1
8  540288  793887       1
9  201758  563042       0
Dataset statistics:
Number of rows: 500000
Number of unique categorical features: 786525
LightGBM
MAE: 0.499999640448
Number of parameters: 786525
Fitting took 3.1742513179779053 seconds
[flaml.automl: 11-04 21:18:01] {2600} INFO - task = regression
[flaml.automl: 11-04 21:18:01] {2602} INFO - Data split method: uniform
[flaml.automl: 11-04 21:18:01] {2605} INFO - Evaluation method: holdout
[flaml.automl: 11-04 21:18:01] {2727} INFO - Minimizing error metric: 1-r2
[flaml.automl: 11-04 21:18:01] {2777} WARNING - No search budget is provided via time_budget or max_iter. Training only one model per estimator. To tune hyperparameters for each estimator, please provide budget either via time_budget or max_iter.
[flaml.automl: 11-04 21:18:01] {2869} INFO - List of ML learners in AutoML Run: ['lgbm']
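As a side note (not part of the original repro), the sparsity of the preprocessor output can be checked directly before it is handed to the learner:

```python
# Check (outside the pipeline) that the one-hot encoding really produces a sparse CSR matrix.
X_enc = pipeline['preprocessor'].fit_transform(df[['A', 'B']])
print(type(X_enc))             # expected: a scipy.sparse CSR matrix
print(X_enc.shape, X_enc.nnz)  # 500000 rows, ~786525 columns, 2 non-zeros per row
```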


Example 2 - FLAML binary classification without a learner specified (note: binary classification works when the estimator list is limited to lgbm)
* Without FLAML it takes 3.2 seconds to fit.
* With FLAML it hangs, but it does NOT take up a large amount of RAM right away. Maybe an iteration issue rather than a dense-conversion issue, at least at first? (A diagnostic sketch follows the output below.)

```python
import numpy as np
import pandas as pd

from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score

from flaml import AutoML

from lightgbm import LGBMRegressor, LGBMClassifier

# Create an example dataframe with random integers and a target column
df = pd.DataFrame(np.random.randint(0, 1000000, size=(500000, 2)), columns=['A', 'B'])
df['target'] = ((df['A'] + df['B']) > 1000000).astype(int)
print(df.head(10))

# Print dataset statistics
print("Dataset statistics:")
print(f"Number of rows: {len(df)}")
print(f"Number of unique categorical features: {df['A'].nunique() + df['B'].nunique()}")

# Build a sklearn pipeline that one-hot encodes the integers

# Create the pipeline with a LightGBM regression model
pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['A', 'B'])
    ])),
    ('classifier', LGBMClassifier())
])

# Fit the model while timing the process
import time
start = time.time()
pipeline.fit(df[['A', 'B']], df['target'])
end = time.time()

# Print the metrics
print("LightGBM")
print("MAE:", np.mean(np.abs(pipeline.predict(df[['A', 'B']]) - df['target'])))
print("Number of parameters:", pipeline['classifier'].n_features_)
print("Fitting took {} seconds".format(end - start))

# Now fit a FLAML model instead with LightGBM and the same data
pipeline = Pipeline(
    [
        (
            'preprocessor', ColumnTransformer(
                [
                    ('onehot', OneHotEncoder(handle_unknown='ignore'), ['A', 'B'])
                ]
            )
        ),
        (
            'classifier', AutoML(
                task='classification',
            )
        )
    ]
)
start = time.time()
pipeline.fit(df[['A', 'B']], df['target'])
end = time.time()

# Print the metrics
print("FLAML")
print("MAE:", np.mean(np.abs(pipeline.predict(df[['A', 'B']]) - df['target'])))
print("Fitting took {} seconds".format(end - start))
```

Output:

    A       B  target
0  435201   20896       0
1  650113  830984       1
2  296903  335422       0
3  733399  881420       1
4  968169  600231       1
5  243170   94188       0
6  303474  476584       0
7  950125  645745       1
8  607060  432579       1
9  741146  550150       1
Dataset statistics:
Number of rows: 500000
Number of unique categorical features: 787371
LightGBM
MAE: 0.499826
Number of parameters: 787371
Fitting took 3.257169246673584 seconds
[flaml.automl: 11-04 21:20:45] {2600} INFO - task = classification
[flaml.automl: 11-04 21:20:45] {2602} INFO - Data split method: stratified
[flaml.automl: 11-04 21:20:45] {2605} INFO - Evaluation method: holdout
[flaml.automl: 11-04 21:20:46] {2727} INFO - Minimizing error metric: 1-roc_auc
[flaml.automl: 11-04 21:20:46] {2777} WARNING - No search budget is provided via time_budget or max_iter. Training only one model per estimator. To tune hyperparameters for each estimator, please provide budget either via time_budget or max_iter.
[flaml.automl: 11-04 21:20:46] {2869} INFO - List of ML learners in AutoML Run: ['extra_tree', 'lgbm', 'rf', 'xgboost', 'xgb_limitdepth', 'lrl1']
[flaml.automl: 11-04 21:20:46] {3164} INFO - iteration 0, current learner extra_tree
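Regarding the note above that the FLAML run hangs without immediately ballooning RAM: one way (not part of the original repro) to narrow this down is to skip the Pipeline wrapper, feed the already-encoded sparse matrix to AutoML directly, and cap the search with a time budget so a single slow trial cannot run forever:

```python
# Diagnostic sketch; df and the ColumnTransformer come from the repro above.
from flaml import AutoML

X_enc = pipeline['preprocessor'].fit_transform(df[['A', 'B']])  # sparse CSR input

automl = AutoML()
# time_budget (in seconds) bounds the whole search, so we can see how far it gets.
automl.fit(X_enc, df['target'], task='classification', time_budget=60)
print(automl.best_estimator, automl.best_config)
```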
jingdong00 commented 1 year ago

Hi, I tried to look into this and found that FLAML actually preserves the sparsity of the data. I did not finish debugging, but I found that FLAML uses LGBM with different parameters, which may be the cause.

For your first example, the plain LGBMRegressor creates boosters with the following parameters: boosting_type=gbdt colsample_bytree=1.0 learning_rate=0.1 max_depth=-1 min_child_samples=20 min_child_weight=0.001 min_split_gain=0.0 n_jobs=-1 num_leaves=31 reg_alpha=0.0 reg_lambda=0.0 subsample=1.0 subsample_for_bin=200000 subsample_freq=0 verbose=-1 objective=regression metric=regression num_iterations=100

FLAML, on the other hand, creates boosters with the following parameters: boosting_type=gbdt colsample_bytree=0.7019911744574896 learning_rate=0.022635758411078528 max_depth=-1 min_child_samples=2 min_child_weight=0.001 min_split_gain=0.0 n_jobs=-1 num_leaves=122 reg_alpha=0.004252223402511765 reg_lambda=0.11288241427227624 subsample=1.0 subsample_for_bin=200000 subsample_freq=0 max_bin=511 verbose=-1 objective=regression metric=regression num_iterations=4797

Notice that FLAML allows for many more iterations and leaves, which (I think) is due to the specification of the search space. I am not quite sure why FLAML takes such a long time, but I don't think it is due to the sparsity of the data.
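One way to test that hypothesis (a sketch only: X_enc and y stand for the sparse one-hot matrix and the target from the first repro, and the values are copied from the FLAML booster listed above) is to fit a plain LGBMRegressor with FLAML's chosen parameters and compare the wall-clock time against the default fit:

```python
import time
from lightgbm import LGBMRegressor

# X_enc, y: the sparse one-hot matrix and target from the first repro (assumed in scope).
flaml_like = LGBMRegressor(
    colsample_bytree=0.7019911744574896,
    learning_rate=0.022635758411078528,
    min_child_samples=2,
    num_leaves=122,
    reg_alpha=0.004252223402511765,
    reg_lambda=0.11288241427227624,
    max_bin=511,
    n_estimators=4797,  # corresponds to num_iterations=4797 above
)
start = time.time()
flaml_like.fit(X_enc, y)
print("Fit with FLAML's trial parameters took {:.1f} seconds".format(time.time() - start))
```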

jingdong00 commented 1 year ago

Hi @glmcdona ,

I think I've found a fix for this problem. For the first task, what you can do to significantly reduce the training time is to set a max_iter constraint and set the metric to 'mae'.

Here's my code and output; FLAML achieves roughly the same accuracy in a much shorter time.

```python
automl = AutoML(
    task='regression',
    estimator_list=['lgbm'],
    max_iter=10,
)
automl.fit(X_train, y_train, metric="mae")
```

Outputs:
Dataset statistics:
Number of rows: 500000
Number of unique categorical features: 786695
Data type after one-hot encoding: <class 'scipy.sparse.csr.csr_matrix'>
LightGBM
MAE: 0.49999983175033913
Fitting took 1.551173210144043 seconds

FLAML best estimator: LGBMRegressor(learning_rate=0.09999999999999995, max_bin=255, n_estimators=4, num_leaves=4, reg_alpha=0.0009765625, reg_lambda=1.0, verbose=-1)
MAE: 0.49999983175033913
Fitting took 78.22783994674683 seconds

sonichi commented 1 year ago

> Hi @glmcdona ,
>
> I think I've found a fix for this problem. For the first task, what you can do to significantly reduce the training time is to set a max_iter constraint and set the metric to 'mae'.
>
> Here's my code and output; FLAML achieves roughly the same accuracy in a much shorter time.
>
> automl = AutoML(task='regression', estimator_list=['lgbm'], max_iter=10)
> automl.fit(X_train, y_train, metric="mae")
>
> Outputs:
> Dataset statistics:
> Number of rows: 500000
> Number of unique categorical features: 786695
> Data type after one-hot encoding: <class 'scipy.sparse.csr.csr_matrix'>
> LightGBM
> MAE: 0.49999983175033913
> Fitting took 1.551173210144043 seconds
>
> FLAML best estimator: LGBMRegressor(learning_rate=0.09999999999999995, max_bin=255, n_estimators=4, num_leaves=4, reg_alpha=0.0009765625, reg_lambda=1.0, verbose=-1)
> MAE: 0.49999983175033913
> Fitting took 78.22783994674683 seconds

If you check the console output or the log, how long does the first trial take in total?
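For reference, per-trial wall-clock times can also be recovered from FLAML's search log. A minimal sketch, assuming the same X_train/y_train as above and an arbitrary log file name (get_output_from_log lives in flaml.data in 1.x releases and was later moved to flaml.automl.data):

```python
from flaml import AutoML
from flaml.data import get_output_from_log  # flaml.automl.data in newer releases

automl = AutoML()
automl.fit(X_train, y_train, task='regression', estimator_list=['lgbm'],
           max_iter=10, metric='mae', log_file_name='lgbm_sparse.log')

# Each entry in time_history is the cumulative wall-clock time when a logged trial
# finished; the first entry corresponds to the first trial.
time_history, best_loss, losses, configs, metrics = get_output_from_log(
    filename='lgbm_sparse.log', time_budget=1e9)
print(time_history)
```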