glmcdona closed this issue 1 year ago.
Hi, I tried to look into this and found that FLAML actually preserves the sparsity of the data. I did not finish debugging, but I found that FLAML uses LightGBM with different parameters, which may be the cause of the slowdown.
For your first example, the plain LGBMRegressor will create boosters with the following parameters:
boosting_type=gbdt colsample_bytree=1.0 learning_rate=0.1 max_depth=-1 min_child_samples=20 min_child_weight=0.001 min_split_gain=0.0 n_jobs=-1 num_leaves=31 reg_alpha=0.0 reg_lambda=0.0 subsample=1.0 subsample_for_bin=200000 subsample_freq=0 verbose=-1 objective=regression metric=regression num_iterations=100
While FLAML will create boosters with the following parameters:
boosting_type=gbdt colsample_bytree=0.7019911744574896 learning_rate=0.022635758411078528 max_depth=-1 min_child_samples=2 min_child_weight=0.001 min_split_gain=0.0 n_jobs=-1 num_leaves=122 reg_alpha=0.004252223402511765 reg_lambda=0.11288241427227624 subsample=1.0 subsample_for_bin=200000 subsample_freq=0 max_bin=511 verbose=-1 objective=regression metric=regression num_iterations=4797
Notice that FLAML allows for many more iterations and leaves, which (I think) is due to the specification of the search space. I am not quite sure why FLAML takes such a long time, but I don't think it is due to the sparsity of the data.
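For reference, here is a rough sketch (mine, not from this thread) of how to compare the two configurations yourself, assuming lgbm is a fitted LGBMRegressor and automl is a fitted AutoML instance; attribute names may vary slightly across FLAML versions:

# Hedged sketch: compare the defaults with the configuration FLAML settled on
print(lgbm.get_params())                    # defaults of the plain LGBMRegressor
print(automl.best_config)                   # hyperparameters chosen by FLAML's search
print(automl.model.estimator.get_params())  # the underlying LGBMRegressor that FLAML fit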
Hi @glmcdona ,
I think I've found a fix for this problem. For the first example, you can significantly reduce the training time by setting a max_iter constraint and specifying the metric as 'mae'.
Here's my code and output, for which FLAML achieves roughly the same accuracy in a much shorter time.
automl = AutoML(
    task='regression',
    estimator_list=['lgbm'],
    max_iter=10,
)
automl.fit(X_train, y_train, metric="mae")
Outputs:
Dataset statistics:
Number of rows: 500000
Number of unique categorical features: 786695
Data type after one-hot encoding: <class 'scipy.sparse.csr.csr_matrix'>
LightGBM
MAE: 0.49999983175033913
Fitting took 1.551173210144043 seconds

FLAML best estimator: LGBMRegressor(learning_rate=0.09999999999999995, max_bin=255,
              n_estimators=4, num_leaves=4, reg_alpha=0.0009765625, reg_lambda=1.0,
              verbose=-1)
MAE: 0.49999983175033913
Fitting took 78.22783994674683 seconds
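Since the long runtime seems related to the very large n_estimators and num_leaves values in FLAML's default LGBM search space (as noted above), another option is to register a learner with a tighter search space. This is only a sketch based on the custom-learner pattern from the FLAML docs, not something from this thread; the class name SmallLGBM and the exact bounds are my own choices, and the import path for LGBMEstimator may differ by FLAML version:

from flaml import AutoML, tune
from flaml.model import LGBMEstimator  # flaml.automl.model in newer versions

class SmallLGBM(LGBMEstimator):
    # Same LGBM estimator, but with a capped search space so every trial stays cheap
    @classmethod
    def search_space(cls, data_size, **params):
        return {
            "n_estimators": {"domain": tune.lograndint(lower=4, upper=128), "init_value": 4, "low_cost_init_value": 4},
            "num_leaves": {"domain": tune.lograndint(lower=4, upper=128), "init_value": 4, "low_cost_init_value": 4},
            "learning_rate": {"domain": tune.loguniform(lower=0.01, upper=1.0), "init_value": 0.1},
        }

automl = AutoML()
automl.add_learner(learner_name="small_lgbm", learner_class=SmallLGBM)
automl.fit(X_train, y_train, task="regression", estimator_list=["small_lgbm"], metric="mae", max_iter=10)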
If you check the console output or the log, how long does the first trial take in total?
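(One way to capture that, sketched here under the assumption of a recent FLAML where fit accepts log_file_name and verbose, is to write the search log to a file and raise the verbosity, then look at the timestamps around the first iteration:)

automl = AutoML()
automl.fit(
    X_train, y_train,
    task="regression", estimator_list=["lgbm"], max_iter=10, metric="mae",
    log_file_name="flaml_sparse.log",  # per-iteration records end up here
    verbose=3,                         # console INFO lines include wall-clock timestamps
)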
Classification and regression models often leverage sparse arrays. For example, when composing a scikit-learn pipeline, sparse arrays are produced by categorical one-hot transforms and n-gram featurizers like CountVectorizer and TfidfVectorizer.
These sparse arrays are passed into the learners (e.g. LightGBM, XGBoost, LogisticRegression), which are designed to handle them efficiently, allowing classifiers with very high dimensionality to be trained.
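To make the setup concrete, here is a small sketch (mine, not from the issue) showing the kind of input involved; OneHotEncoder emits a scipy.sparse CSR matrix by default, and that is what should ideally reach the learner untouched:

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(np.random.randint(0, 1000, size=(100, 2)), columns=['A', 'B'])
X = OneHotEncoder(handle_unknown='ignore').fit_transform(df)
print(type(X), sparse.issparse(X))  # a scipy CSR matrix, True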
The fix, I think, is to ensure we keep arrays in their sparse formats (e.g. scipy.sparse.csr_matrix), at least when passing them into the underlying learner, and preferably throughout FLAML's handling of the data as well. Here are a few repro examples that illustrate the issue; note that generally any example using OneHotEncoder, CountVectorizer, or TfidfVectorizer will repro similarly, and these transforms are common.
Example 1 - LightGBM regression repro. The FLAML execution part appears to densify the sparse matrix output from the ColumnTransformer (taking up 5 GB of RAM) and gets stuck, failing to fit the learner. A couple of notes:
import numpy as np
import pandas as pd

from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score

from flaml import AutoML
from lightgbm import LGBMRegressor
# Create an example dataframe with random integers and a target column
df = pd.DataFrame(np.random.randint(0, 1000000, size=(500000, 2)), columns=['A', 'B'])
df['target'] = ((df['A'] + df['B']) > 1000000).astype(int)
print(df.head(10))
# Print dataset statistics
print("Dataset statistics:")
print(f"Number of rows: {len(df)}")
print(f"Number of unique categorical features: {df['A'].nunique() + df['B'].nunique()}")
# Build a sklearn pipeline that one-hot encodes the integers,
# then create the pipeline with a LightGBM regression model
pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['A', 'B'])
    ])),
    ('classifier', LGBMRegressor())
])
# Fit the model while timing the process
import time
start = time.time()
pipeline.fit(df[['A', 'B']], df['target'])
end = time.time()
# Print the metrics
print("LightGBM")
print("MAE:", np.mean(np.abs(pipeline.predict(df[['A', 'B']]) - df['target'])))
print("Number of parameters:", pipeline['classifier'].n_features_)
print("Fitting took {} seconds".format(end - start))
Now fit a FLAML model instead with LightGBM and the same data
pipeline = Pipeline( [ ( 'preprocessor', ColumnTransformer( [ ('onehot', OneHotEncoder(handle_unknown='ignore'), ['A', 'B']) ] ) ), ( 'classifier', AutoML( task='regression', estimator_list=['lgbm'], ) ) ] ) start = time.time() pipeline.fit(df[['A', 'B']], df['target']) end = time.time()
# Print the metrics
print("FLAML")
print("MAE:", np.mean(np.abs(pipeline.predict(df[['A', 'B']]) - df['target'])))
print("Fitting took {} seconds".format(end - start))
Output:

        A       B  target
0  392214  547064       0
1  733104   70527       0
2  435315  879075       1
3    3221  115858       0
4  507700  113589       0
5  746200  312285       1
6  671553  164952       0
7  486703  540223       1
8  540288  793887       1
9  201758  563042       0
Dataset statistics:
Number of rows: 500000
Number of unique categorical features: 786525
LightGBM
MAE: 0.499999640448
Number of parameters: 786525
Fitting took 3.1742513179779053 seconds
[flaml.automl: 11-04 21:18:01] {2600} INFO - task = regression
[flaml.automl: 11-04 21:18:01] {2602} INFO - Data split method: uniform
[flaml.automl: 11-04 21:18:01] {2605} INFO - Evaluation method: holdout
[flaml.automl: 11-04 21:18:01] {2727} INFO - Minimizing error metric: 1-r2
[flaml.automl: 11-04 21:18:01] {2777} WARNING - No search budget is provided via time_budget or max_iter. Training only one model per estimator. To tune hyperparameters for each estimator, please provide budget either via time_budget or max_iter.
[flaml.automl: 11-04 21:18:01] {2869} INFO - List of ML learners in AutoML Run: ['lgbm']