Large datasets: support for Dask arrays?

Hoeze commented 3 years ago

Hi, I tried training a ExplainableBoostingRegressor using Dask arrays, but I keep running into the following issue:

ERROR:interpret.utils.all:Could not unify data of type: <class 'dask.array.core.Array'>

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-53-b5e1d33190e1> in <module>
      1 model = create_model()
      2 
----> 3 fold_models, train_preds, valid_preds = model_cv(model)

<ipython-input-48-178d0374bff6> in model_cv(model)
     17 
---> 18             fold_model = sklearn.clone(model).fit(x_train_fold, y_train_fold)
     19 

/opt/anaconda/envs/ebm/lib/python3.8/site-packages/interpret/glassbox/ebm/ebm.py in fit(self, X, y)
    744         # TODO: PK don't overwrite self.feature_names here (scikit-learn rules), and it's also confusing to
    745         #       user to have their fields overwritten.  Use feature_names_out_ or something similar
--> 746         X, y, self.feature_names, _ = unify_data(
    747             X, y, self.feature_names, self.feature_types, missing_data_allowed=False
    748         )

/opt/anaconda/envs/ebm/lib/python3.8/site-packages/interpret/utils/all.py in unify_data(data, labels, feature_names, feature_types, missing_data_allowed)
    325         msg = "Could not unify data of type: {0}".format(type(data))
    326         log.error(msg)
--> 327         raise ValueError(msg)
    328 
    329     new_labels = unify_vector(labels)

ValueError: Could not unify data of type: <class 'dask.array.core.Array'>

Each of my folds is a 2D array consisting of 56 features and occupying ~16GB of memory. Passing model.fit(X.compute(), y.compute() crashes memory after some time, probably because of Joblib copying data around unnecessarily.

interpret-ml commented 3 years ago

Hi @Hoeze,

Thanks for bringing this up! We haven't looked into explicitly supporting Dask before, but it seems to be worth investigating. It's not too difficult to fix this specific error in our utility function -- we'd just need to add a type check for Dask arrays there -- but there might be other issues in other parts of the algorithm that would be unmasked when we do this. At first glance, it does seem promising -- Dask's hook-in for Joblib's scheduler might make this easier to support than other large scale parallelization frameworks.

Unfortunately there's no exact timeline on when we can investigate this deeply, but we'll use this issue to track any progress we make on it.

-InterpretML Team

MainRo commented 3 years ago

+1 for some support of distributed processing. Possibly, using ray for this would serve several purposes:

It could unify local/distributed processing (where local/multiprocess mode does not require a ray cluster)
It would be compatible with other frameworks that run on top of ray, including dask

Hoeze commented 3 years ago

+1 for some support of distributed processing. Possibly, using ray for this would serve several purposes:

For the record, I'm also relying on dask-on-ray. However, dask is the common standard. Ray just provides a distributed scheduler for dask :)

candalfigomoro commented 3 years ago

Hi @Hoeze,

Thanks for bringing this up! We haven't looked into explicitly supporting Dask before, but it seems to be worth investigating. It's not too difficult to fix this specific error in our utility function -- we'd just need to add a type check for Dask arrays there -- but there might be other issues in other parts of the algorithm that would be unmasked when we do this. At first glance, it does seem promising -- Dask's hook-in for Joblib's scheduler might make this easier to support than other large scale parallelization frameworks.

Unfortunately there's no exact timeline on when we can investigate this deeply, but we'll use this issue to track any progress we make on it.

-InterpretML Team

Don't forget about Spark (PySpark) :) There's a joblib backend for Spark too: https://github.com/joblib/joblib-spark

Also see: https://github.com/interpretml/interpret/issues/243

onacrame commented 2 years ago

Another vote for dask here

interpretml / interpret

Large datasets: support for Dask arrays? #249