Hoeze opened this issue 3 years ago
Hi @Hoeze,
Thanks for bringing this up! We haven't looked into explicitly supporting Dask before, but it seems worth investigating. It's not too difficult to fix this specific error in our utility function -- we'd just need to add a type check for Dask arrays there -- but other issues elsewhere in the algorithm might be unmasked once we do. At first glance it does seem promising -- Dask's hook-in for Joblib's scheduler might make this easier to support than other large-scale parallelization frameworks.
Unfortunately there's no exact timeline on when we can investigate this deeply, but we'll use this issue to track any progress we make on it.
-InterpretML Team
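For anyone who wants to experiment in the meantime, the Dask/Joblib hook-in mentioned above is enabled roughly as below. This is a minimal sketch: the tiny random dataset is a placeholder, and whether interpret's internal Joblib calls actually get dispatched to the Dask workers is exactly what this issue needs to establish.

```python
# Minimal sketch: route Joblib work through Dask's distributed scheduler.
# Whether interpret's internal Joblib calls actually end up on the Dask
# workers is untested here; the random dataset is just a placeholder.
import joblib
import numpy as np
from dask.distributed import Client  # importing this registers the "dask" Joblib backend
from interpret.glassbox import ExplainableBoostingRegressor

client = Client()  # local cluster; pass an address to use a remote one

X_train = np.random.rand(1000, 5)
y_train = np.random.rand(1000)

ebm = ExplainableBoostingRegressor()
with joblib.parallel_backend("dask"):
    ebm.fit(X_train, y_train)
```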
+1 for some support of distributed processing. Possibly, using Ray for this would serve several purposes:
For the record, I'm also relying on dask-on-ray. However, Dask is the common standard; Ray just provides a distributed scheduler for Dask :)
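For context, the dask-on-ray setup boils down to swapping Ray in as the scheduler that executes Dask's task graph. A sketch, with made-up array shapes and chunk sizes:

```python
# Sketch of the dask-on-ray setup mentioned above: Dask builds the task
# graph, Ray executes it. Shapes and chunk sizes are made up.
import ray
import dask.array as da
from ray.util.dask import enable_dask_on_ray

ray.init()
enable_dask_on_ray()  # make Ray the default scheduler for Dask collections

X = da.random.random((1_000_000, 56), chunks=(100_000, 56))
X_mean = X.mean(axis=0).compute()  # this .compute() now runs on Ray workers
```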
Don't forget about Spark (PySpark) :) There's a joblib backend for Spark too: https://github.com/joblib/joblib-spark
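In the same spirit as the Dask backend above, joblib-spark registers a Spark backend with Joblib. A sketch, assuming pyspark is installed and a SparkSession can be created; whether interpret's internals actually distribute this way is untested:

```python
# Sketch of the joblib-spark backend mentioned above; assumes pyspark is
# available. Whether EBM training actually benefits is untested here.
import joblib
import numpy as np
from joblibspark import register_spark
from interpret.glassbox import ExplainableBoostingRegressor

register_spark()  # registers the "spark" backend with Joblib

X_train = np.random.rand(1000, 5)
y_train = np.random.rand(1000)

ebm = ExplainableBoostingRegressor()
with joblib.parallel_backend("spark", n_jobs=4):
    ebm.fit(X_train, y_train)
```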
Also see: https://github.com/interpretml/interpret/issues/243
Another vote for Dask here.
Hi, I tried training an ExplainableBoostingRegressor using Dask arrays, but I keep running into the following issue:
Each of my folds is a 2D array with 56 features, occupying ~16 GB of memory. Passing
model.fit(X.compute(), y.compute())
exhausts memory and crashes after some time, probably because Joblib copies the data around unnecessarily.
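For reference, a rough reconstruction of this setup is below. The data here is random and much smaller than the real ~16 GB folds, and the float32 down-cast / row subsampling at the end is only a possible stop-gap I'm assuming until Dask arrays are supported natively, not something the report above tried:

```python
# Rough reconstruction of the setup described above; the data is random and
# far smaller than the real ~16 GB folds, and the down-cast/subsampling
# workaround is an assumption, not part of the original report.
import dask.array as da
from interpret.glassbox import ExplainableBoostingRegressor

n_rows = 1_000_000  # the real folds are far larger (~16 GB, 56 features)
X = da.random.random((n_rows, 56), chunks=(100_000, 56))
y = da.random.random((n_rows,), chunks=(100_000,))

# Materialising everything at once is what exhausts memory on the real data:
#   model.fit(X.compute(), y.compute())
# Possible stop-gap: shrink the data before pulling it into RAM.
X_small = X.astype("float32")[::10].compute()  # down-cast and subsample rows
y_small = y[::10].compute()

model = ExplainableBoostingRegressor()
model.fit(X_small, y_small)
```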