interpretml / interpret-community

Interpret Community extends Interpret repository with additional interpretability techniques and utility functions to handle real-world datasets and workflows.
https://interpret-community.readthedocs.io/en/latest/index.html
MIT License

'Expecting data to be a DMatrix object, got: ', <class 'pandas.core.frame.DataFrame'> #498

Open yzheng27 opened 2 years ago

yzheng27 commented 2 years ago

I was following the example https://github.com/interpretml/interpret-community/blob/master/notebooks/explain-regression-local.ipynb with my own data and xgboost object, but got the error ('Expecting data to be a DMatrix object, got: ', <class 'pandas.core.frame.DataFrame'>) at explainer.explain_global(x_test). Changing x_test to a DMatrix generates the error 'DMatrix' object has no attribute 'shape'. Please advise. Thank you.

x_train, x_test, y_train, y_test = train_test_split(df[features], df[LABEL], test_size=0.2, random_state=0)

from interpret.ext.blackbox import TabularExplainer
explainer = TabularExplainer(model, 
                             x_train, 
                             model_task = 'regression',
                             features=features)
global_explanation = explainer.explain_global(x_test)
# xgtest = xgb.DMatrix(x_test.values)
# global_explanation = explainer.explain_global(xgtest)

Version: interpret-community==0.23.0 interpret-core==0.2.7 xgboost==1.4.1

gaugup commented 2 years ago

@yzheng27 Thanks for reporting the issue. Could you try the latest interpret-community release, 0.24.2, and see if you still hit this? If you do, could you provide a sample notebook so that we can reproduce the issue locally? A stack trace of the error would also help us greatly in triaging.

Regards,

imatiach-msft commented 2 years ago

@gaugup I think the issue is happening because they are using the native XGBoost API, which uses DMatrix, instead of the scikit-learn XGBoost API, which is pandas-compatible, so I'm guessing that upgrading to the latest version won't fix it. @yzheng27 I will take a look to see if we can support DMatrix from XGBoost somehow, but an easy quick fix would be to use the scikit-learn API for XGBoost.

yzheng27 commented 2 years ago

Thank you. I was able to generate the global_explanation by loading the model with the scikit-learn interface. But now my notebook has been running the code below for several hours. Is that expected? The shape of x_test is around 24000*325.

ExplanationDashboard(global_explanation, model, dataset=x_test, true_y=y_test, public_ip=host, port=7780)

imatiach-msft commented 2 years ago

"the shape of x_test is around 24000*325" @yzheng27 yes that may be too large for the UI to handle, please limit it by downsampling to ~5k rows instead of 24K. If you are still seeing issues with downsampled data, then there might be something about the host/port configuration. However even then you should still see the dashboard, just what-if analysis and ICE plots won't work in the ExplanationDashboard.
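A small sketch of the suggested downsampling (the ~5k threshold comes from the comment above; the helper name and variable names are illustrative): sampling the DataFrame's index keeps the features and labels aligned.

```python
import pandas as pd

def downsample(x_test, y_test, n=5000, seed=0):
    """Downsample x_test/y_test to at most n aligned rows for the dashboard."""
    if len(x_test) <= n:
        return x_test, y_test
    # Sample row labels from x_test, then select the same labels from y_test
    # so features and targets stay in correspondence.
    idx = x_test.sample(n=n, random_state=seed).index
    return x_test.loc[idx], y_test.loc[idx]
```

Then pass the reduced pair to the dashboard, e.g. x_small, y_small = downsample(x_test, y_test) before constructing ExplanationDashboard.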

imatiach-msft commented 2 years ago

@yzheng27 one other thing, are you importing the dashboard from raiwidgets package, on this repository:

from raiwidgets import ExplanationDashboard

https://github.com/microsoft/responsible-ai-toolbox

Make sure you don't import it from interpret-community package, as it has been moved to the other repository.

Also, can you run:

pip show raiwidgets

to check that you have the latest version of raiwidgets package with ExplanationDashboard?

yzheng27 commented 2 years ago

@imatiach-msft I'm importing the dashboard from raiwidgets, and the version is 0.15.1.

I was able to get the dashboard with the data dimensions I mentioned, though it took several hours. Will try with the smaller data.

imatiach-msft commented 2 years ago

@yzheng27 If it took several hours but eventually worked, then the UI likely just loaded too much data, and downsampling should speed it up significantly. All of the datapoints are loaded into the UI, and I've noticed that it usually becomes very slow beyond ~5k datapoints. Perhaps in the future the UI could stream selected data from the Python backend or aggregate statistics across multiple points for users who want to run it on a lot of data, but I'm not sure. The ErrorAnalysisDashboard is actually able to work on millions of points if you pass in a sample_dataset for the Dataset Explorer, so perhaps something similar could be done for the ExplanationDashboard as well:

https://github.com/microsoft/responsible-ai-toolbox/blob/main/raiwidgets/tests/test_error_analysis_dashboard.py#L83