interpretml / DiCE

Generate Diverse Counterfactual Explanations for any machine learning model.
https://interpretml.github.io/DiCE/
MIT License

Confusing running time on different datasets #150

Open hobbitlzy opened 3 years ago

hobbitlzy commented 3 years ago

I am trying to find 5 CF examples on two datasets: iris and wine. However, it sometimes takes more time for iris than for wine (25s vs 3s), and the time spent on iris also fluctuates. The iris dataset has fewer features, so I expected it to be faster. What is the relationship between explanation time and the number of features (and the number of queries and CF examples)? There are 5 queries, and each query is required to find 5 CF examples.

gaugup commented 3 years ago

@hobbitlzy, could you share some sample code that you are running for computing explanations for iris and wine?

The model-agnostic approaches in DiCE are inference heavy, which means they use the trained model to infer predictions and spend significant time doing so. If the trained model's inference operations (such as predict()/predict_proba()) are slow, the overall runtime will be slow.
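A rough sketch of the point above: a perturbation-based search issues many small predict()/predict_proba() calls, so the model's per-call inference cost can dominate the total runtime. This is a minimal illustration, assuming scikit-learn and a default RandomForestClassifier; the exact numbers will vary by machine.

```python
import time

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

df = load_iris(as_frame=True).frame
X, y = df.drop(columns="target"), df["target"]
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Simulate the many single-instance inference calls a perturbation-based
# counterfactual search would issue against the trained model.
n_calls = 200
start = time.perf_counter()
for _ in range(n_calls):
    clf.predict_proba(X.iloc[:1])
elapsed = time.perf_counter() - start
print(f"{n_calls} predict_proba calls took {elapsed:.2f}s")
```

If these repeated calls already take seconds, the explainer built on top of them will be at least that slow, regardless of the dataset's feature count.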

However, we should investigate why iris exhibits such large deviations in runtime.

hobbitlzy commented 3 years ago

I followed the code in "DiCE_multiclass_classification_and_regression.ipynb". The wine dataset is loaded with df_iris = load_wine(as_frame=True).frame in place of df_iris = load_iris(as_frame=True).frame. I don't think I changed anything else except the number of queries. The underlying model is RandomForestClassifier, as specified in the original code; I have also tested other underlying models, such as SVM. Do you mean that the perturbed instances need to be fed to the underlying model in each iteration?
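The dataset swap described above can be sketched as follows (a minimal reconstruction, assuming scikit-learn; the DiCE explainer from the notebook would then wrap the dataframe and classifier). Note that iris has 4 features against wine's 13, which is why feature count alone would suggest iris should be the faster dataset:

```python
from sklearn.datasets import load_iris, load_wine
from sklearn.ensemble import RandomForestClassifier

# The notebook's load_iris call, and the load_wine call swapped in for it.
df_iris = load_iris(as_frame=True).frame   # 4 features + "target" column
df_wine = load_wine(as_frame=True).frame   # 13 features + "target" column

# Underlying model as in the original notebook code.
clf = RandomForestClassifier(random_state=0).fit(
    df_iris.drop(columns="target"), df_iris["target"]
)

print(df_iris.shape[1] - 1, df_wine.shape[1] - 1)  # feature counts: 4 13
```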

gaugup commented 3 years ago

The features are perturbed by the DiCE library itself; you need to specify which features you wish to perturb. Let me try out wine. BTW, which DiCE ML version are you using?

gaugup commented 3 years ago

@hobbitlzy, I tried the wine dataset with 5 query instances and 5 counterfactuals each. I didn't see the runtime for computing counterfactuals vary much across multiple runs.

Also, you can change the method when initializing the dice-ml explainer:

exp_iris = Dice(d_iris, m_iris, method="random"), where method can be "random", "kdtree", or "genetic". The "random" method is generally the fastest of the three (though this is subjective and can vary with the dataset).

hobbitlzy commented 3 years ago

> The features are perturbed by the DiCE library itself. You need to specify which features you wish to perturb. Let me try out wine. BTW which DICE ML version are you using?

It is 0.6.1. Thanks very much for your reply, but I think the problem is that the iris dataset is time-consuming (I also tried the "random" method), which I find strange. I do not know whether it is normal to spend around 25s explaining iris. Could you please share the typical time you spend on it, for reference?
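Since the iris timings fluctuate from run to run, one way to compare "typical" times fairly is to repeat each measurement and report the minimum and mean rather than a single run. This is a stdlib-only sketch; `run_explainer` is a hypothetical stand-in for the actual generate_counterfactuals() call being benchmarked.

```python
import statistics
import time


def benchmark(fn, repeats=5):
    """Time fn() several times; return (best, mean) in seconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times), statistics.mean(times)


def run_explainer():
    # Placeholder workload; substitute the real DiCE explainer call here.
    sum(i * i for i in range(100_000))


best, avg = benchmark(run_explainer)
print(f"best={best:.3f}s mean={avg:.3f}s")
```

Reporting the best-of-N time filters out one-off slowdowns (warm-up, GC, background load) and makes timings from different machines easier to compare.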

amit-sharma commented 3 years ago

Paging @soundarya98; it will be good to resolve this.