interpretml / DiCE

Generate Diverse Counterfactual Explanations for any machine learning model.
https://interpretml.github.io/DiCE/
MIT License

Why use the MAD in normalized features? #54

Closed wangyongjie-ntu closed 3 years ago

wangyongjie-ntu commented 3 years ago

The MAD is mainly for heterogeneous features (different features have different scales and ranges). If you normalize the features with a min-max scaler, all features are mapped into [0, 1].

In the adult example, the data interface normalizes the features, so why is the default setting "inverse_mad"? From my understanding, the l2 distance should be good. In the paper "Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR" (p. 18, Equation 5), the authors also suggest the l2 distance.

Have you found any differences between these two kinds of distance?
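
For concreteness, here is a minimal sketch of the two proximity terms being compared, in plain numpy (not DiCE's actual implementation; `mad` is assumed to be a vector of per-feature MADs computed on the training data):

```python
import numpy as np

def l2_distance(x, cf):
    # Plain Euclidean distance between a query point and a counterfactual,
    # assuming all features are already min-max scaled to [0, 1]
    # (the variant suggested by Wachter et al., Eq. 5).
    return np.linalg.norm(x - cf)

def inverse_mad_l1_distance(x, cf, mad):
    # L1 distance with each feature j weighted by 1 / MAD_j
    # (the idea behind the "inverse_mad" default).
    return np.sum(np.abs(x - cf) / mad)
```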

raam93 commented 3 years ago

You are correct in the interpretation of MAD; however, you missed the fact that the distribution of each feature after min-max scaling stays the same. Any scaling method does not change the shape of the data distribution, only its range. So dividing the distance by the MAD (the "inverse_mad" option) captures the relative prevalence of observing the feature at a particular value. Please refer to Sec. 3.3, "Choice of distance function", in our paper, or to the paragraph below Equation 4 in the Wachter et al. paper. That said, I agree that you could use l2 distance scaled by standard deviation or any other variant, but we found the l1-MAD option to work well in most scenarios in our experiments. In any case, we will include more options for the distance loss soon.
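
A quick numerical check of this point, using synthetic features (the names and distributions here are illustrative, not taken from the adult dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
features = {
    "symmetric": rng.normal(40, 12, size=10_000),  # roughly Gaussian
    "skewed": rng.exponential(10, size=10_000),    # long right tail
}

def min_max(x):
    return (x - x.min()) / (x.max() - x.min())

def mad(x):
    return np.median(np.abs(x - np.median(x)))

for name, x in features.items():
    print(f"{name}: MAD after min-max scaling = {mad(min_max(x)):.3f}")

# Both features now live in [0, 1], but their MADs still differ:
# min-max scaling changes the range, not the shape of the distribution,
# so the 1/MAD weights remain informative after normalization.
```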

Meanwhile, in the Wachter et al. paper the authors do not suggest using the l2 distance in all cases; rather, they experiment with different variants of the distance function and even show that the l1-MAD option generates sparser results (last paragraph of the LSAT data section, p. 20).