ALE_chapter: Intervals, Piece-Wise Constant Models and Categorical Features

Here is my review:

In general, the structure of the chapter is fine. You also show quite well how the approximation to the theoretical ALE depends on the data and the choice of intervals. The biggest issue in my eyes is the long equations, which are a bit inconvenient to follow. I also think it would be nice to give the examples meaningful names, especially when they are actual subtitles. Under the titles and subtitles of the chapter I made more detailed comments on each paragraph.

A general introduction to the chapter is missing. You give good intuition for why intervals are an issue, but I would also write something about the theoretical ALE and the categorical features as an intro to the whole chapter. This can be very short, just to give an overview.

How to choose the number and/or length of the intervals

Good introduction to this chapter.

State of the art

important to mention that.

ALE Approximations

The choice of intervals and the approximation of the theoretical ALE seem a little bit mixed up. -> At least the subtitle should mention that the aim is to approximate the theoretical ALE, maybe something like: ALE estimation vs. theoretical ALE

Example 1

$\hat{f}_1 (x_1, x_2) = (x_1-4)(x_1-5)(x_1-6) + x_2^3$ is pretty long and uncomfortable to handle. -> Is it possible to show the same effect with a simpler polynomial of degree 3 (e.g. $x_1^3$, or $x_1^3/5$ to flatten it out)?
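
A short derivation of why a simpler cubic should behave the same way in an additive model. It assumes the usual first-order ALE definition with lower bound $z_0$ and centering constant $c$; this notation, and $g$ as a placeholder for the cubic in $x_1$, are assumptions and not taken from the chapter:

$$
\hat{f}(x_1, x_2) = g(x_1) + x_2^3
\;\Longrightarrow\;
\frac{\partial \hat{f}}{\partial x_1} = g'(x_1),
\qquad
ALE_1(x) = \int_{z_0}^{x} E\left[ g'(z) \mid X_1 = z \right] dz - c = g(x) - g(z_0) - c
$$

The $x_2^3$ term drops out whichever cubic is chosen, so $g(x_1) = x_1^3/5$ would give the centered cubic as the theoretical curve, with much shorter intermediate equations.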

Example 2

For the formula $\hat{f}_1 (x_1, x_2) = (x_1-4)(x_1-5)(x_1-6)x_2^3$ the equations are very long and a bit uncomfortable to follow. Same as in Example 1 -> if it is possible to show the same effect with a shorter equation, that would be nice.
Your explanation for the relatively bad fit of the estimated ALE (compared to the theoretical ALE) is that there are not enough data points in the crucial x1 area (between x1=7 and x1=10). It would be interesting to see whether this problem disappears when using 1000 or 10000 data points instead of 100. In my eyes this would prove your explanation.
Your planned plot 'which shows 50 ALE estimations (on different data samples)' is also a good idea, definitely do that.
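
A minimal sketch of how the sample-size check and the 50 repeated estimations could be set up. It is not the chapter's setup: the data distribution (x1 uniform on [0, 10], x2 uniform on [0, 1], independent), the 10 quantile-based intervals, and all function names are assumptions made for illustration. Under these assumptions the uncentered theoretical ALE is $(g(x) - g(z_0)) \cdot E[x_2^3] = (g(x) - g(z_0))/4$, where $g$ is the cubic in $x_1$.

```python
import numpy as np

def g(x1):
    # Cubic factor of the Example 2 prediction function.
    return (x1 - 4) * (x1 - 5) * (x1 - 6)

def f_hat(x1, x2):
    return g(x1) * x2**3

def ale_uncentered(x1, x2, n_intervals=10):
    """Uncentered first-order ALE estimate for x1 on a quantile grid."""
    z = np.quantile(x1, np.linspace(0, 1, n_intervals + 1))
    # Interval index (0 .. K-1) of every observation.
    idx = np.clip(np.searchsorted(z, x1, side="right") - 1, 0, n_intervals - 1)
    # Prediction difference between the bounds of each observation's interval.
    diffs = f_hat(z[idx + 1], x2) - f_hat(z[idx], x2)
    local = np.array([diffs[idx == k].mean() for k in range(n_intervals)])
    return z, np.concatenate(([0.0], np.cumsum(local)))

def mean_max_error(n, reps=50, seed=0):
    """Average (over reps samples) of the largest gap to the theoretical ALE."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(reps):
        x1 = rng.uniform(0, 10, n)   # assumed distribution, not from the chapter
        x2 = rng.uniform(0, 1, n)    # assumed distribution, so E[x2^3] = 1/4
        z, ale = ale_uncentered(x1, x2)
        truth = (g(z) - g(z[0])) / 4
        errors.append(np.max(np.abs(ale - truth)))
    return float(np.mean(errors))

for n in (100, 1000, 10000):
    print(n, round(mean_max_error(n), 3))
```

If the lack of data points is the cause of the bad fit, the printed error should shrink as n grows. Plotting the 50 individual curves per n instead of averaging their errors would give the planned plot.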

Example 3

I think it is a good idea to look at this prediction function: $\sin(10 x_1) \cdot x_2$

Problems with piece-wise constant models

I see the same problems here and think it is a good choice to tackle them here.

Example 4

Here I think a simple example is enough to show the problems. -> So it might be good to choose a single simple tree rather than a random forest, to keep the example intuitive.
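
A minimal sketch of how little is needed for such an example. It assumes a hypothetical one-split tree on x1 (split at 4.2, feature range [0, 10]) and an equidistant grid; none of these values come from the chapter, and this shows only one way the interval choice can interact with a piece-wise constant model.

```python
import numpy as np

def stump(x1, split=4.2):
    # Hypothetical one-split tree: predicts 0 left of the split, 1 right of it.
    return np.where(np.asarray(x1) >= split, 1.0, 0.0)

def local_effects(n_intervals, lower=0.0, upper=10.0):
    """Per-interval prediction differences on an equidistant grid."""
    z = np.linspace(lower, upper, n_intervals + 1)
    # With a single feature, every observation in an interval yields the same
    # difference, so the interval average equals this difference.
    return z, stump(z[1:]) - stump(z[:-1])

for k in (5, 20):
    z, effects = local_effects(k)
    j = int(np.argmax(effects))
    print(f"K={k}: jump of {effects[j]:.0f} assigned to [{z[j]:.1f}, {z[j + 1]:.1f}]")
```

The whole jump lands in the one interval that contains the split, so the step is only located up to that interval's width. A plot that connects the grid points linearly would draw it as a ramp over that interval.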

Outlook

Sounds reasonable

Categorical Features

Ordering the features

more detailed but in general good intuition.

Example of ALE with categorical feature

I would make this one subchapter: 'Example of ALE with categorical feature and interpretation'
Interpretation

Changes of the ALE due to different orders

Good idea to show the effect of different orders.

Example

Here I guess artificial data can be useful in the first place. To sum this subchapter up, it could be nice to try another order for the example used above.

General structure:

Good choice of subchapters. Instead of putting the choice of intervals for first order ALE plots in 8.0.X, I would create a subchapter "8.1 Choice of Intervals for Continuous Features".

Intro:

Nice introduction to the issue of choosing the right number and size of intervals. The trade-off between small intervals and a sufficient amount of data points per interval was laid out well, including that a sufficiently high number of data points per interval is only necessary if there are effects of other features.

A general introduction to the chapter is missing. You give good intuition for why intervals are an issue, but I would also write something about the theoretical ALE and the categorical features as an intro to the whole chapter.

I agree with Jakob. Right now you are only introducing the reader to the problem of choosing the right number and size of intervals. Also introducing the problem of creating ALE plots for piece-wise constant models and categorical features would give a more thorough introduction.

ALE Approximations:

Nice examples. Good theoretical foundation.

$\hat{f}_1(x_1, x_2) = (x_1-4)(x_1-5)(x_1-6) + x_2^3$ is pretty long and uncomfortable to handle. -> Is it possible to show the same effect with a simpler polynomial of degree 3 (e.g. $x_1^3$, or $x_1^3/5$ to flatten it out)?

For the formula $\hat{f}_1 (x_1, x_2) = (x_1-4)(x_1-5)(x_1-6)x_2^3$ the equations are very long and a bit uncomfortable to follow. Same as in Example 1 -> if it is possible to show the same effect with a shorter equation, that would be nice.

I agree that one could show the same effects with simpler equations. However, you have already put a lot of effort into these examples, so in my opinion you can keep them and concentrate on creating more simulations.

Your explanation for the relatively bad fit of the estimated ALE (compared to the theoretical ALE) is that there are not enough data points in the crucial x1 area (between x1=7 and x1=10). It would be interesting to see whether this problem disappears when using 1000 or 10000 data points instead of 100. In my eyes this would prove your explanation.

This would indeed be interesting to see.

Your planned plot 'which shows 50 ALE estimations (on different data samples)' is also a good idea, definitely do that.

I agree that this would be an interesting simulation.

Piece-wise constant models:

A simulation with multiple piece-wise constant prediction functions and ALE plots with different intervals would be very interesting.

Categorical features:

Again, a simulation with two different orderings of features (for a fixed model!) would be very interesting. "1. Kolmogorov-Smirnov distance or frequency tables for categorical features 2. multidimensional scaling"
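
A minimal sketch of how one ordering for such a simulation could be produced. It reads the two quoted items as consecutive steps (pairwise distances between categories, then a one-dimensional scaling), which is an assumption. It also assumes a single numerical other feature and uses classical MDS as the scaling variant; the function names and the toy data are hypothetical.

```python
import numpy as np

def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    pooled = np.concatenate((a, b))
    cdf_a = np.searchsorted(a, pooled, side="right") / len(a)
    cdf_b = np.searchsorted(b, pooled, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def order_categories(categories, other_feature):
    """Order the levels by a 1-D classical MDS of their pairwise KS distances."""
    levels = np.unique(categories)
    m = len(levels)
    groups = [other_feature[categories == lev] for lev in levels]
    d = np.array([[ks_distance(gi, gj) for gj in groups] for gi in groups])
    # Classical MDS: double-centre the squared distances, keep the top eigenvector.
    centering = np.eye(m) - np.ones((m, m)) / m
    b = -0.5 * centering @ (d**2) @ centering
    eigval, eigvec = np.linalg.eigh(b)          # eigenvalues in ascending order
    coords = eigvec[:, -1] * np.sqrt(max(eigval[-1], 0.0))
    return [str(levels[i]) for i in np.argsort(coords)]

# Toy data (assumed): the other feature shifts with the category.
rng = np.random.default_rng(1)
means = {"a": 2.0, "b": 0.0, "c": 1.0}
categories = np.repeat(list(means), 2000)
x2 = np.concatenate([rng.normal(mu, 1.0, 2000) for mu in means.values()])
print(order_categories(categories, x2))
```

The direction of the resulting order is arbitrary because the sign of an eigenvector is not fixed. A second ordering for comparison, for example alphabetical, can then be applied to the same fitted model.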

Outlook:

Ideas for possible solutions are definitely a plus. You do not have to go in-depth as the extent of simulations - if implemented - will be pretty high. A small discussion is sufficient.
