[x] The point of the paragraph "So far, this book has focused on detailed explanation of the mlr3 universe of packages, which is abstracted in such a way that allows users to choose the level of complexity that suits each project." is unclear to me. The text suggests that if you're using only mlr3 you can only work with small (sub)sets of the data. The whole paragraph seems redundant.
[x] Second paragraph mentions "deep learning architectures, image analysis, multi-label classification" as supported, but doesn't give any details on how to do this. I would move this to the end of the chapter and provide pointers to relevant packages. Packages "waiting to be developed" shouldn't be mentioned; otherwise we might as well say that, taking future development into account, mlr3 will be the best and most popular ML package ever.
[x] Not sure about the fictional company example in 8.1. An insurance company wouldn't give out loans, and we don't have similarly "chatty" examples elsewhere.
[x] In the benchmarking example in 8.1.1 I would not use `paste0` and simply spell out all the learners. This makes it unnecessarily difficult to understand, especially as `paste0` isn't explained.
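A possible replacement the authors could use for this item (the learner ids below are illustrative, not copied from the chapter):

```r
# Programmatic construction, as in the book, hides the learner names:
ids_paste = paste0("regr.", c("featureless", "rpart", "ranger"))

# Spelling the ids out makes the benchmark design readable at a glance:
ids_explicit = c("regr.featureless", "regr.rpart", "regr.ranger")

# Both produce the same vector; only the explicit form is self-documenting.
identical(ids_paste, ids_explicit)  # TRUE
```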
[x] Worth pointing out at the end of 8.1.1 that a positive loss means a loss for the company, i.e. none of the models performs as well as we would like.
[x] 8.2. By example -> For example
[x] 8.2.1 "Predicting a confidence interval is possible using only features in mlr3" -- don't understand this. Why would it not be possible using only features? What else could we use to make the prediction?
[ ] 8.2.1 Where does the 95% confidence interval come from? Is this arbitrary, or is it what the `se` prediction type is always guaranteed to return? Also briefly mention the relationship to `quantile = 1.96`.
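A short note the authors could add here; this sketch assumes the interval is built via the usual normal approximation (my assumption, not stated in the chapter):

```r
# Under a normality assumption, a 95% confidence interval is
# mean ± q * se, where q is the 0.975 quantile of the standard normal:
q = qnorm(0.975)
round(q, 2)  # 1.96

# So quantile = 1.96 corresponds to the conventional 95% level;
# other levels follow from other quantiles, e.g. qnorm(0.95) for 90%.
```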
[ ] "Analogously to classification, we might think of the top prediction as “P(Y = 30) = 0.4” versus the fourth prediction as “P(Y = 23) = 0.7” (numbers not exact). This level of detail allows more nuanced metrics to be utilized to measure probabilistic predictions" -- not clear to me what this means. I can do the same thing without confident intervals. Why does it help to think of predictions this way?
[ ] Example in 8.2.2 should continue the one above (i.e. mean and se from a real model) instead of arbitrary numbers.
[ ] Not sure about 8.2.2 in general. I found parts of it very confusing. Getting mean and se predictions from a model relies on what the model does internally, which involves distributional assumptions. The section seems to suggest that we can simply ignore that, fit arbitrary distributions, and, more confusingly, have different learners predict the mean and the se. In the final example, why would I use a featureless learner to get the se instead of simply getting it from the random forest that I've already trained anyway? More motivation is needed for this section: why would I want to do this, and what problem is it addressing/solving?
[x] 8.3. Why is there a footnote "survival analysis"?
[x] 8.3. censoring events -> censored events
[x] 8.3.1 what are the red crosses in the `autoplot`?
[x] 8.3.2 says that all possible prediction types are returned automatically, then goes on to say that this isn't actually true for `response`.
[x] 8.3.2.2 Brief explanation of what the numbers mean would be good. Graph and numeric predictions inconsistent (graph stops at about 52).
[x] 8.3.2.3 Unclear what is actually being predicted -- is it literally a weight vector for the features? Why is this a prediction and not part of the model that can be inspected after training (like for regression)? In particular as it's not supported by most learners -- is this really just a different way of inspecting the model?
[x] "the difference between values has meaning but should not be over-interpreted" needs more explanation. Can I, in this example, say that the second rat is an order of magnitude more likely to die than the first or would that be an "over-interpretation"?
[x] 8.3.3 show example using the model trained above.
[x] 8.3.4.1 `crank = lp`: isn't `lp` a weight vector (see comment above)?
[x] 8.3.4.2 "You may want to use the first pipeline to overwrite the default method of transforming distributions to rankings." -- can't say that if not also showing how to do this.
[x] 8.3.5 explain briefly what the different measures mean. Also again recommend against `paste0`; instead spell out the measures.
[x] 8.4. Is the density estimate only for a single variable, i.e. feature? What if there are multiple features in the data?
[x] 8.4. explain where the name `mlr3proba` comes from when it is first mentioned.
[x] 8.4.2 no `autoplot` function for this?
[x] 8.4.4 ref to ranger is broken.
[x] 8.4.4 suggest showing the actual distribution objects to highlight that they are the same in the first and different in the second example. Also explain the difference between `Distribution` and `VectorDistribution`.
[x] 8.5.2 mention that `predict` is equivalent to `assign` and thus does make sense.
[x] 8.5.2 the code examples have the same comment "using same data for estimation" as "rare" and "common" use case -- explain the difference.
[x] 8.5.3.1 and 8.5.4 overlap, i.e. the former is also about visualization. I would merge these two sections and at the end of 8.5.3 mention that visualization (covered below) is often a better way of assessing the result of a clustering.
[x] 8.5.4 instead of just PCA I would say that dimensionality reduction techniques are important for visualizing, for example PCA.