Closed: RaphaelS1 closed this 1 year ago.
I regularly update the PDF that is linked in the PR comment.
I will also address other things (e.g. coherent vocabulary) and check the other chapters to make sure things are consistent. A few comments on some points, after discussion with Bernd:
> pipeline plots need to be changed

I'll see if I can make the $plot() output prettier, and also change the IDs to make things more readable.
> tables are printed that aren't useful

This should already be a bit better, but I will check again later.
> Avoid examples that deliberately result in errors

I actually disagree on this: since the examples result in errors, there is no chance the user will accidentally mess anything up. But Bernd is not on my side here, so I will take this out.
> Printing the DictionaryPipeOp is not useful

I took this out, but now Bernd asked me to include it again, and I think other chapters also show the content of their dictionaries.
> provided as so-called PipeOps

I changed the wording from "so-called" to "what we call 'PipeOps'", which should make it clear that we are introducing this as a new term here.
> To be consistent with the rest of the book change iris -> penguins

I will likely change the datasets around to make the effect of tuning the preprocessing step more obvious, but I don't know yet if it will end up being penguins or something else.
> Don't need a separate section for $keep_results

I talked to Bernd about this; our conclusion was that debugging is an important topic, so we will keep this section and instead try to move other related things into it.
> Pipelines Hyperparameter

After discussion with Bernd, we decided to keep this.
> Make clear distinction between .train/train and .predict/predict

Are you sure you were not reading the extending.qmd file here? Also, not exactly a constructive comment.
TODOs from Bernd:
> I took this out but now Bernd asked me to include it again, and I think other chapters also show the content of their dictionaries.

They're not meant to; it's now all in an appendix at the end. @berndbischl let's discuss and align on this to prevent confusion, see the appendix I've made here.
> changed the wording from "so-called" to "what we call 'PipeOps'", should make it clear we are introducing this as a new term here

To be consistent with other parts of the book, this could simply be "provided as 'PipeOps'": the quote marks imply a new term, and elsewhere we do the same.
> but idk if it will end up being penguins or something else.

Why not penguins? We should be consistent with other parts of the book unless there's a clear reason not to be.
> Are you sure you were not reading the extending.qmd file here?

Yeah, sorry, I think this might have been for an old version.
> Why not penguins?

We want an example where the benefit of tuning the preprocessing is shown clearly, e.g. where some well-known and simple operation like PCA improves performance noticeably. I don't know if penguins will do that and will need to try out a few things.
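For context, one quick way to screen candidate datasets would be a small benchmark comparing a learner with and without a PCA step. This is only a sketch using mlr3/mlr3pipelines; the sonar task here is a placeholder, not a decided choice:

```r
library(mlr3)
library(mlr3pipelines)

# Compare a plain learner against the same learner preceded by PCA,
# to see whether PCA makes a visible difference on this dataset.
task = tsk("sonar")  # placeholder task; swap in any candidate dataset
learners = list(
  lrn("classif.rpart"),
  as_learner(po("pca") %>>% lrn("classif.rpart"))
)
design = benchmark_grid(task, learners, rsmp("cv", folds = 5))
bmr = benchmark(design)
bmr$aggregate(msr("classif.ce"))
```

If the aggregated classification error differs clearly between the two rows, the dataset is a good candidate for the tuning example.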
@mb706 please choose a dataset which really works, doesn't matter if penguins or not
@RaphaelS1 a dataset working is a clear reason for me. We should not feature examples where the shown operation "doesn't work"; that's somewhat hard to set up, and we should therefore allow deviations.
I agree. I wasn't saying we are forced to stick to one dataset, just that we need to be clear why we make changes when we do, and more importantly, we should document every dataset we use throughout the book so we can explain them in the appendices.
Based on https://github.com/mlr-org/mlr3book/pull/385
Style

Creating PipeOps

- Avoid examples that deliberately result in errors. Assume people will do what you tell them and can debug for themselves (or at least can open issues).
- Printing the DictionaryPipeOp is not useful - it's hard to read and users haven't learnt anything. Better to manually print a table of (non-meta) PipeOps and explain these.

Content
PipeOp: Pipeline Operators

- This is confusing and will isolate readers - I really don't know what 'generalized notion of a function, with a certain twist' means. Simplify the writing without dumbing down the content.
- $train: then this is still actually only one input.
- id is confusing. Just show this in a worked example at some point in the chapter and explain it then; it doesn't need to be taken on its own.
- po output is confusing and needs to be explained to the user - it is not intuitive on its own.
- Don't need a separate section for $keep_results. Just briefly mention it somewhere else (in fact you do this in 'Accessing Pipeline Objects', which is probably sufficient).
- as_learner function". Later you can then put it all together.
- Pipeline Hyperparameters: Too much repetition about paradox and content in the optimization chapter. It would be more interesting just to say that one advantage of treating a graph like a learner is the ability to tune it, and then just show an example with AutoTuner.
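For reference, the kind of single AutoTuner example suggested here could look roughly as follows. This is a sketch, not the chapter's actual code; the task, tuner, budget, and tuned parameters are placeholder assumptions:

```r
library(mlr3)
library(mlr3pipelines)
library(mlr3tuning)

# Treat the whole graph as a learner, then tune a preprocessing
# hyperparameter (the PCA rank) jointly with the model.
glrn = as_learner(po("pca") %>>% lrn("classif.rpart"))
glrn$param_set$values$pca.rank. = to_tune(1, 10)
glrn$param_set$values$classif.rpart.cp = to_tune(1e-3, 0.1, logscale = TRUE)

at = auto_tuner(
  tuner      = tnr("random_search"),
  learner    = glrn,
  resampling = rsmp("cv", folds = 3),
  measure    = msr("classif.ce"),
  term_evals = 20  # placeholder budget
)
at$train(tsk("sonar"))  # placeholder task
```

The point being made in the review: one compact example like this covers both "graph as learner" and "tuning preprocessing" without repeating the paradox material.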
- IDs and Name Clashes: firstly this duplicates your discussion above about IDs (though that should be deleted), and secondly it doesn't need to be a section. Instead just show a worked example where you have to manually set IDs and add a comment explaining why you did this. Or we could even add an emoji in the margin for something like 'important concept' (e.g., ❗).
- Advanced Graphs: remember that for many users even the 'simple' thing is advanced. So I'd suggest changing the name to something like 'Non-linear Graphs'.
- Bagging: nowhere in the section does it say you have a bagging pipeline. But I suspect this will go in 'Common Patterns and ppl()'? Need to explain innum. A more interesting example might be to use classifavg for multiple different models; then you can say the bagging pipeline is shorthand for when all models are the same (but I will leave the choice to you).
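A minimal sketch of the classifavg suggestion, averaging two different models (the learners and task are placeholders, not a recommendation):

```r
library(mlr3)
library(mlr3pipelines)

# Average the probability predictions of two different learners.
# A bagging pipeline is then just the special case where all models
# are the same learner, each trained on subsampled data.
graph = gunion(list(
  lrn("classif.rpart", predict_type = "prob"),
  lrn("classif.featureless", predict_type = "prob")
)) %>>% po("classifavg", innum = 2)

glrn = as_learner(graph)
glrn$train(tsk("sonar"))  # placeholder task
```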
- PipeOpLearnerCV and Stacking: Section broken?
- Stacking: what does NULL mean in your figure? Arrows inside the figure might help to understand it. Be careful about throwing in new terms without explaining them, e.g. 'overfitting'. Need to explain "level 0 learner" and PipeOpNOP in more detail; they're complex to understand. The output at the end is a really nice way to show this, but could you consider a different output method, e.g. plotting the decision tree? It will be simpler to follow and will look better when printed.
- Tuning Combined Spaces: duplicative of the hyperparameter content above. As said above, you just need one AutoTuner example and you're done.
- Tuning Alternative Paths: what do you mean by 'only one path will be executed, the others are ignored' for branching? I'm not sure the figure helps explain this; maybe use arrows to make the figure clearer? I think the example is too hard to follow; it doesn't help me understand when to use branching or when to use nop.
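For the branching discussion, a minimal example may help make 'only one path is executed' concrete. This is a sketch; the task and option names are assumptions:

```r
library(mlr3)
library(mlr3pipelines)

# branch routes the data down exactly one of the named paths at run
# time; the "nop" path applies no preprocessing, the "pca" path
# applies PCA. unbranch merges the paths back together.
graph = po("branch", options = c("nop", "pca")) %>>%
  gunion(list(po("nop"), po("pca"))) %>>%
  po("unbranch", options = c("nop", "pca")) %>>%
  lrn("classif.rpart")

glrn = as_learner(graph)
# which path runs is itself a hyperparameter, so it can be tuned:
glrn$param_set$values$branch.selection = "pca"
glrn$train(tsk("sonar"))  # placeholder task
```

The nop path is what makes "no preprocessing" a tunable alternative alongside PCA.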
- Tuning PipeOpProxy: TODO.
- "ppl()": very important section, though currently it looks like the wrong content is there?
- Multilevel Stacking: I like the idea of showing one very complicated example, but I think it would be better if you put this in the conclusion of the chapter and before it wrote something like "and now we'll show you the true power of pipelines".
- Graph, The Whole Story: change the name to something a bit more intuitive like 'Technical Graph Details' and we'll mark the section with an advanced emoji in the margin. Make clear in the intro to this section that it is advanced and also why someone might want to read it (i.e., what extra content do you teach). Just be careful not to write too much or it will distract from the rest of the book.
- Specific PipeOps: probably makes sense to move this before the previous section. Also, in the intro make clear why you're talking about these ones specifically. This section is nice; it gives a good quick overview of PipeOps and shows how to use them.
- PipeOpChunk section

General