My holistic view of the chapter is that the content and structure are really strong, but the exposition remains rather abstract and would benefit from weaving through more concrete examples. We could also add code boxes working through a test-retest calculation or a Cronbach's alpha calculation on an example dataset? (or make a version of Figure 8.6 from real data, plotting paired test-retest datapoints with lines connecting them?)
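As a possible starting point for the Cronbach's alpha box, here is a minimal Python sketch. The book may well use a different language, the dataset is made up purely for illustration, and `cronbach_alpha` is a hypothetical helper name rather than anything from the chapter. It uses the standard formula: alpha = k/(k-1) * (1 - sum of item variances / variance of total scores).

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of participants, each a list of k item scores."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    # Sample variance of each item, computed across participants.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each participant's total (summed) score.
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 participants answering 4 items on a 1-5 scale.
example = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [1, 2, 1, 2],
]

alpha = cronbach_alpha(example)
print(round(alpha, 3))  # about 0.965 for this toy dataset
```

Alpha is high here because the made-up items move together closely across participants; a real example dataset would presumably be messier.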
Below are some relatively small comments from a read-through.
Clarity: 'When we have a physical quantity of interest, we can assess how well an instrument measures that quantity. But things are much trickier when the construct we are trying to measure can’t be assessed directly'
This struck me as a bit circular and confusing this time around; I wasn't sure what 'directly assessing' physical measurement meant here. Ultimately, we're arguing that diagnostics on the instrument are our main way to assess, in which case psychological constructs aren't that different from physical ones? Maybe the difference we're trying to highlight is the layer of 'operationalization' in between the construct and the instrument reading? In which case we can rephrase?
Clarity: 'We are also going to talk in Chapter 9 about the validity of manipulations. The way you identify a causal effect on some measure is by operationalizing some construct as well. If this is done badly, the manipulation can be invalid – meaning the causal effect that’s measured doesn’t map onto the construct.'
The second sentence here was hard to parse (maybe change to 'To identify causal effects, we must link a particular construct of interest to something we can concretely manipulate in an experiment, like the stimuli or instructions.')
Clarity: 'Because bringing children into a lab can be expensive, one popular option for measuring child language [in their own homes] is the MacArthur Bates Communicative Development Inventory (CDI for short), a form which asks parents to mark words that their child says or understands.'
'The combination of reliability and validity evidence suggests that CDI[s] are a useful (and relatively inexpensive source)'
careful to use consistent pluralization?
Content: 'A reliable and valid measure of children’s vocabulary' box
I realized that parts of this may be too technical given that we haven't introduced these measures and techniques yet (the test-retest reliability plot, the structural equation modeling, the correlation with shared variance). Could we nudge up the level of abstraction just a little to avoid readers feeling like they missed something they were supposed to already know? (e.g. describe it more conceptually, with less technical language?)
I might also remove Figure 8.3: Relations? (the SEM is pretty opaque and it's bumping other figures below where they appear in text).
Content: 'early controversies' box
I might move this to near the end of the chapter rather than having it at the top of 'reliability'?
Clarity: 'These scales are common for physical quantities but actually quite infrequent in psychology'
Aren't effect sizes measured on ratio scales (per Narens & Luce)? Psychology does do a lot of measuring effects as magnitudes with a meaningful zero? We do kind of talk about one effect being about twice the magnitude of another?
Layout: The Table 8.1 caption is way off in the side bar.
Content: "When noise is high, then the denominator is going to be big and will go down to 0; when noise is low, the numerator and the denominator will be almost the same and will approach 1."
It would be nice to introduce the more general concept of a 'noise ceiling' here (e.g. in cog neuro, the noise ceiling is how well you can predict one participant from other participants, and model performance is taken relative to that, so the principle is that you can define any relative comparison to a meaningful 'ceiling' of noise, and the overall variation, \sigma_o, is a particularly common choice?)
oops, just noticed this is in footnote 11 down below; I'd just move this up?
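A tiny numeric illustration of the quoted sentence might also help here. The sketch below assumes the ratio in question is the classical one, true-score variance divided by true-score variance plus noise variance; the function name and the particular numbers are my own illustrative choices, not the chapter's.

```python
def reliability(true_var, noise_var):
    """Share of observed variance that is due to true scores (assumed formula)."""
    return true_var / (true_var + noise_var)


# Hold true-score variance at 1 and vary the noise variance.
for noise_var in (0.01, 1.0, 100.0):
    print(noise_var, round(reliability(1.0, noise_var), 3))
# Low noise gives a ratio near 1; high noise pushes it toward 0.
```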
Clarity: 'computing the correlation between the two [sets of] scores'
this subsection is a little abstract and it might be nice to add an example to make it more concrete? e.g. 'suppose we've written two copies of an exam to measure a student's knowledge of the material. If every student comes back to take the second exam, we can ask how similar each student's scores are between the two exams, or take the correlation between the two sets of scores.'
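If we go the code-box route, the exam example could be sketched roughly as below. Everything here is an assumption for illustration: the scores are simulated (not real data), each exam is modeled as the student's true knowledge plus independent noise, and `pearson_r` is a hand-rolled helper so that the block has no dependencies.

```python
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the simulation is reproducible

# Simulated "true" knowledge for 200 hypothetical students.
true_knowledge = [rng.gauss(70, 10) for _ in range(200)]

# Each exam score is true knowledge plus independent measurement noise.
exam_1 = [t + rng.gauss(0, 5) for t in true_knowledge]
exam_2 = [t + rng.gauss(0, 5) for t in true_knowledge]

# Test-retest reliability: the correlation between the two sets of scores.
retest_r = pearson_r(exam_1, exam_2)
print(round(retest_r, 2))
```

With these simulation settings the theoretical reliability is 100 / (100 + 25) = 0.8, so the printed correlation should land in that neighborhood, which also ties the example back to the variance-ratio definition.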
Clarity: 'One rule of thumb that’s helpful for individual difference designs of this sort is that the maximal correlation that can be observed between two variables and is the square root of the product of their reliabilities:'
Is there a citation for this? It's a handy rule of thumb and people might want to read more justification for it.
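I believe this is the classic attenuation result, usually traced to Spearman (1904), though the exact citation is worth double-checking. A short justification could go alongside it. In the sketch below the notation is my own: x and y are the two measures, r_xx and r_yy are their reliabilities, and measurement errors are assumed to be uncorrelated with each other and with the true scores.

```latex
% Observed correlation is the true-score correlation shrunk by unreliability:
r_{xy}^{\text{obs}} = r_{xy}^{\text{true}} \sqrt{r_{xx}\, r_{yy}}

% Because |r_{xy}^{\text{true}}| \le 1, the observed correlation is bounded:
\left| r_{xy}^{\text{obs}} \right| \le \sqrt{r_{xx}\, r_{yy}}
```

For example, two measures with reliabilities of 0.8 and 0.5 could show an observed correlation of at most sqrt(0.4), roughly 0.63, even if the underlying constructs were perfectly correlated.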
Mention in section 8.1.3 the other problem with test-retest, which is that people, unlike rocks, have memory and the samples are not independent? If you measure twice in a row using very similar instruments, you'll probably overestimate the reliability because people will be biased toward what they said the first time?
Layout: bold 'nomological network'?
Content: add concrete examples for 'face validity', 'ecological validity', 'internal validity', and so on?
Typo: 'In addition, the CDI shows good ~concurrent~ [convergent] and predictive validity. ~Concurrently~ [In terms of convergence], the CDI ...'
Clarity: '~Across trials, both the volume and duration of the noise blast were sometimes analyzed.~ [Sometimes the analysis focused on the volume of the noise blast and sometimes it focused on the duration.]'
Typo: 'when ~if~ researchers adopt the CRTT'
Typo: '~The decision~ [Deciding] whether to'
Clarity: 'how to figure out people’s ability abstracted away from specific items' is hard to parse.