I'm working through the book sequentially and these are my comments on chapter 8. For more minor issues and writing adjustments, I've made direct edits to the text and issued a pull request.
"But, to paraphrase Norman R. Campbell and Jeffreys (1938), not every assignment of numbers is measurement!" <- can we expand on why? Otherwise its a just an assertion.
"This point is obvious when you think about physical measurement instruments: a caliper will give you a much more precise estimate of the thickness of a small object than a ruler. " <- do we want to distinguish between accuracy (validity) here and precision?
"When we have a physical quantity of interest, we can assess how well an instrument measures that quantity." <- hmm. We can only assess how well an instrument measures something by comparing to whatever we consider the gold standard measurement, which still requires an instrument of some kind. And we can do that for both physical measurement and measurement of abstract psychological constructs — so I don't think that's a relevant difference. Presumably the difference is that physical measurement is more direct?
Fig 8.2 'starting age' in legend needs units
"contrasting it with derived measurement which was some function of fundamental measures" - unclear
"It is then left up to the researcher to decide which scale type their proposed measure should belong to" - we haven't yet explained what a 'scale type' is
"Now we can use these measurements to compute the coefficient of variation, which is 0.005" - perhaps briefly say how we interpret that?
"This idea of convergent validity is precisely the circularity of Cronbach and Meehl’s “nomological network” idea" <- well that seems bad! Perhaps a brief comment here to explain why most researchers rely on this framework anyway? (presumably because its the best we can do)
"Face validity" - face validity basically just sounds like 'common sense', but as psychologists we know common sense can be wrong. And I'm sure there are examples where some non-intuitive instrument is actually a good measure of something (though I can't think of any OTH!). We should perhaps point out that face validity is generally considered the weakest of these validity criterion (I think?) for these reasons (I think?).
"Predictive validity. If the measure predicts other later measures of the construct" <- this sounds very similar to convergent validity. I thought predictive validity was predicting other relevant outcomes you'd expect to be related to the construct, rather than measures of the construct itself? e.g., we'd expect a measure of educational achievement to predict future salary.
"Divergent validity. If the measure can be shown to be distinct from measure(s) of a different construct" <- a practical example here would be helpful I think — how do you select a 'different construct' to compare to? [even though we have a more extended example in the next section, inserting brief practical examples for all of these types of validity in these bullet points would be helpful I think, to crystallise what we mean].
Figure 8.7 is too small to be legible!
Accident report - "Talk about flexible measurement!" <- in one sense using a range of different measures is a good thing (triangulation), so perhaps clarify/highlight that the problem here is that it's being done non-systematically and opaquely, with ad-hoc, data-dependent measurement decisions creating the concern about analytic flexibility. We could say something like "Superficially, this might seem like a good thing — if we gather converging evidence from multiple different operationalisations of a construct, doesn't that mean our findings are more robust? Whilst that is true, the problem here is that the measurement decisions are being made in an ad-hoc, data-dependent manner, which creates analytic flexibility and increases the risk of bias. A careful attempt to examine different measurement operationalisations would involve making measurement decisions independent of the data (i.e., preregistering them) and making changes to the measurement instrument systematically (i.e., changing one thing at a time to evaluate its influence on the results)."
"that may be a signal that something has gone wrong" <- a quick example of such a problem might be helpful
Figure 8.8 text is pretty small. What is "short response"? Perhaps add another example to the complex end to balance it out a bit e.g., "body movements" (from the main text)
"Try to make the order reasonable" <- can we make this a bit more helpful, what's a reasonable order?
"We’ll talk in Chapter 12 about manipulation checks and their strengths and weaknesses." - wasn't clear to me how this sidenote relates to the main text its attached to, which is about tricky survey questions
Depth box "Likert scales" <- unclear what the defining feature of a Likert scale is; are bipolar and unipolar scales types of Likert scale?
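For instance, the bipolar/unipolar contrast could be shown with hypothetical response labels (these are illustrative, not from the text):

```python
# Bipolar: runs from a negative pole through neutral to a positive pole
bipolar = ["Strongly disagree", "Disagree", "Neither agree nor disagree",
           "Agree", "Strongly agree"]

# Unipolar: runs from absence of the attribute up to its maximum
unipolar = ["Not at all", "Slightly", "Moderately", "Very", "Extremely"]
```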
"You should also consider the names you give to your scale up front to try to minimize these issues." <- not sure what this means, do we mean the labels?
"It rarely helps matters to add a “don’t know” or “other” option to survey questions" <- though I wonder if its helpful to include these during instrument development because it can flag when your responses are missing something. That's a speculative and tangential point so no need to mention it!
"pragmatically infelicitous negation" <- can we translate that into human?? :)
"On that viewpoint, theories are best tested by observing measurements that they predict but that are low probability according to others." <- needs rewording I think. Perhaps clearer to say observing 'results'. And do we mean according to other people? Or other theories? Bit hard to parse what the 'low probability' part means.
"The more measures you add, the more bets you are making but the less value you are putting on each. In essence, you are hedging your bets and so the success of any one bet is less convincing." <- I'm not really convinced by this as written. Unclear why measuring more means putting less value on each bet. And you're increasing the number of opportunities to be right, but aren't you also increasing the number of opportunities to be wrong, which is more falsifiable — a good thing by Popper's standards
" if you include multiple measures in your experiment, you need to think about how you will interpret inconsistent results." <- I'm also unconvinced by this. I'm certainly very tempted to sign up to this 'simple is best' philosophy (and often adopt it in practice) because inconsistent results are indeed practically annoying, but the reality is inconsistent results, then isn't that just something we have to deal with? Wouldn't it be better to know than not know? As we say elsewhere in the chapter, ignorance is not bliss.
"The fundamental insight of the psychometric perspective is that the constructs we study as psychologists are latent, rather than directly observed. " <- maybe I'm missing the point here but isn't that just obvious? The insights offered by psychometrics are a bit more sophisticated than that (how to measure latent constructs)? [having re-read a few times, I wonder if we actually just mean to say something like "the fundamental issue addressed by the psychometric perspective is how to measure constructs that we cannot directly observe"]
"adoption of defaults" <- not super clear what this means or how it relates back to the chapter — do we mean just relying on measures other people are using even if they haven't been validated?