Open ktoddbrown opened 7 years ago
I'm not sure how we could optimize the power analytically, but for any measurement scheme we propose, we can measure it by simulation. Given the random nature of all these tests, fine-grained distinctions usually aren't worth it and require a ton of computation time.
Plus, the models should recover the parameters in their posterior intervals, but we want those posterior intervals to be as narrow as possible---that is, we want to identify the parameters tightly. The posterior interval width shrinks like O(1 / sqrt(N)) for N i.i.d. observations, but we're dealing with a time series, so I'm not sure what happens. I'll bring this up with Andrew and others at our meeting Thursday to see if anyone already knows the answer or has a suggestion on how to proceed. I know they do power tests to design clinical trials in pharmacological compartment models.
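To make the scaling concrete, here's a toy sketch (not our model): for N i.i.d. normal observations with known noise scale and a flat prior on the mean, the posterior is normal(ybar, sigma / sqrt(N)), so the interval width falls off like 1 / sqrt(N). The parameter values are made up.

```python
# Toy sketch: the posterior interval width for the mean of N i.i.d. normal
# observations with known sigma and a flat prior shrinks like 1 / sqrt(N).
import numpy as np

rng = np.random.default_rng(1234)
sigma = 1.0      # assumed measurement noise scale (hypothetical)
true_mu = 0.5    # assumed true parameter (hypothetical)

for N in [10, 40, 160, 640]:
    y = rng.normal(true_mu, sigma, size=N)
    post_mean = y.mean()                      # posterior mean under a flat prior
    width = 2 * 1.645 * sigma / np.sqrt(N)    # central 90% interval width
    print(f"N = {N:4d}  post mean = {post_mean:.3f}  90% width = {width:.3f}")
# Each 4x increase in N roughly halves the interval width.
```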
On Jan 18, 2017, at 2:50 PM, Kathe Todd-Brown notifications@github.com wrote:
This is something I definitely want to be central to the manuscript: How frequently do you need to measure to get good estimates on the parameters for the first order linear models?
I think a reasonable sampling frequency from a lab work point of view is once a day (maybe twice if there is a lot of statistical power in that second measurement). In theory, you could do this throughout the incubation (yea undergrad labor!) but in practice sampling frequency tends to fall off the longer the study runs.
I've been proposing daily samples for the first 1-2 weeks, weekly samples for the first 1-2 months, and then monthly samples thereafter as sort of a gut reaction when asked. However, I would love to have some actual statistics to back up this knee-jerk reaction.
From an experimental standpoint the question is: are daily measurements enough, especially at the beginning of the incubation? Generally only a finite number of samples can be processed at any one time; however, there is expensive ($$$$$$) equipment to automate gas draws and provide hourly or finer resolution. Is it worth getting that higher temporal resolution early in the experiment? Or is a coarser time resolution adequate?
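One way to make these candidate schemes concrete for a simulation-based comparison is to write each one down as a vector of measurement times and feed them all to the same simulate-and-fit test. A hypothetical sketch (the durations and schedule shapes are made up, not a fixed design):

```python
# Hypothetical sketch: encode candidate measurement schedules (times in days)
# so each can be fed to the same simulate-and-fit power test.
import numpy as np

def schedule_daily(total_days=120):
    """One measurement per day for the whole incubation."""
    return np.arange(1, total_days + 1, dtype=float)

def schedule_hourly_early(total_days=120, fine_days=14):
    """Hourly automated draws for the first two weeks, then daily."""
    fine = np.arange(1, fine_days * 24 + 1) / 24.0
    coarse = np.arange(fine_days + 1, total_days + 1, dtype=float)
    return np.concatenate([fine, coarse])

def schedule_tapered(total_days=120):
    """Daily for 2 weeks, weekly to day 60, monthly afterwards."""
    return np.concatenate([np.arange(1, 15, dtype=float),
                           np.arange(21, 61, 7, dtype=float),
                           np.arange(90, total_days + 1, 30, dtype=float)])

for name, sched in [("daily", schedule_daily()),
                    ("hourly early", schedule_hourly_early()),
                    ("tapered", schedule_tapered())]:
    print(f"{name:12s} {len(sched):4d} measurements")
```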
Do the automated gas draws have better or worse measurement error than when people do it? How many measurements you need is inversely related to how good those measurements are.
My guess from all the other things I've seen like this is that coarse measurements will be sufficient. More replicates with fewer measurements per replicate are even better if we want to infer population parameters. This goes back to the work Andrew did on serial dilution assays (in BDA) and also everything else I've seen. For example, this study of measuring pediatric lung clearance (they gas the kids, then see how long it takes their lungs to clear; like carbon, it's a compartment mixture model, only we know the kids have two lungs and the diff eq has an analytic solution):
https://arxiv.org/pdf/1612.08617.pdf
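For concreteness, here's a minimal sketch of the kind of forward model under discussion: two pools decaying at first-order rates, CO2 efflux as the sum of the two decay fluxes, and multiplicative measurement noise. The pool sizes, rates, and noise level are all made up for illustration.

```python
# Hypothetical two-pool first-order decay: CO2 efflux is the sum of the two
# decay fluxes; measurements get multiplicative (lognormal) noise.
import numpy as np

def co2_flux(t, c1=60.0, c2=940.0, k1=0.05, k2=0.001):
    """Analytic efflux rate of a two-pool model (made-up units and values)."""
    return c1 * k1 * np.exp(-k1 * t) + c2 * k2 * np.exp(-k2 * t)

def simulate_measurements(times, cv=0.05, seed=0):
    """Simulate noisy flux measurements at the given times (days)."""
    rng = np.random.default_rng(seed)
    mu = co2_flux(np.asarray(times, dtype=float))
    return mu * rng.lognormal(mean=0.0, sigma=cv, size=len(times))

daily = np.arange(1, 121, dtype=float)
print(simulate_measurements(daily)[:5])   # first five simulated daily draws
```

One could then run the power comparison by simulating under each candidate schedule, fitting, and checking how tight the posteriors on k1 and k2 come out.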
Milad---did you manage to track down any of the PK/PD experimental design stuff we were talking about in the Stan meeting the other week?
I can give you the experimentalist's perspective. I agree that automated systems are $$ and they often reduce the number of replicates used. When people are doing the gas draws and injections, we limit the number of injectors; ideally, the same person does all the injections. There is bias in our measurements, but we are usually focusing on treatment differences. I haven't directly compared human vs. automated injections, though.
If this is interesting to you, I could give you a spreadsheet of one person's repeated measurements of a set of gas standards over time. Every time we sample the experiment, we draw 5 times each from 3 different CO2 standard tanks. However, there will be error due to instrument drift with such a dataset.
Bob---I thought the conclusion of that meeting was that there is no easy way to do it, and the only way would be a grid-like test (which is what I'm doing right now and will send out a report soon). Do you remember which work Bill was mentioning in the meeting? I didn't take notes unfortunately.
Yes, France Mentre, who does PKPD, has done a bunch of work with students and postdocs on power analysis for PKPD models. Those are also compartment ODEs. I don't think we got a specific paper recommendation. She's been using Stan to do this recently. Like this one:
or this one:
http://www.hal.inserm.fr/inserm-01076940/document
We can ask France what to read, too.
When you say there's bias in the measurement, do you mean in the statistical sense of the measurements being on average lower than they should be or on average higher than they should be? Is that what you mean by instrument drift? That can cause real problems. The nice part about Bayesian modeling is that if we know bias is there, we can try to measure and model it. And it's important to do so if there's real bias that affects estimates.
Aside from this project, I'm interested in calibrating measurements, be they diagnostic tests in epidemiology or human coders in a machine-learning data set creation task. This can be done with straight-up measurement-error models where you assume there's a (latent) true value being measured and each person (test, etc.) gives you a measurement. Often you can estimate the accuracy and biases of the measurements this way and adjust for them. To get off the ground at all, we need replicates that are supposed to have the same measurements, and if there's noise in creating the samples, then we need replicates to calibrate the inter-sample noise.
One thing you could do is have a different person do every other measurement and see if they lead to consistent estimates. If not, you know there are problems with your measurement instruments (in this case, the people doing the lab work).
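Here's a hypothetical sketch of that alternating-measurers check, assuming each sample has a latent true value and each person adds their own bias plus noise; the sample sizes, biases, and noise level are all invented.

```python
# Hypothetical alternating-measurers check: each sample has a latent true value;
# persons A and B each measure every sample with their own bias plus noise.
# The mean difference A - B estimates the bias gap between them.
import numpy as np

rng = np.random.default_rng(42)
n_samples = 30
true_vals = rng.normal(400.0, 50.0, n_samples)   # latent [CO2], made-up scale

bias = {"A": 0.0, "B": 8.0}                      # hypothetical person biases
noise_sd = 5.0

meas = {p: true_vals + bias[p] + rng.normal(0, noise_sd, n_samples)
        for p in ("A", "B")}

diff = meas["A"] - meas["B"]
print(f"estimated bias gap (A - B): {diff.mean():+.1f} "
      f"+/- {diff.std(ddof=1) / np.sqrt(n_samples):.1f}")
# Without paired measurements like this, person bias and instrument drift
# are confounded.
```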
Oh dear, I've opened a bit of a can of worms. When I mentioned bias, it was in the first sense that you mentioned above: there might be some error associated with the person taking the measurement that would affect accuracy (avg. higher or lower than it should be). Since we often conduct manipulative experiments under controlled conditions, we tend to focus on the precision of our measurements. In other words, I guess we focus mostly on being able to differentiate treatment effects. Thus, having just one person collect the measurements ought to give more precise estimates. But I don't think anyone has quantified what the bias is, if any. In soil ecology (and ecology in general), we have a big struggle with both low replication (because of cost, time, etc.) and the fact that the biological systems we work with are very heterogeneous and hard to measure consistently. I hope I'm using accuracy and precision in the correct statistical sense.
I mentioned instrument drift in a different context. I was thinking of ways we could use existing data to look at the error associated with the person taking the measurement. One possibility is to look at different persons' measurements, over time, of the same sample. Everyone in my lab has to use the same set of standards when they measure headspace CO2 from jars. Now, these standards are what we use to correct for instrument drift, so with the approach I mentioned I don't think we could separate person vs. instrument error. I think what you mentioned about alternating persons is better, but I don't have any existing data like that.
I do want to make it clear that we do account and correct for instrumental drift in our CO2 measurements. I'd be happy to show you our protocol, or explain it over skype/phone if you like.
On Feb 15, 2017, at 11:48 AM, SeanSchaeffer notifications@github.com wrote:
Oh dear, I've opened a bit of a can of worms. When I mentioned bias it was in the first sense that you mentioned above. There might be some error associated with the person taking the measurement that would affect accuracy (avg. higher or lower than it should be).
I want to clearly separate the issues of bias and variance. Let's say I have a scale, as it's an easy measurement example to imagine. If the expected value is the true value, it's unbiased. It might read a basket of vegetables that weighs 1.5 pounds as 1.5 plus or minus normal(0, 0.1) noise. That would have some noise, but it'd be unbiased, because the noise is centered around 0. Now if it reads 1.5 plus or minus normal(0.1, 0.1), then it'll be biased to the high side---the expected measurement is 1.6 pounds rather than 1.5 pounds. But these both have the same variance. On the other hand, if I have a scale that returns 1.5 pounds plus or minus normal(0, 0.5) pounds, that's much noisier (higher variance) than one that returns 1.5 pounds plus or minus normal(0, 0.01) pounds, but both are unbiased.
So bias is systematic error in one direction or the other. A grocer putting their thumb on the scale will bias the analyses to the high side.
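Here's a quick numerical version of the scale example, with the same made-up numbers as above:

```python
# Three scales weighing the same 1.5 lb basket many times: one unbiased with
# low noise, one biased high, one unbiased but noisy.
import numpy as np

rng = np.random.default_rng(7)
true_weight = 1.5
n = 100_000

scales = {
    "unbiased, low noise": true_weight + rng.normal(0.0, 0.1, n),
    "biased high":         true_weight + rng.normal(0.1, 0.1, n),
    "unbiased, noisy":     true_weight + rng.normal(0.0, 0.5, n),
}

for name, readings in scales.items():
    print(f"{name:22s} mean = {readings.mean():.3f}  sd = {readings.std():.3f}")
# Bias shows up in the mean; variance shows up in the sd.
```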
Since we often conduct manipulative experiments under controlled conditions, we tend to focus on the precision of our measurements. In other words, I guess we focus mostly on being able to differentiate treatment effects. Thus, having just one person collect the measurements ought to give more precise estimates.
That's a common misconception. It'll give the appearance of more precise estimates, but your uncertainty won't be well calibrated. That's because you're leaving the variation among the population of measurers out of the uncertainty in your parameter estimates. Whenever you ignore uncertainty in the underlying process, you underestimate the uncertainty in your answers.
But I don't think anyone has quantified what the bias is, if any.
It's an important problem that usually gets swept under the rug. That's why I'm interested in it in general. I have no idea about the process here, but it's a huge deal for humans curating gold-standard data sets in machine learning, and it's obviously a huge deal in medical diagnostic tests (like X-rays, MRIs, puncture tests, or blood tests).
In soil ecology (and ecology in general), we have a big struggle with both low replication (because of cost, time, etc.) and the fact that the biological systems we work with are very heterogeneous and hard to measure consistently. I hope I'm using accuracy and precision in the correct statistical sense.
See above. For classical point estimators, we tend to decompose error into a variance term and a bias term: the variance plus the squared bias gives you the expected squared error of the estimates.
In a Bayesian model, our answers are themselves probabilistic, and we're instead worried about calibration. That is, if we say we're 50% sure a parameter is in a range, then we want that to be right 50% of the time (give or take sampling error). That's just like when the weather person predicts rain---if they say 10% on each of 100 days, the number of rainy days should look like a binomial(100, 0.1) draw if their model is well calibrated (you expect 10 of those days to be rain, but there will be noise).
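Here's a toy coverage check of that idea under a conjugate normal model (not our soil model): draw a "true" parameter from the prior, simulate data, compute the central 50% posterior interval, and count how often it covers the truth.

```python
# Toy calibration check: 50% posterior intervals should cover the true value
# about half the time across simulated data sets (conjugate normal model).
import numpy as np

rng = np.random.default_rng(99)
sigma, N, n_sims = 1.0, 25, 2000
covered = 0
for _ in range(n_sims):
    mu = rng.normal(0.0, 1.0)                    # "true" parameter drawn from the prior
    y = rng.normal(mu, sigma, N)
    post_sd = 1.0 / np.sqrt(N / sigma**2 + 1.0)  # posterior sd (normal(0, 1) prior)
    post_mean = post_sd**2 * y.sum() / sigma**2
    half = 0.674 * post_sd                       # central 50% interval half-width
    covered += (post_mean - half <= mu <= post_mean + half)
print(f"50% interval coverage: {covered / n_sims:.2f}  (expect about 0.50)")
```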
I mentioned instrument drift in a different context. I was thinking of ways we could use existing data to look at the error associated with the person taking the measurement. One possibility is to look at different persons measurements, over time, of the same sample. Everyone in my lab has to use the same set of standards when they measure headspace CO2 from jars. Now, these standards are what we use to correct for instrument drift. So, with the approach I mentioned I don't think we could separate person vs. instrument error. I think what you mentioned about alternating persons is better, but I don't have any existing data like that.
Unless we can get a handle on device error, we just have to treat the machine plus the human as a compound measurement device with a combined error. That we can measure with replicates, but if they're not exact replicates, you need more of them to tease the noise apart.
I do want to make it clear that we do account and correct for instrumental drift in our CO2 measurements. I'd be happy to show you our protocol, or explain it over skype/phone if you like.
If you have a writeup somewhere, I'd love to see it. Otherwise, yes, an online thing would be good (Google hangouts? I can't get Skype to work on my Mac).
Thank you for clarifying things.
Respiration-IRGA_MEedits.docx.docx
Here is the latest iteration of our gas sampling protocol. This is my first time attaching files on GitHub, so let me know if it doesn't work.
Essentially, we run a stream of CO2-free air through an infrared gas analyzer (IRGA). We then collect samples using a syringe from incubation jars or standard gas cylinders. Syringe draws are injected into the stream and the peak height is recorded. The injected standards are then used to convert peak height to [CO2]. The standards also help us correct for instrument drift over time (due to changes in air pressure and temperature).
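As a rough sketch of what the conversion step amounts to (the numbers are invented and the drift handling shown is simplified, not the exact protocol): fit a line to the standard injections and apply it to the sample peaks.

```python
# Rough sketch of peak height -> [CO2]: fit a straight line to the standard-tank
# injections, then apply it to the sample peaks. All numbers are invented.
import numpy as np

# 3 standard tanks x 5 draws each (ppm CO2 and measured peak heights)
std_ppm = np.repeat([500.0, 2000.0, 5000.0], 5)
std_peak = np.array([12.1, 11.9, 12.0, 12.2, 11.8,
                     48.3, 47.9, 48.1, 48.5, 48.0,
                     120.6, 119.8, 120.2, 121.0, 119.9])

slope, intercept = np.polyfit(std_peak, std_ppm, deg=1)   # ppm per unit peak height

sample_peaks = np.array([35.2, 60.4, 88.1])
print(np.round(slope * sample_peaks + intercept, 1))      # estimated [CO2] in ppm
# In this sketch, drift would be handled by refitting the line for each batch of
# injections; the real correction uses the repeated standards over time.
```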
Thanks --- attachment worked fine.