Closed infotroph closed 7 years ago
I'm now suspecting K-S is too stringent a test -- with 20000 samples, we have nearly infinite power to resolve differences between distributions. For example, here are violin plots of all HMC samples for model intercepts, filtered to show only midsummer of each year:
ggplot(
(intercept_chains
%>% gather(crop, intercept, -year,-session)
%>% filter(session=="4"|!(year %in% c("2010", "2012")))
%>% mutate(annual = if_else(crop=="Maize-Soybean" & year %in% c("2010", "2013"), "Soybean", "Maize")),
aes(year, intercept, fill=annual))
+ geom_violin()
+ facet_wrap(~crop)
+ theme_ggEHD()
+ scale_fill_grey()
+ theme(legend.title=element_blank(), legend.position=c(0.16, 0.62))
)
Notice that the distributions for maize in 2011 and 2012 are near identical -- yes, 2012 is longer tailed and the median is arguably a touch greater, but I'd argue that the correct interpretation is that the posterior uncertainty in the intercept term is greater than the difference between the years. However:
ms11 = subset(intercept_chains, year=="2011" & session=="4")$`Maize-Soybean`
ms12 = subset(intercept_chains, year=="2012" & session=="4")$`Maize-Soybean`
ks.test(ms11, ms12)
# D = 0.13965, p-value < 2.2e-16
Those are nice plots. And you are right about the sensitivity of the ks test to the sample size; it is used to test if samples come from the same distribution, (and we know these don't!). In any case, it doesn't answer the correct question. I think here you want to know 'are the means different' and this is where the Kruske 2013 "Bayesian estimation supersedes the t-test" provides an intuitive approach to asking questions about the posterior parameter estimates represented by the mcmc sample chains.
Something like
are the intercept means different in 2012 and 2014? _if you compare 2011 to 2014 p = 0.0504 so I am comparing 2012 and 2014 because the difference is more clear
# first using data.table since I am more familiar with it
library(data.table)
beta <- intercept_chains %>% setDT
quantile(beta[year == 2012, ]$Miscanthus - beta[year == 2014, ]$Miscanthus, c(0.025, 0.975))
# 2.5% 97.5%
#-1.3688750 -0.2814875
# you can come up with a p-value equivalent of a two-tailed t-test thus:
mycdf <- ecdf(beta[year == 2012, ]$Miscanthus - beta[year == 2014, ]$Miscanthus)
(1 - mycdf(v = 0)) * 2
# 0.0012
Plotting the post-hoc comparison of intercepts with dashed lines to indicate 95% CI:
ggplot(data = data.frame(x = beta[year == 2014]$Miscanthus - beta[year == 2012]$Miscanthus)) +
geom_histogram(aes(x, ..density..), alpha = 0.5) +
geom_vline(aes(xintercept = quantile(x, 0.025)), linetype = 2) +
geom_vline(aes(xintercept = quantile(x, 0.975)), linetype = 2) +
xlab('difference in intercepts between 2012 and 2014') + theme_bw()
Above I forgot to add session=="4"|!(year %in% c("2010", "2012"))
... but now I see that this filters out only the sessions with peak biomass for 2010 and 2012 when there were many sampling dates ... so the results I presented are illustrative but not valid since they contain many dates from 2012.
Just to make sure I'm thinking about this correctly and reading Kruschke right: The order of the samples doesn't matter, nor does the fact they were generated from separate model runs, correct?
More concretely, for chains drawn from a single model run we say the elementwise subtraction of samples integrates over differences in the mutually conditional distributions of both parameters, so theta_2[i] - theta_1[i]
contains information about dependencies between parameters (say, correlation) that would be lost if we shuffled the chains, say theta_2[i] - sample(theta_1[i])
.
But here, we ran completely separate models and ignored conditionality, so both theta_2014[i] - theta_2012[i]
and theta_2014[i] - sample(theta_2011[i])
should give us the same distribution of marginal differences. This is OK and if ignoring conditionality was a problem we should have fit a time-resolved model. Right?
Here's what I think is fully working code, with the results as comments.
library(dplyr)
library(tidyr)
get_icpt = function(path){
rzenv = new.env()
load(path, envir=rzenv)
rz_fit = get("rz_mtd", envir=rzenv)
icpt = rstan::extract(rz_fit, pars="intercept")[[1]]
cropkey = get("cropkey", envir=rzenv)
colnames(icpt) = cropkey$name[match(seq_len(ncol(icpt)), cropkey$num)]
as.data.frame(icpt)
}
intercept_chains = (
list.files("~/UI/efrhizo/data/stan/", pattern="*.Rdata", full.names=TRUE)
%>% data.frame(src=., stringsAsFactors=FALSE)
%>% group_by(src)
%>% do(intfit=get_icpt(.$src))
%>% unnest()
%>% extract(src, c("year", "session"), regex="_(\\d{4})_s(\\d)"))
icpt_2010 = intercept_chains %>% filter(year=="2010", session=="4") %>% select(-year, -session)
icpt_2011 = intercept_chains %>% filter(year=="2011", session=="4") %>% select(-year, -session)
icpt_2012 = intercept_chains %>% filter(year=="2012", session=="4") %>% select(-year, -session)
icpt_2013 = intercept_chains %>% filter(year=="2013", session=="5") %>% select(-year, -session)
icpt_2014 = intercept_chains %>% filter(year=="2014", session=="2") %>% select(-year, -session)
# 95% intervals for differences between midsummer 2010 and 2014 intercept terms:
# Valid for all four crops, but not useful for maize-soybean
# because it compares 2010 soy against 2014 maize.
apply(icpt_2014 - icpt_2010, 2, quantile, c(0.025,0.975))
# Maize-Soybean Switchgrass Miscanthus Prairie
# 2.5% 0.1976101 0.3543609 0.5058077 0.6400177
# 97.5% 1.2978562 1.2703830 1.3982014 1.4966544
# To ask if soy intercept changed, compare the two soy years:
quantile(icpt_2013$`Maize-Soybean` - icpt_2010$`Maize-Soybean`, c(0.025,0.975))
# 2.5% 97.5%
# -0.3810014 0.7925172
# Ditto if maize changed, but comparing between all 3 years
quantile(icpt_2014$`Maize-Soybean` - icpt_2011$`Maize-Soybean`, c(0.025,0.975))
# 2.5% 97.5%
# -0.2843859 0.6472303
quantile(icpt_2014$`Maize-Soybean` - icpt_2012$`Maize-Soybean`, c(0.025,0.975))
# 2.5% 97.5%
# -0.3659589 0.6407363
quantile(icpt_2012$`Maize-Soybean` - icpt_2011$`Maize-Soybean`, c(0.025,0.975))
# 2.5% 97.5%
# -0.3765040 0.4751392
I'm reporting these in the manuscript as: "All three perennials increased their root volume (95% intervals for difference between midsummer 2010 and 2014 intercept terms (α) did not include 0) while maize and soybean did not (95% intervals for α differences among 2011, 2012, 2014 (maize) or between 2010 and 2013 (soybean) all included 0)."
TODO before closing: Save this analysis somewhere the Makefile can find it.
Right?
Yes that is exactly my understanding as well. You can of course test this - 'order matters within but not across models'. Always nice to have independent actual confirmation, see effect size of correlation etc. On Fri, Dec 2, 2016 at 1:45 PM Chris Black notifications@github.com wrote:
Here's what I think is fully working code, with the results as comments.
library(dplyr) library(tidyr)
get_icpt = function(path){ rzenv = new.env() load(path, envir=rzenv) rz_fit = get("rz_mtd", envir=rzenv) icpt = rstan::extract(rz_fit, pars="intercept")[[1]] cropkey = get("cropkey", envir=rzenv) colnames(icpt) = cropkey$name[match(seq_len(
ncol(icpt)), cropkey$num)] as.data.frame(icpt) }
intercept_chains = ( list.files("~/UI/efrhizo/data/stan/", pattern="*.Rdata", full.names=TRUE) %>% data.frame(src=., stringsAsFactors=FALSE) %>% group_by(src) %>% do(intfit=geticpt(.$src)) %>% unnest() %>% extract(src, c("year", "session"), regex="(\d{4})_s(\d)"))
icpt_2010 = intercept_chains %>% filter(year=="2010", session=="4") %>% select(-year, -session) icpt_2011 = intercept_chains %>% filter(year=="2011", session=="4") %>% select(-year, -session) icpt_2012 = intercept_chains %>% filter(year=="2012", session=="4") %>% select(-year, -session) icpt_2013 = intercept_chains %>% filter(year=="2013", session=="5") %>% select(-year, -session)icpt_2014 = intercept_chains %>% filter(year=="2014", session=="2") %>% select(-year, -session)
95% intervals for differences between midsummer 2010 and 2014 intercept terms: Valid for all four crops, but not useful for maize-soybean because it compares 2010 soy against 2014 maize.
apply(icpt_2014 - icpt_2010, 2, quantile, c(0.025,0.975))
Maize-Soybean Switchgrass Miscanthus Prairie
2.5% 0.1976101 0.3543609 0.5058077 0.6400177
97.5% 1.2978562 1.2703830 1.3982014 1.4966544
To ask if soy intercept changed, compare the two soy years:
quantile(icpt_2013$
Maize-Soybean
- icpt_2010$Maize-Soybean
, c(0.025,0.975))2.5% 97.5%
-0.3810014 0.7925172
Ditto if maize changed, but comparing between all 3 years
quantile(icpt_2014$
Maize-Soybean
- icpt_2011$Maize-Soybean
, c(0.025,0.975))2.5% 97.5%
-0.2843859 0.6472303
quantile(icpt_2014$
Maize-Soybean
- icpt_2012$Maize-Soybean
, c(0.025,0.975))2.5% 97.5%
-0.3659589 0.6407363
quantile(icpt_2012$
Maize-Soybean
- icpt_2011$Maize-Soybean
, c(0.025,0.975))2.5% 97.5%
-0.3765040 0.4751392
I'm reporting these in the manuscript as: "All three perennials increased their root volume (95% intervals for difference between midsummer 2010 and 2014 intercept terms (α) did not include 0) while maize and soybean did not (95% intervals for α differences among 2011, 2012, 2014 (maize) or between 2010 and 2013 (soybean) all included 0)."
TODO before closing: Save this analysis somewhere the Makefile can find it.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/infotroph/efrhizo/issues/7#issuecomment-264544884, or mute the thread https://github.com/notifications/unsubscribe-auth/AAcX5xb-yVjvsXKi_sz_nMmu3r8VfHZ1ks5rEHVZgaJpZM4LBpqn .
This strategy is now implemented in scripts/plot_chaindiffs.R. The implementation is kind of sprawling and resource-intensive, but it produces nice graphs and all the difference quantiles I wanted.
To do better stats than "95% intervals [did/did not] overlap", @dlebauer suggested Kolmogorov-Smirnov tests on the HMC samples -- this gives a nonparametric test of whether two samples are drawn from the same distribution.
I store the model fit objects from each run in an awkward format: Each fitted Stan model is inside an Rdata file, but all Rdata files contain the same object names so they clobber each other if you load multiple fits into the same workspace. Here's one way to extract the full chains.
This allows easier testing, but doesn't yet address the question of whether "intercept estimates came from the same distribution in year and year2" is the same as "The crop had the same amount of root in both years".