infotroph / efrhizo

data & analysis code for UIUC/EBI Energy Farm minirhizotron project
1 stars 1 forks source link

Extracting HMC chains for distribution tests #7

Closed infotroph closed 7 years ago

infotroph commented 7 years ago

To do better stats than "95% intervals [did/did not] overlap", @dlebauer suggested Kolmogorov-Smirnov tests on the HMC samples -- this gives a nonparametric test of whether two samples are drawn from the same distribution.

I store the model fit objects from each run in an awkward format: Each fitted Stan model is inside an Rdata file, but all Rdata files contain the same object names so they clobber each other if you load multiple fits into the same workspace. Here's one way to extract the full chains.

library(dplyr)
library(tidyr)

get_icpt = function(path){
    rzenv = new.env()
    load(path, envir=rzenv)
    rz_fit = get("rz_mtd", envir=rzenv)
    icpt = rstan::extract(rz_fit, pars="intercept")[[1]]
    cropkey = get("cropkey", envir=rzenv)
    colnames(icpt) = cropkey$name[match(seq_len(ncol(icpt)), cropkey$num)]    
    as.data.frame(icpt)
}

intercept_chains = (
    list.files("data/stan/", pattern="*.Rdata", full.names=TRUE) 
    %>% data.frame(src=., stringsAsFactors=FALSE)
    %>% group_by(src)
    %>% do(intfit=get_icpt(.$src))
    %>% unnest()
    %>% extract(src, c("year", "session"), regex="_(\\d{4})_s(\\d)"))

This allows easier testing, but doesn't yet address the question of whether "intercept estimates came from the same distribution in year and year2" is the same as "The crop had the same amount of root in both years".

infotroph commented 7 years ago

I'm now suspecting K-S is too stringent a test -- with 20000 samples, we have nearly infinite power to resolve differences between distributions. For example, here are violin plots of all HMC samples for model intercepts, filtered to show only midsummer of each year:

ggplot(
    (intercept_chains 
        %>% gather(crop, intercept, -year,-session) 
        %>% filter(session=="4"|!(year %in% c("2010", "2012"))) 
        %>% mutate(annual = if_else(crop=="Maize-Soybean" & year %in% c("2010", "2013"), "Soybean", "Maize")),
    aes(year, intercept, fill=annual))
+ geom_violin()
+ facet_wrap(~crop)
+ theme_ggEHD()
+ scale_fill_grey()
+ theme(legend.title=element_blank(), legend.position=c(0.16, 0.62))
)

intercept_violin_midsummers_20161201

Notice that the distributions for maize in 2011 and 2012 are near identical -- yes, 2012 is longer tailed and the median is arguably a touch greater, but I'd argue that the correct interpretation is that the posterior uncertainty in the intercept term is greater than the difference between the years. However:

ms11 = subset(intercept_chains, year=="2011" & session=="4")$`Maize-Soybean`
ms12 = subset(intercept_chains, year=="2012" & session=="4")$`Maize-Soybean`
ks.test(ms11, ms12)
# D = 0.13965, p-value < 2.2e-16
dlebauer commented 7 years ago

Those are nice plots. And you are right about the sensitivity of the ks test to the sample size; it is used to test if samples come from the same distribution, (and we know these don't!). In any case, it doesn't answer the correct question. I think here you want to know 'are the means different' and this is where the Kruske 2013 "Bayesian estimation supersedes the t-test" provides an intuitive approach to asking questions about the posterior parameter estimates represented by the mcmc sample chains.

Something like

are the intercept means different in 2012 and 2014? _if you compare 2011 to 2014 p = 0.0504 so I am comparing 2012 and 2014 because the difference is more clear

# first using data.table since I am more familiar with it
library(data.table)
beta <- intercept_chains %>% setDT 

quantile(beta[year == 2012, ]$Miscanthus - beta[year == 2014, ]$Miscanthus, c(0.025, 0.975))
#      2.5%      97.5% 
#-1.3688750 -0.2814875 

# you can come up with a p-value equivalent of a two-tailed t-test thus:
mycdf <- ecdf(beta[year == 2012, ]$Miscanthus - beta[year == 2014, ]$Miscanthus)
(1 - mycdf(v = 0)) * 2
# 0.0012

Plotting the post-hoc comparison of intercepts with dashed lines to indicate 95% CI:

ggplot(data = data.frame(x = beta[year == 2014]$Miscanthus - beta[year == 2012]$Miscanthus)) +
  geom_histogram(aes(x, ..density..), alpha = 0.5) + 
  geom_vline(aes(xintercept = quantile(x, 0.025)), linetype = 2) + 
  geom_vline(aes(xintercept = quantile(x, 0.975)), linetype = 2) + 
  xlab('difference in intercepts between 2012 and 2014') + theme_bw()

image

dlebauer commented 7 years ago

Above I forgot to add session=="4"|!(year %in% c("2010", "2012")) ... but now I see that this filters out only the sessions with peak biomass for 2010 and 2012 when there were many sampling dates ... so the results I presented are illustrative but not valid since they contain many dates from 2012.

infotroph commented 7 years ago

Just to make sure I'm thinking about this correctly and reading Kruschke right: The order of the samples doesn't matter, nor does the fact they were generated from separate model runs, correct?

More concretely, for chains drawn from a single model run we say the elementwise subtraction of samples integrates over differences in the mutually conditional distributions of both parameters, so theta_2[i] - theta_1[i] contains information about dependencies between parameters (say, correlation) that would be lost if we shuffled the chains, say theta_2[i] - sample(theta_1[i]).

But here, we ran completely separate models and ignored conditionality, so both theta_2014[i] - theta_2012[i] and theta_2014[i] - sample(theta_2011[i]) should give us the same distribution of marginal differences. This is OK and if ignoring conditionality was a problem we should have fit a time-resolved model. Right?

infotroph commented 7 years ago

Here's what I think is fully working code, with the results as comments.

library(dplyr)
library(tidyr)

get_icpt = function(path){
    rzenv = new.env()
    load(path, envir=rzenv)
    rz_fit = get("rz_mtd", envir=rzenv)
    icpt = rstan::extract(rz_fit, pars="intercept")[[1]]
    cropkey = get("cropkey", envir=rzenv)
    colnames(icpt) = cropkey$name[match(seq_len(ncol(icpt)), cropkey$num)]    
    as.data.frame(icpt)
}

intercept_chains = (
    list.files("~/UI/efrhizo/data/stan/", pattern="*.Rdata", full.names=TRUE) 
    %>% data.frame(src=., stringsAsFactors=FALSE)
    %>% group_by(src)
    %>% do(intfit=get_icpt(.$src))
    %>% unnest()
    %>% extract(src, c("year", "session"), regex="_(\\d{4})_s(\\d)"))

icpt_2010 = intercept_chains %>% filter(year=="2010", session=="4") %>% select(-year, -session)
icpt_2011 = intercept_chains %>% filter(year=="2011", session=="4") %>% select(-year, -session)
icpt_2012 = intercept_chains %>% filter(year=="2012", session=="4") %>% select(-year, -session)
icpt_2013 = intercept_chains %>% filter(year=="2013", session=="5") %>% select(-year, -session)
icpt_2014 = intercept_chains %>% filter(year=="2014", session=="2") %>% select(-year, -session)

# 95% intervals for differences between midsummer 2010 and 2014 intercept terms: 
# Valid for all four crops, but not useful for maize-soybean 
# because it compares 2010 soy against 2014 maize.
apply(icpt_2014 - icpt_2010, 2, quantile, c(0.025,0.975))
#       Maize-Soybean Switchgrass Miscanthus   Prairie
# 2.5%      0.1976101   0.3543609  0.5058077 0.6400177
# 97.5%     1.2978562   1.2703830  1.3982014 1.4966544

# To ask if soy intercept changed, compare the two soy years:
quantile(icpt_2013$`Maize-Soybean` - icpt_2010$`Maize-Soybean`, c(0.025,0.975))
#       2.5%      97.5% 
# -0.3810014  0.7925172 

# Ditto if maize changed, but comparing between all 3 years
quantile(icpt_2014$`Maize-Soybean` - icpt_2011$`Maize-Soybean`, c(0.025,0.975))
#       2.5%      97.5% 
# -0.2843859  0.6472303 
quantile(icpt_2014$`Maize-Soybean` - icpt_2012$`Maize-Soybean`, c(0.025,0.975))
#       2.5%      97.5% 
# -0.3659589  0.6407363 
quantile(icpt_2012$`Maize-Soybean` - icpt_2011$`Maize-Soybean`, c(0.025,0.975))
#       2.5%      97.5% 
# -0.3765040  0.4751392 

I'm reporting these in the manuscript as: "All three perennials increased their root volume (95% intervals for difference between midsummer 2010 and 2014 intercept terms (α) did not include 0) while maize and soybean did not (95% intervals for α differences among 2011, 2012, 2014 (maize) or between 2010 and 2013 (soybean) all included 0)."

TODO before closing: Save this analysis somewhere the Makefile can find it.

dlebauer commented 7 years ago

Right?

Yes that is exactly my understanding as well. You can of course test this - 'order matters within but not across models'. Always nice to have independent actual confirmation, see effect size of correlation etc. On Fri, Dec 2, 2016 at 1:45 PM Chris Black notifications@github.com wrote:

Here's what I think is fully working code, with the results as comments.

library(dplyr) library(tidyr)

get_icpt = function(path){ rzenv = new.env() load(path, envir=rzenv) rz_fit = get("rz_mtd", envir=rzenv) icpt = rstan::extract(rz_fit, pars="intercept")[[1]] cropkey = get("cropkey", envir=rzenv) colnames(icpt) = cropkey$name[match(seq_len(

ncol(icpt)), cropkey$num)] as.data.frame(icpt) }

intercept_chains = ( list.files("~/UI/efrhizo/data/stan/", pattern="*.Rdata", full.names=TRUE) %>% data.frame(src=., stringsAsFactors=FALSE) %>% group_by(src) %>% do(intfit=geticpt(.$src)) %>% unnest() %>% extract(src, c("year", "session"), regex="(\d{4})_s(\d)"))

icpt_2010 = intercept_chains %>% filter(year=="2010", session=="4") %>% select(-year, -session) icpt_2011 = intercept_chains %>% filter(year=="2011", session=="4") %>% select(-year, -session) icpt_2012 = intercept_chains %>% filter(year=="2012", session=="4") %>% select(-year, -session) icpt_2013 = intercept_chains %>% filter(year=="2013", session=="5") %>% select(-year, -session)icpt_2014 = intercept_chains %>% filter(year=="2014", session=="2") %>% select(-year, -session)

95% intervals for differences between midsummer 2010 and 2014 intercept terms: Valid for all four crops, but not useful for maize-soybean because it compares 2010 soy against 2014 maize.

apply(icpt_2014 - icpt_2010, 2, quantile, c(0.025,0.975))

Maize-Soybean Switchgrass Miscanthus Prairie

2.5% 0.1976101 0.3543609 0.5058077 0.6400177

97.5% 1.2978562 1.2703830 1.3982014 1.4966544

To ask if soy intercept changed, compare the two soy years:

quantile(icpt_2013$Maize-Soybean - icpt_2010$Maize-Soybean, c(0.025,0.975))

2.5% 97.5%

-0.3810014 0.7925172

Ditto if maize changed, but comparing between all 3 years

quantile(icpt_2014$Maize-Soybean - icpt_2011$Maize-Soybean, c(0.025,0.975))

2.5% 97.5%

-0.2843859 0.6472303

quantile(icpt_2014$Maize-Soybean - icpt_2012$Maize-Soybean, c(0.025,0.975))

2.5% 97.5%

-0.3659589 0.6407363

quantile(icpt_2012$Maize-Soybean - icpt_2011$Maize-Soybean, c(0.025,0.975))

2.5% 97.5%

-0.3765040 0.4751392

I'm reporting these in the manuscript as: "All three perennials increased their root volume (95% intervals for difference between midsummer 2010 and 2014 intercept terms (α) did not include 0) while maize and soybean did not (95% intervals for α differences among 2011, 2012, 2014 (maize) or between 2010 and 2013 (soybean) all included 0)."

TODO before closing: Save this analysis somewhere the Makefile can find it.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/infotroph/efrhizo/issues/7#issuecomment-264544884, or mute the thread https://github.com/notifications/unsubscribe-auth/AAcX5xb-yVjvsXKi_sz_nMmu3r8VfHZ1ks5rEHVZgaJpZM4LBpqn .

infotroph commented 7 years ago

This strategy is now implemented in scripts/plot_chaindiffs.R. The implementation is kind of sprawling and resource-intensive, but it produces nice graphs and all the difference quantiles I wanted.