fix provenance in model

lizzieinvancouver commented 5 years ago

R3 complained about use of provenance and I think R3 had a point. Provenance is really about local adaptation so if we mix studies with real provenance info and those without it (i.e., where we don't really know where the seed came from) I think we are mushing an important issue in our latitude model.

To that end I just ran the code at the top of models_stan_lat.R and looked at what studies it uses (@cchambe12 is this the correct code to use?) and reviewed each paper quickly (save for one of the heide93 papers, I cannot find both so we should check but based on the papers from As, I think we are okay).

> sort(unique(lat.stan$datasetID))
 [1] "basler12"     "basler14"     "caffarra11a"  "caffarra11b"  "calme94"      "campbell75"  
 [7] "falusi03"     "falusi90"     "falusi96"     "falusi97"     "ghelardini10" "heide93"     
[13] "heide93a"     "laube14a"     "laube14b"     "li05"         "myking95"     "myking97"    
[19] "myking98"     "partanen01"   "partanen98"   "rinne94"      "Sanz-Perez09" "skuterud94"  
[25] "spann04"      "spiers74"     "webb78"       "worrall67"    "zohner16"

Okay ...

basler12 and basler14 -- has two distinct populations (same ones)
calme94 -- fine, authors give clear provenance
campbell75 -- fine, authors give clear provenance
falusi90 -- should be fine, authors give provenance info and that's what we entered
falusi96 97 and 03 -- fine, authors give clear provenance (and for at least falusi96 we entered all separately)
ghelardini10 -- fine, authors give clear provenance (and looks like we entered all separately)
heide93 -- just says collected at As, but probably fine - heide93a -- where is this paper?
laube14a and laube14b -- one paper is clear these come from a forest, other is not as clear (probably okay)
li05 -- fine, authors give clear provenance (and looks like we entered all separately)
myking95 -- fine, authors give clear provenance (and looks like we entered all separately)
myking97 -- fine, authors give clear provenance (and looks like we entered all separately)
myking98 -- seems fine
partanen01 -- fine, authors give clear provenance (and looks like we entered all separately)
partanen98 -- fine, authors give clear provenance (and looks like we entered all separately)
rinne94 -- just says seeds were collected in Northern Finland so probably okay
skuterud94 -- fine, authors give clear provenance (and looks like we entered all separately)
worrall67 -- fine

Less great ...

caffarra11a and caffara11b-- both use IPG clones originally from Germany
Sanz-Perez09 -- hard to tell, but their 'Plant Material' section gives me pause
spann04 -- these come from a commercial blueberry nursery
spiers74 -- crops again, probably not really provenance
webb78 -- a mix of seed lot and from another spot, so not okay
zohner16 -- not classical provenance, as the reviewer points out

As you can see the VAST majority of studies were really clear about provenance.

I am not sure what to do next and could use advice from @cchambe12 .... would it be possible to re-run the lat models excluding these 'less great' studies? Do we deal with this some other way?

Any help appreciated! @AileneKane @dbuona @cchambe12 ... all thoughts and opinions welcome.

Side note: I thought we were more careful about what we put in 'provenance' but not really -- it seems you can guess when it's right if 'population' is also entered but we should be clear what we mean here when we post the data to KNB.

AileneKane commented 5 years ago

@lizzieinvancouver I have two (somewhat opposing) thoughts about this: 1) If it's relatively easy to fit the latitude model with just the "okay" studies, that seems like a useful thing to do. It may change things, though, if lots of species need to be excluded, but I suppose it would be useful to know this. 2) I think the model is fine as it is. Our study is not intended to be about population genetics or genetic differences in budburst responses. I think the definition of "provenance latitude" in the meta-data associated with OSPREE is reasonable and useful "Latitude of samples collected." I think the problem may be more in the name, which is misleading, than in the way data are listed in the database. My understanding was that, from fairly early on in the OSPREE database development, the intention of this column was not to strictly adhere to "provenance" in the way that people traditionally think of from botanic gardens or population genetic studies.

lizzieinvancouver commented 5 years ago

@AileneKane I think 1. would be good; re: 2 -- I think we can do that if we must but I'd prefer not to. When I helped Dan and Tim set up the database my intention of 'provenance' was to treat it as is traditionally done. I think what we strayed from that (understandably!) and the provenance column right now is not provenance as used in ecology or evolution, unfortunately. I think we also need to be more careful here -- the studies we have cited about HOW provenance may affect photoperiod have nothing to do with provenance where tissue was grown, they have to do with local adaptation. I'd rather get this right if we can.

AileneKane commented 5 years ago

The other thing to keep in mind with this is that provenance latitude (I.e. the latitude at which samples were collected) is used to pull climate data and estimate chilling.

cchambe12 commented 5 years ago

@lizzieinvancouver @AileneKane I will run this new model version today and report back here. Thanks, Lizzie, for going through all of the papers and sorting them for me! This is a tricky situation though because it would mean changing large portions of the data including---as Ailene mentions---the climate data. I toyed with the idea of trying a new model where I pull climate data from the growing latitudes but then I realized we never thoroughly cleaned that column, nor would it capture the crucial field sampling dates. I would be happy to Skype today to discuss this more in depth once I get the new results from the new model.

cchambe12 commented 5 years ago

@lizzieinvancouver In regards to spann04 and spiers74, should those have been weeded out when we removed crops? Do I need to fix that code?

cchambe12 commented 5 years ago

By removing the less great studies, I received the following warnings:

Warning messages:
1: There were 253 divergent transitions after warmup. Increasing adapt_delta above 0.99 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 
2: There were 1 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low 
3: Examine the pairs() plot to diagnose sampling problems

4: The largest R-hat is 1.16, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat 
5: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess 
6: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess

lizzieinvancouver commented 5 years ago

@cchambe12 Thank you for this! Could you push the code so I can see what's up?

Regarding spann04 and spiers74, they do appear to be commercial blueberries ... so maybe we we forgot to toss out commercial blueberries in our crop code? Would be great if you could check.

I don't think we need to worry so much about the chilling ... for the most part I think the provenance column should be correct for using that. At least in the papers I looked at yesterday it seemed okay.

cchambe12 commented 5 years ago

@lizzieinvancouver Great about the chilling!! And yes, I did push it. It's nested within the earlier lat model code: analyses/lat_analysis/models_stan_lat.R

I will definitely check on the blueberries!

lizzieinvancouver commented 5 years ago

@cchambe12 Okay! Sorry, I see it now. Can you let me know what model you're running? I just tried m2l.inter and got only 4 divergent transitions ... I imagine those will go away with a higher number of iterations.

cchambe12 commented 5 years ago

@lizzieinvancouver Weird!!! Me too! I'm rerunning it now on a new computer. I'll keep you posted! Hopefully something was just wonky the first time I ran it...

cchambe12 commented 5 years ago

@lizzieinvancouver Woohoo!! Okay no idea what was going on but new warning messages below..

Warning messages: 1: There were 1 divergent transitions after warmup. Increasing adapt_delta above 0.99 may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 2: Examine the pairs() plot to diagnose sampling problems 3: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable. Running the chains for more iterations may help. See http://mc-stan.org/misc/warnings.html#bulk-ess 4: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable. Running the chains for more iterations may help. See http://mc-stan.org/misc/warnings.html#tail-ess

And check_all_diagnostics: [1] "n_eff / iter looks reasonable for all parameters" [1] "Rhat looks reasonable for all parameters" [1] "1 of 4000 iterations ended with a divergence (0.025%)" [1] " Try running with larger adapt_delta to remove the divergences" [1] "0 of 4000 iterations saturated the maximum tree depth of 10 (0%)" [1] "E-FMI indicated no pathological behavior"

I'm going to increase the iterations to 4000 with 2000 warmup to see if it cleans things up just a bit. But overall I think we're okay! I'll rerun a new muplot after the higher iterations and push when it's compiled.

lizzieinvancouver commented 5 years ago

@cchambe12 Thanks, but I need the code you ran (not the warnings).

cchambe12 commented 5 years ago

@lizzieinvancouver I'm just running models_stan_lat.R lines 6-168

cchambe12 commented 5 years ago

I ran the model 4 times, only the first one had severe mishaps so I think I had something up in my R. I increased the warmup and iterations and then created a new muplot: analyses/lat_analysis/figures/latanalysis_rmdatasets.pdf. It looks very similar to the original plot but less variation across species.

cchambe12 commented 5 years ago

@cchambe12

Regarding spann04 and spiers74, they do appear to be commercial blueberries ... so maybe we we forgot to toss out commercial blueberries in our crop code? Would be great if you could check.

@lizzieinvancouver I cleaned this up in master, should I make the change in bbculdesac as well and then rerun the main model?

lizzieinvancouver commented 5 years ago

Okay, I will probably trash my code -- it's running fine for me too (I upped warmup to 3K out of 4K iter) ... just so we have it, here is my output:

> if(use.zscore == TRUE){m2l.inter = stan('../lat_analysis/stan/winter_2level_lat.stan', data = datalist.lat,
+               iter = 4000, warmup=3000, control=list(max_treedepth = 15,adapt_delta = 0.99))}
> 
> check_all_diagnostics(m2l.inter)
[1] "n_eff / iter looks reasonable for all parameters"
[1] "Rhat looks reasonable for all parameters"
[1] "0 of 4000 iterations ended with a divergence (0%)"
[1] "0 of 4000 iterations saturated the maximum tree depth of 10 (0%)"
[1] "E-FMI indicated no pathological behavior"
> #pl<- plot(m2l.iter, pars="b_", ci.lvl=0.5) 
> launch_shinystan(m2l.inter)

Launching ShinyStan interface... for large models this  may take some time.

Listening on http://127.0.0.1:7200
> 
> m2l.inter.sum <- summary(m2l.inter)$summary
> m2l.inter.sum[grep("mu_", rownames(m2l.inter.sum)),]
                   mean    se_mean       sd       2.5%         25%       50%        75%     97.5%
mu_a_sp       27.513479 0.04861618 2.102037  23.249263  26.1363304 27.519140 28.8971971 31.551261
mu_b_force_sp -6.450839 0.02787499 1.241591  -8.829973  -7.2821422 -6.427563 -5.6690633 -3.932398
mu_b_photo_sp -1.703004 0.03628077 1.600930  -4.806581  -2.7597819 -1.764898 -0.6646035  1.509000
mu_b_chill_sp -9.087727 0.03575551 1.883463 -12.810002 -10.2985321 -9.096760 -7.8370546 -5.419063
mu_b_lat_sp   -0.857695 0.07891213 1.901191  -4.330635  -2.1232454 -1.026549  0.2870450  3.176014
mu_b_pl_sp     1.694155 0.05381354 2.018184  -2.385062   0.4137491  1.645537  2.9688960  5.744989
                  n_eff      Rhat
mu_a_sp       1869.4722 1.0017403
mu_b_force_sp 1983.9346 0.9996498
mu_b_photo_sp 1947.1149 1.0014073
mu_b_chill_sp 2774.7793 1.0004809
mu_b_lat_sp    580.4487 1.0112813
mu_b_pl_sp    1406.4962 1.0044025
> m2l.inter.sum[grep("sigma_", rownames(m2l.inter.sum)),]
                      mean     se_mean        sd      2.5%       25%       50%       75%     97.5%
sigma_a_sp        8.925233 0.034231239 1.5504495  6.293420  7.843566  8.805375  9.868062 12.255383
sigma_b_force_sp  4.701939 0.025811209 0.9875672  3.078785  4.001529  4.597984  5.285282  6.842125
sigma_b_photo_sp  5.381701 0.046114296 1.3935272  3.134519  4.414569  5.213919  6.133165  8.561501
sigma_b_chill_sp  7.589549 0.030762390 1.3971715  5.266960  6.626086  7.432849  8.369738 10.861380
sigma_b_lat_sp    3.963943 0.167054808 1.9522395  1.003804  2.538256  3.694080  5.147530  8.459794
sigma_b_pl_sp     6.169420 0.068996467 1.8467310  3.205506  4.851891  5.956756  7.214324 10.397314
sigma_y          14.964833 0.004126825 0.2970901 14.412300 14.757743 14.957816 15.164636 15.565699
                     n_eff      Rhat
sigma_a_sp       2051.4929 1.0015096
sigma_b_force_sp 1463.9176 1.0011769
sigma_b_photo_sp  913.1871 1.0020824
sigma_b_chill_sp 2062.8104 1.0020448
sigma_b_lat_sp    136.5678 1.0444707
sigma_b_pl_sp     716.3967 1.0027910
sigma_y          5182.5598 0.9997286

... and:

> if(use.zscore == FALSE){
+   m2l.inter = stan('../lat_analysis/stan/winter_2level_lat.stan', data = datalist.lat.nonz,iter = 4000, warmup=3000, control=list(max_treedepth = 12,adapt_delta = 0.99))
+ }
> 
> check_all_diagnostics(m2l.inter)
[1] "n_eff / iter looks reasonable for all parameters"
[1] "Rhat looks reasonable for all parameters"
[1] "0 of 4000 iterations ended with a divergence (0%)"
[1] "0 of 4000 iterations saturated the maximum tree depth of 10 (0%)"
[1] "E-FMI indicated no pathological behavior"
> #pl<- plot(m2l.iter, pars="b_", ci.lvl=0.5) 
> #launch_shinystan(m2l.inter)
> 
> m2l.inter.sum <- summary(m2l.inter)$summary
> m2l.inter.sum[grep("mu_", rownames(m2l.inter.sum)),]
                   mean    se_mean       sd       2.5%         25%       50%        75%     97.5%
mu_a_sp       27.513479 0.04861618 2.102037  23.249263  26.1363304 27.519140 28.8971971 31.551261
mu_b_force_sp -6.450839 0.02787499 1.241591  -8.829973  -7.2821422 -6.427563 -5.6690633 -3.932398
mu_b_photo_sp -1.703004 0.03628077 1.600930  -4.806581  -2.7597819 -1.764898 -0.6646035  1.509000
mu_b_chill_sp -9.087727 0.03575551 1.883463 -12.810002 -10.2985321 -9.096760 -7.8370546 -5.419063
mu_b_lat_sp   -0.857695 0.07891213 1.901191  -4.330635  -2.1232454 -1.026549  0.2870450  3.176014
mu_b_pl_sp     1.694155 0.05381354 2.018184  -2.385062   0.4137491  1.645537  2.9688960  5.744989
                  n_eff      Rhat
mu_a_sp       1869.4722 1.0017403
mu_b_force_sp 1983.9346 0.9996498
mu_b_photo_sp 1947.1149 1.0014073
mu_b_chill_sp 2774.7793 1.0004809
mu_b_lat_sp    580.4487 1.0112813
mu_b_pl_sp    1406.4962 1.0044025
> m2l.inter.sum[grep("sigma_", rownames(m2l.inter.sum)),]
                      mean     se_mean        sd      2.5%       25%       50%       75%     97.5%
sigma_a_sp        8.925233 0.034231239 1.5504495  6.293420  7.843566  8.805375  9.868062 12.255383
sigma_b_force_sp  4.701939 0.025811209 0.9875672  3.078785  4.001529  4.597984  5.285282  6.842125
sigma_b_photo_sp  5.381701 0.046114296 1.3935272  3.134519  4.414569  5.213919  6.133165  8.561501
sigma_b_chill_sp  7.589549 0.030762390 1.3971715  5.266960  6.626086  7.432849  8.369738 10.861380
sigma_b_lat_sp    3.963943 0.167054808 1.9522395  1.003804  2.538256  3.694080  5.147530  8.459794
sigma_b_pl_sp     6.169420 0.068996467 1.8467310  3.205506  4.851891  5.956756  7.214324 10.397314
sigma_y          14.964833 0.004126825 0.2970901 14.412300 14.757743 14.957816 15.164636 15.565699
                     n_eff      Rhat
sigma_a_sp       2051.4929 1.0015096
sigma_b_force_sp 1463.9176 1.0011769
sigma_b_photo_sp  913.1871 1.0020824
sigma_b_chill_sp 2062.8104 1.0020448
sigma_b_lat_sp    136.5678 1.0444707
sigma_b_pl_sp     716.3967 1.0027910
sigma_y          5182.5598 0.9997286

lizzieinvancouver commented 5 years ago

And info on my datalist:

> str(datalist.lat)
List of 8
 $ y    : num [1:1437] 60 60 60 60 60 60 60 60 60 60 ...
 $ chill: num [1:1437] 1.189 -0.693 0.582 -1.354 1.189 ...
 $ force: num [1:1437] 0.146 0.146 0.146 0.146 0.146 ...
 $ photo: num [1:1437] -0.293 -0.293 -0.293 -0.293 -0.293 ...
 $ lat  : num [1:1437] -1.22 -1.22 -1.22 -1.22 -1.22 ...
 $ sp   : num [1:1437] 14 14 14 14 14 14 14 14 14 14 ...
 $ N    : int 1437
 $ n_sp : int 31
>

lizzieinvancouver commented 5 years ago

@cchambe12 @AileneKane I personally would vote to switch to this model because it actually tests what we mean to test (local adaptation of photoperiod cue). If we use the model we currently have people could discount it because we don't only use provenance studies, even though actually the answer holds up.

AileneKane commented 5 years ago

Sounds good! I can add this into the new ms version! If it’s possible to easily update your model table, cat, I’d love it! Otherwise, I can do it tomorrow. Thank you!

cchambe12 commented 5 years ago

@AileneKane I'm going to update the table now unless you're working in the supp!

cchambe12 commented 5 years ago

@AileneKane Table should be all set!

AileneKane commented 5 years ago

@cchambe12 Thank you cat!

lizzieinvancouver commented 5 years ago

Just checked Heide 93a, which is multiple provenances and correctly entered!

lizzieinvancouver commented 5 years ago

I think we're good to go!

lizzieinvancouver commented 5 years ago

@AileneKane @cchambe12 Thank you both for working on this! I think it's a strong response and hope the reviewer agrees.

lizzieinvancouver / ospree

fix provenance in model #318