Analysis constraints given design of field experiment?

kzaret / PIUV_plantation

Description, visualization, and analysis of Pilgerodendron uviferum seedling responses to field experiment treatments.

0 stars 0 forks source link

Analysis constraints given design of field experiment? #1

Open kzaret opened 2 years ago

kzaret commented 2 years ago

In March 2013, I transplanted 276 Pilgerodendron uviferum seedlings (that had been grown from cuttings) along a drainage gradient across three patches (from left to right in the photo below: Cushion, Transition, Burned Forest).

Groups of seedlings were systematically established along transects (see overview sketch below). The start of the first transect in each patch was haphazardly established; the other transects were placed at set intervals (except in the case of the two disjunct areas of the Transition patch). Seedlings were placed in groups at regular intervals along each transect. One seedling per group was established in one of seven treatment levels representing differences in naturally-occuring microsite topography and substrate type that varied by patch: Cushion -- pool or lawn; Transition -- Sphagnum magellanicum mat or other substrate; Burned Forest -- low, medium or high position (relative to the water table) on an S. magellanicum hummock. Seedlings were placed singly if microsite conditions at a given location along a transect did not include all treatment levels. Detailed diagrams and photos of each patch/treatment are located in the ExperimentalPlantation.pptx.

Repeat measurements/observations were made of each seedling across the following seasons (though not all measurements were made in each season): March 2013, March 2014, February 2016, March 2017, March 2018 and March 2019. The spreadsheet Experimental_plantation_edited.xlsx contains the data table, data dictionary and summaries of number of seedlings and number of groups per treatment.

I'm most interested in the response variables of survival, vigor, total height and substrate surface to tip. I've plotted each of these (figures pasted in .pptx, and code in Plantation.R [sorry, next time I hope to use R Markdown]).

Here are some questions related to whether/how to investigate the effect of treatment, especially, and also patch on my response variables:

Clearly I have a nested design. I have been thinking of this in terms of patch, group, treatment, seedling. However, there are also transects. I don't really expect transect to have an effect on seedling survival, growth or health; however, transects do vary in their distance along the overarching drainage gradient that runs from the Cushion patch through the Burned Forest patch. Do you think that transect needs to be included in the analysis?
It's seems like I could use a hierarchical, mixed-effects modeling approach similar to that of Salvatore Mangiafico, which accounts for the autocorrelation of repeated measures: https://rcompanion.org/handbook/I_09.html. What do you think?
What do I do about "groups" that include only a single seedling? What other imbalances in the design do I need to deal with?

ebuhle commented 2 years ago

Clearly I have a nested design. I have been thinking of this in terms of patch, group, treatment, seedling. However, there are also transects. I don't really expect transect to have an effect on seedling survival, growth or health; however, transects do vary in their distance along the overarching drainage gradient that runs from the Cushion patch through the Burned Forest patch. Do you think that transect needs to be included in the analysis?

Yes, not necessarily as a fixed effect (position along gradient) but just as a spatial random effect that, if I'm reading the map right, is similar in scale to that of seedling group. In fact, I'm inclined to see transect as the natural unit nested in patch. Your diagram above outlines the set of transects in each patch in black, but these units are totally confounded with patch type. Or almost totally -- what to do about the two Transition "units"? Ugh. No way to model those as a random effect, or as fixed unless you want to redefine patch type to have four levels. OK, here is one option (which, I am professionally obligated to say, I do not advocate for or against): pool all the transects in Transition, and deal with the spatial clustering indirectly by considering "distance along gradient" as a predictor in your candidate models.

So then it'd be patch / transect / group / seedling / observation, with microhabitat (I like that better than "treatment") contrast at the seedling level. The latter doesn't add a level to the nesting hierarchy, though, rather it's a fixed effect. However, there is a major issue with modeling the (statistical) effect of microhabitat: each patch has different microhabitats! So you fundamentally cannot distinguish the main effect of patch from the main effect of microhabitat (and obviously you can't interact them either). It's maximally unbalanced, a classic case of Hurlbertian pseudoreplication in the context of a two-factor design. If it were possible to classify microhabitats by substrate type (Sphagnum vs. non-Sphagnum, or whatever) and height relative to the water table, and to define and measure these factors consistently across patches, then this confounding could be unconfounded. Otherwise I don't know what to tell ya.

What do I do about "groups" that include only a single seedling? What other imbalances in the design do I need to deal with?

As I mentioned while we were texting, the singletons are not inherently problematic from the standpoint of hierarchical modeling / GLMMs. As long as enough groups do contain multiple seedlings, you'll be able to estimate the group-level variance component and "apply" it to the groups with N = 1. This is "borrowing strength" aka the Robin Hood principle, and is one of the main selling points of hierarchical models.

As I also said while texting, I think there could be ecological questions about whether microhabitats located near boundaries (i.e., within a seedling group) behave the same as those in "core" areas, but you could argue that including some of the latter in Transition makes the sample more representative (though only for "other substrates", I guess).

It's seems like I could use a hierarchical, mixed-effects modeling approach similar to that of Salvatore Mangiafico, which accounts for the autocorrelation of repeated measures: https://rcompanion.org/handbook/I_09.html. What do you think?

I agree! I only glanced at this, but "Repeated Measures ANOVA" is an old-school, classical name for what actually appears to be implemented there in a fairly modern, likelihood framework (namely nlme, predecessor of lme4). The fully modern, Bayesian approach as implemented in rstanarm and/or brms would just describe these as longitudinal hierarchical models, with multiple levels of grouping and yes, potentially an AR(1) error term ordered by year. (Depending on the response variable, i.e. Gaussian or count, you may need to use brms to implement the autocorrelated residuals.) Barring unforeseen complications (inconceivable!) it should be very straightforward to fit these models and do model selection, etc. If you need help, I know a guy.

ebuhle commented 2 years ago

Barring unforeseen complications (inconceivable!) it should be very straightforward to fit these models and do model selection, etc. If you need help, I know a guy.

I took a first crack at this; see /R/01_plantation_models.R. I started with total_height, because that variable and height above substrate (subsurf_to_tip) can be modeled with a normal likelihood, i.e. LMMs vs. GLMMs. Vigor is an ordinal response, so will need an appropriate model (e.g. ordered probit). With survival -- which we'll want to analyze on an incremental, annual basis -- we first need a column designating which seedlings that are newly dead were alive the last time they were observed. I don't think accounted_for or present account for this, as it were, but it was making my brain hurt so I put it aside for later.

However, an analogous issue comes into play with height. As we eventually realized in the dendro analysis, it makes more sense and is analytically simpler to model incremental growth than height. The latter is inherently mathematically autocorrelated, whereas it may well turn out that the growth rates are not empirically autocorrelated and we don't need the AR(1) errors. The tricky part is that seedlings are not observed every year, so I calculated growth rates (cm/yr) between non-missing measurements. (Sticky note: these data may provide information on the shape of the individual growth curve for seedlings, which would be relevant to your interests. For now I'm implicitly assuming growth is linear.)

I used stan_lmer() to fit some candidate models for total height growth rate and did some graphical posterior predictive checking. You can see for yourself. While there are modest differences among patches and microhabitat types, and substantial variance components associated with transect, group, and seedling, these models don't seem to fit the data terribly well. The marginal density plot shows the data are way more leptokurtic than the PPD. Not sure what's up with that. Anyway, it's a start!

kzaret commented 2 years ago

OK, here is one option (which, I am professionally obligated to say, I do not advocate for or against): pool all the transects in Transition, and deal with the spatial clustering indirectly by considering "distance along gradient" as a predictor in your candidate models.

I like that idea; I'll need to estimate 'distance along gradient' using the spatial data and add that to the .csv. You're suggesting only doing this for Transition -- does that mean creating different models for each patch?

However, there is a major issue with modeling the (statistical) effect of microhabitat: each patch has different microhabitats! So you fundamentally cannot distinguish the main effect of patch from the main effect of microhabitat (and obviously you can't interact them either). It's maximally unbalanced, a classic case of Hurlbertian pseudoreplication in the context of a two-factor design.

Lol. Right after I'd posted this issue it finally dawned on me that I was working with utterly different treatments per patch; 'microhabitats' helps highlight that fact. And the pseudopreplication, yes, sigh. I don't think things can be unconfounded. I will strive for a robust design during the next project.

I took a first crack at this; see /R/01_plantation_models.R

Thanks, Guy. ;-)

I appreciate what you wrote about the different families of models for the different response variables -- I was going to ask about that (a mild sign that I've been learning something). Modeling incremental height rather than growth makes sense to me except that I'm not confident that I consistently or accurately measured the true "total height" from one year to the next (because this involved trying to find the root collar with my finger by feel, then shoving a ruler through the substrate to that position along the stem, etc.). Talking with Andres today, it may make the most sense to examine this response variable just in terms of the total change in height from 2013 to 2019 (or even the maximum change in height across the sample period). In general, the seedlings should not be shrinking over time, though I did write a few notes about leaders being damaged (e.g. sometimes browsed). Another suggestion of Andres' was to only examine changes in height for the most robust seedlings -- those higher in vigor and undamaged.

I am much more confident in the accuracy of the measurements of seedling height above the substrate. As I mentioned by text, I think the changes in these values over time reflect incremental plant and bryophyte (esp. Sphagnum magellanicum) growth. Something about looking at these data in terms of incremental changes is rubbing me wrong, but I'm having trouble articulating why so I'll let it go for the moment, at least.

With survival -- which we'll want to analyze on an incremental, annual basis -- we first need a column designating which seedlings that are newly dead were alive the last time they were observed.

If a seedling was recorded as newly dead, then it was alive the previous time it was observed . . . except maybe for one instance when I couldn't find the seedling in the previous year (present = 0, accounted for = 0, newly dead = 0, dead = NA) and then found it dead the next year. Would it help if I created an "alive" variable?

I see that today you added models for growth rate in height above substrate. I'm about to take a look!

ebuhle commented 2 years ago

I like that idea; I'll need to estimate 'distance along gradient' using the spatial data and add that to the .csv. You're suggesting only doing this for Transition -- does that mean creating different models for each patch?

No, I'm saying do this for every transect. Of course gradient position will be strongly confounded with patch, but not completely -- there is a range of transect locations within each patch, especially Transition. The way I'd think of this is to compare the model with a fixed effect of patch to one with patch + distance. If the data favor the latter, then there's some fine-scale continuous variation along the gradient that isn't captured by the discrete patch types; conversely, controlling for such variation should help to isolate the effect of patch type.

"Distance along gradient" doesn't necessarily have to be spatial, BTW. If you have hydrological or vegetation data that more directly captures the relevant factors, that would be another option.

I don't think things can be unconfounded. I will strive for a robust design during the next project.

Actually, the previous point may be even more relevant here. Since patch and microhabitat can't be in the same model, a continuous covariate that serves as a proxy for patch type (i.e., microhabitat + distance) is the closest you can get. There will, again, still be partial confounding between the two main effects, but beggars can't be choosers.

Modeling incremental height rather than growth makes sense to me except that I'm not confident that I consistently or accurately measured the true "total height" from one year to the next (because this involved trying to find the root collar with my finger by feel, then shoving a ruler through the substrate to that position along the stem, etc.). Talking with Andres today, it may make the most sense to examine this response variable just in terms of the total change in height from 2013 to 2019 (or even the maximum change in height across the sample period).

Assuming you meant "incremental growth rather than height", this is a riff on the same conversation we had here. The height ~ b0 + b1*time and growth_rate ~ b1 formulations are the same model, and measurement error will affect both of them (namely, if it's random and not systematically biased, it just adds residual noise). The growth rate formulation is simpler; you don't need an intercept to account for initial height, or random effects on the time trend, and also it avoids the intrinsic mathematical autocorrelation that is baked into the first version.

That said, it makes sense to me that averaging the growth rate over the whole 2013-2019 period should minimize the effect of measurement error. I made some quick and dirty line plots of the incremental growth rates for each seedling, and they're all over the place.

As I mentioned by text, I think the changes in these values over time reflect incremental plant and bryophyte (esp. Sphagnum magellanicum) growth. Something about looking at these data in terms of incremental changes is rubbing me wrong, but I'm having trouble articulating why so I'll let it go for the moment, at least.

In this case the interpretation is the rate at which the seedling is winning or losing the race against its neighbors over each annual interval. Higher net growth rates on average indicate a more favorable patch / microsite.

If a seedling was recorded as newly dead, then it was alive the previous time it was observed . . . except maybe for one instance when I couldn't find the seedling in the previous year (present = 0, accounted for = 0, newly dead = 0, dead = NA) and then found it dead the next year. Would it help if I created an "alive" variable?

Right, the possibility of missing observations -- that is, the difference between "previous observation" and "previous year" -- was messing me up. There's an extra wrinkle here, in that ideally one would want the model of mortality risk to account for the length of the interval between consecutive observations. For reasons we can get into later if necessary, that's probably not doable without excessive finagling. So if it's true that there's only one such case in the whole dataset, that would be great.

I see that today you added models for growth rate in height above substrate. I'm about to take a look!

Yep, pretty much a copy-paste job from growth rate in total height. Let me know what you think!

kzaret commented 2 years ago

No, I'm saying do this for every transect. Of course gradient position will be strongly confounded with patch, but not completely -- there is a range of transect locations within each patch, especially Transition. The way I'd think of this is to compare the model with a fixed effect of patch to one with patch + distance. If the data favor the latter, then there's some fine-scale continuous variation along the gradient that isn't captured by the discrete patch types; conversely, controlling for such variation should help to isolate the effect of patch type. "Distance along gradient" doesn't necessarily have to be spatial, BTW. If you have hydrological or vegetation data that more directly captures the relevant factors, that would be another option.

As I stare at the map of the plantation and think about adding a "distance along gradient" variable, I don't think that the spatial gradient that runs west to east (or vice versa) across the entire study area is the most important gradient influencing seedling growth, vigor or mortality. I think that 'distance from the edge of the transition patch' (or, distance from the nearest edge, in the case of seedlings in the transition patch) could be important, in part due to the (potentially) reduced exposure to sun and wind offered by the trees and microtopography in that patch. For example, the seedlings at the western edge of the cushion patch are most exposed to the winds that blow inland off the fjords and across the open cushion species lawn. The seedlings on the eastern edge of the cushion patch are likely just as exposed to sun (i.e., I don't think the trees in the transition patch are tall enough to cast shadows across any of the seedlings in the cushion patch), but maybe a touch less so to wind. In the transition patch, I would think that seedlings in the interior of the patch are less exposed to the elements (the wind from the east, sun at various times of the day, and slightly buffered in terms of temperature relative to conditions in either the cushion or burned forest patches). In the burned forest patch, I think that seedlings are more "exposed" (especially to sun), the further they are from the transition edge. Does all of this make sense? (I have some temperature and relative humidity data that I still need to summarize [from about 10 ibuttons installed per patch across about seven days]. I suppose I could create two different "distance along gradient" variables and include them in separate sets of models.

ebuhle commented 2 years ago

As I stare at the map of the plantation and think about adding a "distance along gradient" variable, I don't think that the spatial gradient that runs west to east (or vice versa) across the entire study area is the most important gradient influencing seedling growth, vigor or mortality. I think that 'distance from the edge of the transition patch' (or, distance from the nearest edge, in the case of seedlings in the transition patch) could be important, in part due to the (potentially) reduced exposure to sun and wind offered by the trees and microtopography in that patch.

Hmm, OK, so transects in Cushion and Burned Forest would have positive distance values (increasing with distance from Transition edge) and those in Transition would have negative values (decreasing with distance from nearest edge)?

Your ecological rationale makes sense to me. Statistically, I'm not sure this dual-origin coordinate system (the east and west edges of Transition are both zero) accomplishes the goal of "controlling" for the non-contiguous blocks of transects in Transition because, e.g., a transect in the east block could have the same distance as one in the west block. Maybe that's not the most important thing to worry about, though.

Another consequence is that transects in Cushion and Burned Forest will have a similar range of distance, meaning that distance alone is not a proxy for patch type -- it's difficult to interpret unless patch is also in the model. And to that extent, it doesn't help at all with isolating the main effect of microhabitat.

kzaret commented 2 years ago

Regarding transects: I appreciate what you wrote about the downsides of using a dual-origin coordinate system; thus, I made a .csv with the variable 'transect distance from west to east' by group.

Regarding incremental growth (or, total change in height between 2013 and 2019): I created a "damaged" variable to track which seedlings' leaders were broken, died, etc. (i.e., when more than observer error or changing substrates might have affected my measurement of a seedling's height). I only recorded 1s in the first year when damage was observed, which ultimately may not be the most useful but it makes a start toward identifying which individuals I may want to remove from the analysis.

Regarding initial models: I'm still slowly absorbing the model output and posterior predictive checks for growth rate in height above substrate (I am still needing to refer to Gelman et al. to remember how terms are defined, etc). I wonder whether the means (?) of the data are sometimes higher than shown in the PPD due to the greater number of seedlings in the burned forest patch/transects/groups and the larger negative changes in those heights compared to the other patches/transects/etc.? When I look at the density overlay grouped by microhabitat, the data for the lawn, pool and other are most leptokurtic and change in height above substrate was smallest for those microhabitats as shown in my plot on slide 14. I wonder whether separate models should be created for each patch?

ebuhle commented 2 years ago

Continued in #2.