statdivlab / corncob

Count Regression for Correlated Observations with the Beta-binomial
103 stars 22 forks source link

application of corncob to longitudinal data #63

Closed slhogle closed 4 years ago

slhogle commented 5 years ago

Hi all,

thanks for corncob. I've enjoyed using this package so far.

I have a conceptual question - is this approach applicable to longitudinal data? If, for example, I wanted to look for differential variability in my replicate time series, is it appropriate to perform the test at every individual time point to test for this?

In your preprint you state accommodating longitudinal data could involve adding random-effects terms to the link functions for the beta-binomial model? Is this something you are planning to incorporate into corncob?

thanks again

-shane hogle

adw96 commented 5 years ago

Hey Shane -- I drafted a response to this ages ago and then forgot about it. Here is my response:

Hi Shane! Thanks for checking out the package. Bryan is a fantastic programmer and this is a really great method, IMO, so I'm glad you're enjoying using it too!

Could you let us know exactly what data structure and model you are thinking of? e.g., if you just had a single continuous response variable, what is the model you would like to fit? Once I know how you'd like to model your longitudinal data I can probably give a better answer.

We are investigating random effects models for corncob and hope to have an update for you soon. In the meantime, it's possible to treat each of your sites/replicates as fixed effects.

I hope that helps and I'm happy to follow up with more information.

Cheers,

Amy

slhogle commented 5 years ago

Hey Amy,

Thanks for following up on this. I appreciate your advice. I think I may not have explained my initial question clearly. Also I may be misunderstanding something about how the methods in corncob work (I'm at the limit of my stats knowledge here) but here goes...

We have relative abundances of species every 4 days for 60 days from 3 different 'biological replicates' for 4 experimental conditions. I want to try and say something about whether any species are differentially variable between conditions over time. So, for example, is the abundance of Taxa X more variable in the biological replicates from condition Y than in the replicates from condition Z?

From my understanding corncob can be used to answer this question at a single sampling point or snapshot in time. So is taxa X differentially variable at each time point from T(1-15)? And I could easily do this for each sampling time in my time series using corncob as it is currently implemented. My original question was if the approach of running a test at each sampled time point and then using those 15 discrete tests to say something about the entire time series is conceptually valid.

Alternatively, is there an approach with corncob that would allow you to use all the observations through time to say something about the entire time series or is that beyond the scope of the methods here?

thanks very much for any help you can give!

-shane

adw96 commented 5 years ago

Hmmmm... you can definitely interpret the results of the 15 tests together (FDR control recommended though). If you want to treat the time points as categorical variables in the mean and dispersion models, information will not be shared across the categories in estimation -- you're right about that. But if you wanted to treat time as a continuous variable, information would be shared. I don't think I'm providing you with any new information, but let me know if I am misunderstanding!

Where I typically begin with questions like this is asking you to think about what what model you would for if you were using lm() and just had a single continuous response. It's probably the case that the same model that you would use here is the model that you would use with corncob.

Is there another modeling tool that you've used that might give us an example of the type of model you have in mind? It seems like you're thinking of something cool and specific, but I don't yet understand/picture what it is!

I'm certain that you're already doing this, but make sure you do a sanity check by plotting the relative abundance of any signals you identify as interesting from the 🌽🌽 analysis!

Sorry for the vague answers and happy to help.

slhogle commented 4 years ago

Hey Amy,

I am sorry I took so long to reply. I've been away from work for awhile. Also I'm sorry for the confusing questions.

Basically what I've been using to model cell count data in the same experimental context I described above is a generalized linear mixed model and treating time and replicate as random effects as ~ day | sample_ID. I've also been using Generalized Estimating Equations with an AR1 correlation structure for this data with some success.

Anyway, I would find it pretty useful if Corncob had the capabilities to account for longitudinal data either with random effects or by specifying a within-group correlation structure.

Regardless, I've been super happy with the package in my initial testing!

thanks -shane

adw96 commented 4 years ago

Thanks so much for letting us know, Shane. I will comment again and be in touch when we have a longitudinal model up and running. It's in the works...

slhogle commented 4 years ago

Hey @adw96 ,

I am reopening this because I just noticed that the differentialTest function allows you to perform a likelihood ratio test and I am wondering if this could be an appropriate solution for a longitudinal design? Maybe posting this here will help someone else.

Say you are looking for the effect of a binary category "treatment" and you have repeated observations of the same system over time. If you expressed a differentialTest function call as:

da_analysis <- differentialTest(formula = ~ treatment + day + treatment:day, formula_null = ~ treatment + day, test = "LRT", data = myData)

This formula would model the difference between treatments at the T0, the effect of time shared by both treatments, and then basically any treatment-specific differences over time (ie the interaction term). Is null_formula the more restricted model for assessing difference in fit with the full model via the LRT?

Assuming this is the case then null_formula of~ treatment + day basically removes differences attributable to T0 treatment differences or the shared effects of time for both treatments and then taxa with p-values below your chosen type I error rate are taxa that are kind of changing abundances over time in a different way in the treatments after T0... Is that right?

This strategy is more or less what DESeq2 manual seems to recommend for time series

thank you for all the helpz! much appreciated -shane

bryandmartin commented 4 years ago

Hi @slhogle ,

null_formula can be thought of as the restricted model, yes. So here, what you are actually doing is conducting a LRT to test the difference between a model with and without a treatment/day interaction effect, controlling for the marginal effects of treatment and day on the abundance. (Note, you will want to also include something for phi.formula and phi.formula_null, even if it just specifying an intercept model).

I'm not totally clear on what you mean in your paragraph starting with "Assuming this is the case..." but I will do my best to answer and let me know if I'm missing the question. The null_formula here is what you are controlling for. If you are familiar, it is similar to how standard F-test compares nested linear models. You can think of it like your baseline model. Maybe you know that you should be controlling for treatment + day. However, you want to test whether also including an interaction effect between them explains a sufficient additional amount of the variability in your response to justify its inclusion in the model. You are doing this using a LRT to see whether the increase in log-likelihood is sufficiently different from zero.

Does this all make sense? I hope I'm answering your question, but please follow up if something is still unclear!

Bryan

slhogle commented 4 years ago

thanks @bryandmartin!

So using the above example - if we see a significant difference between a model that includes the interaction term treatment:day vs a model that doesn't is it correct to interpret a significant result as saying the data better supports a model where the abundance of strain X changes over time in a treatment dependent way?

Basically, what I was getting at is if you can use the LRT option in differentialTest in the same way that DESeq2 recommends the LRT for time series here?

http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#time-course-experiments

thanks again!! -shane

bryandmartin commented 4 years ago

Hi @slhogle ,

Yes, if the only difference in your models is the term treatment:day, then you are correct in saying that the LRT is testing whether the data supports a model where the expectation of the abundance (or dispersion, if you use corncob and include it in the phi.formula/phi.formula_null components) of taxon X changes over time in a treatment dependent way.

Based purely on the link you sent, this is equivalent to the type of test that the DESeq2 example is conducting.

slhogle commented 4 years ago

cheers thanks so much!