kevinblighe / RegParallel

Standard regression functions in R enabled for parallel processing over large data-frames.

Help needed with covariates with coxph #1

Closed hchintalapudi closed 4 years ago

hchintalapudi commented 4 years ago

Hi, thanks for the amazing tool you developed. I wanted to use it for a Cox PH model with 3200 endogenous retroviral genes (ERVs) and the samples' survival times and vital status. The code I used:

```r
res2 <- RegParallel(
  data = ICGCdata_surv,
  formula = 'Surv(donor_survival_time, donor_vital_status) ~ [*]',
  FUN = function(formula, data)
    coxph(formula = formula, data = data, ties = 'breslow', singular.ok = TRUE),
  FUNtype = 'coxph',
  variables = colnames(ICGCdata_surv)[3:ncol(ICGCdata_surv)],
  blocksize = 2000,
  cores = 2,
  nestedParallel = FALSE,
  conflevel = 95)
res2 <- res2[!is.na(res2$P),]
```

Now it works fine. I got the results, but the HRs of all variables are pretty close, apart from some values that are complete anomalies, so I'm assuming there's an additive effect between these 3200 genes that's driving this result. I understand from one of your threads that narrowing down the gene list before building the model is appropriate (I only learnt this recently), so would using differentially expressed genes (survival high vs. low, in our case) be more appropriate? I also don't understand the benefit of this parallel method (other than computational speed, of course) over running coxph separately for each gene, because the results would obviously vary.

The other concern is that we have some covariates: age, sex, tumor grade, tumor stage, and tumor purity. I tried to incorporate these variables into the model and the result spiraled into 7134 variables in the result table (again, I'm assuming combinations of genes and covariate levels). I've also noticed that in your tutorial you constructed the metadata columns but only used gene expression values when running the model. So how can I correct for these covariates in this kind of RegParallel setting? Any help/insights into the matter are greatly appreciated! Thanks again,

Best, Himanshu.

hchintalapudi commented 4 years ago

I did some digging and eventually figured out how to adjust for covariates. I used this formula:

```r
formula = 'Surv(donor_survival_time, donor_vital_status) ~ [*] + Tumor.Grade + Tumor.Stage + donor_age'
```

I have just one question: the result table has the HR for every gene, and then a row for every combination of gene and covariate level, which I can't easily interpret. Is it safe to assume that, for every gene, the first row and its HR represent the model with adjustment for all covariates? For instance:

```
   Variable               Term         Beta StandardError           Z            P          LRT Wald
1:    X4300              X4300  0.008138261  2.105705e-03  3.86486333 1.111514e-04 0.0003570795    0
2:    X4300       Tumor.Grade2 12.335106624  2.921869e-01 42.21650025 0.000000e+00 0.0003570795    0
3:    X4300       Tumor.Grade3 13.337662623  2.966748e-01 44.95717678 0.000000e+00 0.0003570795    0
4:    X4300       Tumor.Grade4 12.306427818  7.298725e-01 16.86106566 8.700607e-64 0.0003570795    0
5:    X4300 Tumor.StageT1N1bMX -1.208063036  5.277504e+02 -0.00228908 9.981736e-01 0.0003570795    0
        LogRank           HR      HRlower      HRupper
1: 4.236459e-12 1.008171e+00 1.004019e+00 1.012341e+00
2: 4.236459e-12 2.275458e+05 1.283391e+05 4.034397e+05
3: 4.236459e-12 6.201165e+05 3.466915e+05 1.109183e+06
4: 4.236459e-12 2.211127e+05 5.288653e+04 9.244475e+05
5: 4.236459e-12 2.987754e-01 0.000000e+00           NA
```

For the above result, should I consider the first row for every gene to be the one representing all covariates and estimating survival?

Some advice is appreciated! Thank you again.

Himanshu

kevinblighe commented 4 years ago

Hey Himanshu, the model formulae for this package are essentially the same as formulae for any regression model in R. So, indeed, if I have SmokingStatus as a covariate, then I would adjust for it via:

```r
Outcome ~ gene + SmokingStatus
```

In the case of RegParallel, indeed, every output line for the variable specified by [*] will already be adjusted for the covariates in your model. In your case, those are Tumor.Grade, Tumor.Stage, and donor_age. The package still outputs the results for these covariates, too, but you can choose to ignore them.
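To illustrate, fitting one gene directly with survival's coxph() gives the same adjusted per-gene coefficient that RegParallel reports on the gene's row (a sketch reusing the data and column names from this thread):

```r
library(survival)

# The X4300 coefficient in summary(fit) is adjusted for all covariates,
# exactly as on the per-gene row that RegParallel returns.
fit <- coxph(Surv(donor_survival_time, donor_vital_status) ~
               X4300 + Tumor.Grade + Tumor.Stage + donor_age,
             data = ICGCdata_surv, ties = 'breslow')
summary(fit)
```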

There is actually an excludeTerms parameter that automatically 'greps' out any term from the output for you.
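For example, a sketch reusing the call from this thread (the excludeTerms values are assumed to match the covariate names exactly):

```r
res <- RegParallel(
  data = ICGCdata_surv,
  formula = 'Surv(donor_survival_time, donor_vital_status) ~ [*] + Tumor.Grade + Tumor.Stage + donor_age',
  FUN = function(formula, data)
    coxph(formula = formula, data = data, ties = 'breslow', singular.ok = TRUE),
  FUNtype = 'coxph',
  variables = colnames(ICGCdata_surv)[3:ncol(ICGCdata_surv)],
  blocksize = 2000,
  cores = 2,
  nestedParallel = FALSE,
  conflevel = 95,
  excludeTerms = c('Tumor.Grade', 'Tumor.Stage', 'donor_age'),
  excludeIntercept = TRUE)
# only the per-gene rows remain; each is still adjusted for the covariates
```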

hchintalapudi commented 4 years ago

Thank you for the reply, Kevin. I eventually figured out the 'excludeTerms' and 'excludeIntercept' flags. This is the first time I'm doing survival analysis using gene expression values, and I'm worried about whether I'm doing it right. I have two separate datasets whose normalized counts were generated with DESeq2 size factors, but differently from the estimateSizeFactors function one typically uses in DESeq2: the size factors for each sample were manually calculated from library sizes and supplied to DESeq2 to get normalized counts. I used these normalized counts, counts(dds, normalized = TRUE), and got the coxph results. But then I noticed that, in one of the Biostars threads on this subject, you suggested using vst/rlog-transformed values for Cox PH analysis. Now I'm confused as to which is appropriate, since I get different results with size-factor-normalized counts vs. vst(dds, blind = TRUE). What would you suggest?

Thank you again!

kevinblighe commented 4 years ago

Hey @hchintalapudi , my preference would be the output from vst(dds, blind = FALSE).

You will certainly notice a difference between normalised counts and vst/rlog due to the different data distributions. If you plot these via hist(), you will see what I mean.
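A quick way to see this difference (a sketch assuming a DESeqDataSet named dds):

```r
library(DESeq2)

par(mfrow = c(1, 2))
hist(counts(dds, normalized = TRUE), breaks = 100,
     main = 'Normalised counts', xlab = 'value')  # long right tail, many zeros
hist(assay(vst(dds, blind = FALSE)), breaks = 100,
     main = 'VST', xlab = 'value')                # roughly symmetric
```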

hchintalapudi commented 4 years ago

Thanks for clearing that up, Kevin. :) It was very helpful! Is there an interaction between the variables/genes substituted into [*] in the formula? I ask because I've been getting abnormally high HRs, in the hundreds or thousands in one case, and when I investigated the raw counts of those genes I noticed a lot of zeroes across most samples. I therefore increased the filtering stringency to reduce my count matrix. While doing this, I've seen different HRs for a given gene on every run of RegParallel, and I don't understand why data reduction changes the HRs each time.

kevinblighe commented 4 years ago

RegParallel just fits the models for you via lm(), glm(), bayesglm(), or coxph() and extracts the information from those third-party functions. You would have to understand regression modeling generally in order to understand why your ORs or HRs are so inflated. If you try to fit a model to heavily-skewed data, data with outliers, data with high missingness, data with many zeros, etc., then you will always see unusual / implausible values being returned.

To clarify, in addition: you should not be using raw or normalised counts with RegParallel. No function in RegParallel supports the data distribution on which these counts are measured, i.e., a negative binomial distribution. You need your data to be normalised and transformed so that they follow a normal distribution.
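In practice that means transforming first and then binding the expression matrix to the survival columns, e.g. (a sketch; the clinical object and column names are assumptions):

```r
library(DESeq2)

vsd <- vst(dds, blind = FALSE)   # variance-stabilising transformation
expr <- t(assay(vsd))            # samples x genes

# combine with the survival columns to build the RegParallel input
surv_df <- data.frame(
  clinical[, c('donor_survival_time', 'donor_vital_status')],
  expr)
```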

hchintalapudi commented 4 years ago

Thanks for the clarification. I understand that the data could be skewed or that certain genes had abnormally low/high counts; I was just eyeballing the raw/normalized counts to learn about those genes (I used vst counts for the analysis, as advised). What I'd like to understand is whether there is any additive effect between the genes/variables beyond each individual gene being considered in the formula. I only ask because removing genes from the dataset changes the HR values: for instance, the HR of gene A does not remain constant when I drop some other genes from the dataset while keeping the covariates the same. Thanks for your time!

kevinblighe commented 4 years ago

Oh, you wanted to check additive effects of genes. That should be possible via `~ [*] + geneY + geneX`. Anyway, I will close this for now - please re-open if there are still issues.

hchintalapudi commented 4 years ago

Hi @kevinblighe, hope you're well! Thanks for your help earlier with the RegParallel implementation and analysis. I've used vst counts to get HRs for a set of genes, and now I want to compare survival between high- and low-expression samples for specific statistically significant genes. I see in your tutorials that you divided samples based on Z-scaling and cut-offs. I don't think I can do that here, as I'm using vst counts. Can you advise me on how to approach dividing something like vst counts into high, low, and mid expression levels for survival curves? Are tertiles/quartiles acceptable in this case?

Thank you! I would really appreciate your two cents..

kevinblighe commented 4 years ago

Hey @hchintalapudi , there are many ways to do this, such as quartiles, tertiles, or binary (divided at median). Z-scores can be used, too.

Kind regards, Kevin
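Any of these splits can be made directly on the VST values; for example (a sketch, where expr is assumed to be the VST vector for one gene across samples):

```r
# Binary split at the median
group_median <- ifelse(expr > median(expr), 'High', 'Low')

# Tertiles: Low / Mid / High
group_tertile <- cut(expr,
                     breaks = quantile(expr, probs = seq(0, 1, length.out = 4)),
                     labels = c('Low', 'Mid', 'High'),
                     include.lowest = TRUE)

# Z-scores, dichotomised at zero
z <- as.numeric(scale(expr))
group_z <- ifelse(z > 0, 'High', 'Low')
```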

hchintalapudi commented 4 years ago

So you're saying it is perfectly alright to convert vst counts to Z-scores and then encode them as high/low? I wasn't sure whether this was right. Thanks.

kevinblighe commented 4 years ago

Yep, it's not a major issue to use Z-scores from VST.

hchintalapudi commented 4 years ago

That was helpful as always. Have a good day!

hchintalapudi commented 3 years ago

Hi @kevinblighe, hope you're doing well! I just have one question regarding the output of my RegParallel analysis. Here's the call I used:

```r
res_Tel.vsd <- RegParallel(
  data = ICGCdata_Tel_vsd_surv,
  formula = 'Surv(donor_survival_time, donor_vital_status) ~ [*] + Tumor.Grade + AJCC.Pathology.Stage + mRNA.Bailey.Clusters.1squamous.2immunogenic.3progenitor.4ADEX + donor_age_at_diagnosis',
  FUN = function(formula, data)
    coxph(formula = formula, data = data, ties = 'breslow', singular.ok = TRUE),
  FUNtype = 'coxph',
  variables = colnames(ICGCdata_Tel_vsd_surv)[7:ncol(ICGCdata_Tel_vsd_surv)],
  blocksize = 1000,
  cores = 2,
  nestedParallel = FALSE,
  conflevel = 95,
  excludeTerms = c('donor_survival_time', 'donor_vital_status', 'Tumor.Grade',
                   'AJCC.Pathology.Stage',
                   'mRNA.Bailey.Clusters.1squamous.2immunogenic.3progenitor.4ADEX',
                   'donor_age_at_diagnosis'),
  excludeIntercept = TRUE)
```

You said excludeTerms param is for grepping out the terms not required in the output and I followed it. But when I don't use excludeTerms by setting excludeTerms = NULL and excludeIntercept = FALSE, I get the following output.

```
> head(res_Tel.vsd)
          Variable                    Term       Beta StandardError            Z            P          LRT        Wald
1: HUERSP3_1p36.13         HUERSP3_1p36.13 -0.8153119  2.053858e-01 -3.969659506 7.197541e-05 5.649107e-05 0.007072851
2: HUERSP3_1p36.13            Tumor.Grade2 18.2627329  1.365426e+04  0.001337512 9.989328e-01 5.649107e-05 0.007072851
3: HUERSP3_1p36.13            Tumor.Grade3 18.0269023  1.365426e+04  0.001320241 9.989466e-01 5.649107e-05 0.007072851
4: HUERSP3_1p36.13            Tumor.Grade4 18.4589181  1.365426e+04  0.001351880 9.989214e-01 5.649107e-05 0.007072851
5: HUERSP3_1p36.13  AJCC.Pathology.StageIB 19.3591470  4.224167e+03  0.004582950 9.963433e-01 5.649107e-05 0.007072851
6: HUERSP3_1p36.13 AJCC.Pathology.StageIIA 19.0797743  4.224167e+03  0.004516813 9.963961e-01 5.649107e-05 0.007072851
        LogRank           HR   HRlower   HRupper
1: 0.0002697203 4.425013e-01 0.2958624 0.6618191
2: 0.0002697203 8.538943e+07 0.0000000        NA
3: 0.0002697203 6.745034e+07 0.0000000        NA
4: 0.0002697203 1.038978e+08 0.0000000        NA
5: 0.0002697203 2.556058e+08 0.0000000        NA
6: 0.0002697203 1.933039e+08 0.0000000        NA
```

I compared both outputs and the first row matches in both. My question is: is the first-row result still accounting for all the covariates (Tumor.Grade + AJCC.Pathology.Stage + mRNA.Bailey.Clusters.1squamous.2immunogenic.3progenitor.4ADEX + donor_age_at_diagnosis)? Or is it reporting the effect of the gene only, with the subsequent rows showing the effect of each covariate? This question arose because I wanted to see whether "donor_age_at_diagnosis" is significant in the model.

Thanks for the amazing tool again!

kevinblighe commented 3 years ago

Hey again, yes, each term is 'adjusted' for all other terms in the model, even if you use the excludeTerms parameter. Sorry that this was not clear.

So, the statistics for HUERSP3_1p36.13 are adjusted for all of these: Tumor.Grade, AJCC.Pathology.Stage, mRNA.Bailey.Clusters.1squamous.2immunogenic.3progenitor.4ADEX, and donor_age_at_diagnosis.
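If you want to inspect the significance of donor_age_at_diagnosis itself, you can either keep its rows in the output (excludeTerms = NULL) and read its P column, or fit the same model once with coxph() directly (a sketch with the object and column names from this thread):

```r
library(survival)

fit <- coxph(Surv(donor_survival_time, donor_vital_status) ~
               HUERSP3_1p36.13 + Tumor.Grade + AJCC.Pathology.Stage +
               mRNA.Bailey.Clusters.1squamous.2immunogenic.3progenitor.4ADEX +
               donor_age_at_diagnosis,
             data = ICGCdata_Tel_vsd_surv, ties = 'breslow')
summary(fit)  # every coefficient here is adjusted for all the others
```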

hchintalapudi commented 3 years ago

Thank you! That's a relief.