Closed: hchintalapudi closed this issue 4 years ago
I did some digging and eventually figured out how to adjust for covariates. I used this formula:

```r
formula = 'Surv(donor_survival_time, donor_vital_status) ~ [*] + Tumor.Grade + Tumor.Stage + donor_age'
```

I have just one question: the result file has the HR for every gene, and then a row for every combination of gene and covariate, which I cannot easily interpret. Is it safe to assume that, for every gene, the first row and its HR represent the model adjusted for all covariates? For instance:
```
   Variable               Term         Beta StandardError           Z            P          LRT Wald      LogRank           HR      HRlower      HRupper
1:    X4300              X4300  0.008138261  2.105705e-03  3.86486333 1.111514e-04 0.0003570795    0 4.236459e-12 1.008171e+00 1.004019e+00 1.012341e+00
2:    X4300       Tumor.Grade2 12.335106624  2.921869e-01 42.21650025 0.000000e+00 0.0003570795    0 4.236459e-12 2.275458e+05 1.283391e+05 4.034397e+05
3:    X4300       Tumor.Grade3 13.337662623  2.966748e-01 44.95717678 0.000000e+00 0.0003570795    0 4.236459e-12 6.201165e+05 3.466915e+05 1.109183e+06
4:    X4300       Tumor.Grade4 12.306427818  7.298725e-01 16.86106566 8.700607e-64 0.0003570795    0 4.236459e-12 2.211127e+05 5.288653e+04 9.244475e+05
5:    X4300 Tumor.StageT1N1bMX -1.208063036  5.277504e+02 -0.00228908 9.981736e-01 0.0003570795    0 4.236459e-12 2.987754e-01 0.000000e+00           NA
```
For the above result, should I consider the first row for every gene to be the one representing all covariates and estimating survival?
Some advice is appreciated! Thank you again.
Himanshu
Hey Himanshu, the model formulae for this package are essentially the same as formulae for any regression model in R. So, indeed, if I have `SmokingStatus` as a covariate, then I would adjust for it via:

```r
Outcome ~ gene + SmokingStatus
```
In the case of RegParallel, every output line for your variable specified by `[*]` will already be adjusted for the covariates in your model. In your case, those are Tumor.Grade, Tumor.Stage, and donor_age. The package still outputs the results for these covariates, too, but you can choose to ignore them.
There is actually an `excludeTerms` parameter that automatically 'greps' out any term from the output for you.
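Putting the two points together, a minimal sketch of an adjusted call might look as follows, reusing the formula and covariate names quoted above; `survdata` is a hypothetical data frame holding the survival columns, the covariates, and one column per gene:

```r
library(RegParallel)
library(survival)

# Each gene is substituted into [*] in turn; every per-gene row in the output
# is therefore already adjusted for the three covariates in the formula.
res <- RegParallel(
  data = survdata,
  formula = 'Surv(donor_survival_time, donor_vital_status) ~ [*] + Tumor.Grade + Tumor.Stage + donor_age',
  FUN = function(formula, data)
    coxph(formula = formula, data = data,
          ties = 'breslow', singular.ok = TRUE),
  FUNtype = 'coxph',
  variables = colnames(survdata)[5:ncol(survdata)],  # gene columns only
  blocksize = 1000,
  cores = 2,
  # drop the covariate rows from the output, leaving one row per gene
  excludeTerms = c('Tumor.Grade', 'Tumor.Stage', 'donor_age'),
  excludeIntercept = TRUE)
```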
Thank you for the reply, Kevin. I eventually figured out the 'excludeTerms' and 'excludeIntercept' flags. This is the first time I'm doing survival analysis using gene expression values, and I'm worried about whether I'm doing it right. I have two separate datasets whose normalized counts were generated with DESeq2 size factors different from those produced by the estimateSizeFactors() function that one typically uses in DESeq2: the size factors for each sample were manually calculated from library sizes and supplied to DESeq2 to get normalized counts. I used these normalized counts, counts(dds, normalized = TRUE), and got the coxph results. But then I noticed that, in one of the Biostars threads on this subject, you suggested using vst/rlog-transformed values for gene expression in coxph analysis. Now I'm confused as to which one is appropriate, since I'm getting different results with size-factor-normalized counts vs. vst(dds, blind = TRUE). What would you suggest?
Thank you again!
Hey @hchintalapudi, my preference would be the output from vst(dds, blind = FALSE).
You will certainly notice a difference between normalised counts and vst/rlog due to the different data distributions. If you plot these via hist(), you will see what I mean.
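The comparison can be sketched as below; `dds` is assumed to be an existing DESeqDataSet (hypothetical here):

```r
library(DESeq2)

# Normalised counts remain negative-binomial-like (heavily right-skewed),
# while VST values are roughly normal on a log2-like scale.
norm_counts <- counts(dds, normalized = TRUE)
vst_mat <- assay(vst(dds, blind = FALSE))

par(mfrow = c(1, 2))
hist(norm_counts[, 1], breaks = 50, main = "Normalised counts (sample 1)")
hist(vst_mat[, 1], breaks = 50, main = "VST values (sample 1)")
```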
Thanks for clearing that up, Kevin. :) It was very helpful! Is there an effect between the variables/genes used in `[*]` in the formula? I ask because I've been getting abnormally high HRs, in the hundreds or thousands in one case, and when I investigated those raw gene counts I noticed a lot of zeroes for most samples. I therefore increased the filter stringency to reduce my count matrix. While doing this, I've been seeing different HRs for a given gene on every run of RegParallel, and I don't understand why data reduction changes the HRs each time.
RegParallel just fits the models for you via lm(), glm(), bayesglm(), or coxph(), and extracts the information from those third-party functions. You would have to understand regression modelling generally in order to understand why your ORs or HRs are so inflated. If you try to fit a model to heavily-skewed data, data with outliers, data with high missingness, data with many zeros, etc., then you will always see unusual / implausible values being returned.
To clarify, in addition:

- `~ [*]` == no adjustment for anything. The values that are returned for your variables defined by `[*]` have no adjustment.
- `~ [*] + BMI + SmokingStatus` == the values that are returned for your variables defined by `[*]` have been adjusted for BMI and SmokingStatus.

Also, to state again: you should not be using raw or normalised counts with RegParallel. No function in RegParallel supports the data distribution on which these counts are measured, i.e., a negative binomial distribution. You need your data to be normalised and transformed to follow a normal distribution.
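One way to prepare such input is sketched below, under the assumption of a hypothetical DESeqDataSet `dds` and a per-sample data frame `clinical` holding the survival and covariate columns:

```r
library(DESeq2)

# VST output is approximately normally distributed, unlike raw or
# size-factor-normalised counts.
vst_mat <- assay(vst(dds, blind = FALSE))   # genes x samples

# RegParallel expects samples as rows and one column per variable,
# so transpose the expression matrix and bind on the clinical columns.
survdata <- cbind(clinical, t(vst_mat))
```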
Thanks for the clarification. I understand that the data could be skewed, or that certain genes had abnormally low/high counts; I was just eyeballing the raw/normalized counts to find these genes (I used vst counts for the model, as advised). What I hope to understand is whether there is any additive effect between the genes/variables beyond each individual gene being considered in the formula. I only ask because removing genes from the dataset changes the HR values: for instance, the HR of gene A does not remain constant when I drop some other genes from the dataset while keeping the covariates the same. Thanks for your time!
Oh, you wanted to check additive effects of genes. That should be possible via `~ [*] + geneY + geneX`. Anyway, I will close this for now - please re-open if there are still issues.
Hi @kevinblighe, hope you're well! Thanks for your help earlier with the RegParallel implementation and analysis. I've used vst counts to get HRs for a set of genes, and I now want to look at survival for specific statistically significant genes, comparing high- and low-expression samples. I see in your tutorials that you divided samples based on Z-scaling and cut-offs. I don't think I can do that, as I'm using vst counts. Can you advise me on how to divide something like vst counts into high, low, and mid expression levels for survival curves? Are tertiles/quartiles acceptable in this case?
Thank you! I would really appreciate your two cents..
Hey @hchintalapudi , there are many ways to do this, such as quartiles, tertiles, or binary (divided at median). Z-scores can be used, too.
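The splits mentioned above can be sketched as follows; `expr` stands in for a single gene's VST values across samples (simulated here for illustration):

```r
# Three common ways to stratify one gene's expression for Kaplan-Meier curves.
set.seed(42)
expr <- rnorm(100, mean = 8, sd = 1)  # stand-in for vst values of one gene

# 1) Z-scores with a +/- 1 SD cut-off: low / mid / high
z <- (expr - mean(expr)) / sd(expr)
group_z <- cut(z, breaks = c(-Inf, -1, 1, Inf),
               labels = c("low", "mid", "high"))

# 2) Tertiles
group_t <- cut(expr,
               breaks = quantile(expr, probs = c(0, 1/3, 2/3, 1)),
               include.lowest = TRUE,
               labels = c("low", "mid", "high"))

# 3) Binary split at the median
group_m <- ifelse(expr > median(expr), "high", "low")
```

Any of these groupings can then be passed to survfit() as the stratifying factor.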
Kind regards, Kevin
So you're saying it is perfectly alright to convert vst counts to Z-scores and then encode them to high/low? I didn't know if this was right. Thanks.
Yep, it's not a major issue to use Z-scores from VST
That was helpful as always. Have a good day!
Hi @kevinblighe, hope you're doing well! I just have one question regarding the output of my RegParallel analysis. Here's the formula I used:
```r
res_Tel.vsd <- RegParallel(
  data = ICGCdata_Tel_vsd_surv,
  formula = 'Surv(donor_survival_time, donor_vital_status) ~ [*] + Tumor.Grade + AJCC.Pathology.Stage + mRNA.Bailey.Clusters.1squamous.2immunogenic.3progenitor.4ADEX + donor_age_at_diagnosis',
  FUN = function(formula, data)
    coxph(formula = formula, data = data,
          ties = 'breslow', singular.ok = TRUE),
  FUNtype = 'coxph',
  variables = colnames(ICGCdata_Tel_vsd_surv)[7:ncol(ICGCdata_Tel_vsd_surv)],
  blocksize = 1000,
  cores = 2,
  nestedParallel = FALSE,
  conflevel = 95,
  excludeTerms = c('donor_survival_time', 'donor_vital_status',
                   'Tumor.Grade', 'AJCC.Pathology.Stage',
                   'mRNA.Bailey.Clusters.1squamous.2immunogenic.3progenitor.4ADEX',
                   'donor_age_at_diagnosis'),
  excludeIntercept = TRUE)
```
You said the excludeTerms parameter is for grepping out the terms not required in the output, and I followed that. But when I set excludeTerms = NULL and excludeIntercept = FALSE, I get the following output:
```
> head(res_Tel.vsd)
          Variable                    Term       Beta StandardError            Z            P          LRT        Wald      LogRank           HR   HRlower   HRupper
1: HUERSP3_1p36.13         HUERSP3_1p36.13 -0.8153119  2.053858e-01 -3.969659506 7.197541e-05 5.649107e-05 0.007072851 0.0002697203 4.425013e-01 0.2958624 0.6618191
2: HUERSP3_1p36.13            Tumor.Grade2 18.2627329  1.365426e+04  0.001337512 9.989328e-01 5.649107e-05 0.007072851 0.0002697203 8.538943e+07 0.0000000        NA
3: HUERSP3_1p36.13            Tumor.Grade3 18.0269023  1.365426e+04  0.001320241 9.989466e-01 5.649107e-05 0.007072851 0.0002697203 6.745034e+07 0.0000000        NA
4: HUERSP3_1p36.13            Tumor.Grade4 18.4589181  1.365426e+04  0.001351880 9.989214e-01 5.649107e-05 0.007072851 0.0002697203 1.038978e+08 0.0000000        NA
5: HUERSP3_1p36.13  AJCC.Pathology.StageIB 19.3591470  4.224167e+03  0.004582950 9.963433e-01 5.649107e-05 0.007072851 0.0002697203 2.556058e+08 0.0000000        NA
6: HUERSP3_1p36.13 AJCC.Pathology.StageIIA 19.0797743  4.224167e+03  0.004516813 9.963961e-01 5.649107e-05 0.007072851 0.0002697203 1.933039e+08 0.0000000        NA
```
So I compared both outputs and I see that the first row in both is matching. My question is: is the first-row result still adjusted for all the covariates (Tumor.Grade + AJCC.Pathology.Stage + mRNA.Bailey.Clusters.1squamous.2immunogenic.3progenitor.4ADEX + donor_age_at_diagnosis)? Or is it reporting the effect of the gene alone, with the subsequent rows showing the effect of each covariate? This question arose when I wanted to see whether donor_age_at_diagnosis is significant in the model.
Thanks for the amazing tool again!
Hey again, yes, each term is 'adjusted' for all other terms in the model, even if you use the excludeTerms parameter. Sorry that this was not clear.
So, the statistics for HUERSP3_1p36.13 are adjusted for all of these: Tumor.Grade, AJCC.Pathology.Stage, mRNA.Bailey.Clusters.1squamous.2immunogenic.3progenitor.4ADEX, and donor_age_at_diagnosis.
Thank you! That's a relief.
Hi, thanks for the amazing tool you developed. I wanted to use it for a coxph model using 3200 endogenous retroviral genes (ERVs) and the samples' survival times and vital status. The code I used:

```r
res2 <- RegParallel(
  data = ICGCdata_surv,
  formula = 'Surv(donor_survival_time, donor_vital_status) ~ [*]',
  FUN = function(formula, data)
    coxph(formula = formula, data = data,
          ties = 'breslow', singular.ok = TRUE),
  FUNtype = 'coxph',
  variables = colnames(ICGCdata_surv)[3:ncol(ICGCdata_surv)],
  blocksize = 2000,
  cores = 2,
  nestedParallel = FALSE,
  conflevel = 95)
res2 <- res2[!is.na(res2$P),]
```
It works fine and I got the results, but the HRs of all variables are pretty close to one another, apart from some values that are complete anomalies, so I'm assuming there's an additive effect between these 3220 genes driving this result. I understand from one of your threads that narrowing down the gene list before building the model is appropriate (I only learnt this recently), so would using differentially expressed genes (survival high vs. low, in our case) be more appropriate? I also don't understand the benefit of this parallel method (other than computational speed, of course) over running coxph separately for each gene, because the results would obviously vary.
The other concern is that we have some covariates: age, sex, tumor grade, tumor stage, and tumor purity. I tried to incorporate these variables into the model and the result spiraled into 7134 variables in the result table (I'm assuming, again, additive effects and combinations). I've also noticed in your tutorial that you constructed the metadata columns but only used gene expression values for running the model. So how can I correct for these covariates in this type of RegParallel setting? Some help/insight into the matter is greatly appreciated! Thanks again,
Best, Himanshu.