statOmics / tradeSeq

TRAjectory-based Differential Expression analysis for SEQuencing data
Other
231 stars 27 forks source link

query about conditionTest significant genes #237

Open EH0234 opened 1 year ago

EH0234 commented 1 year ago

Hi TradeSeq team,

I have a dataset from slingshot with three lineages and 3 conditions. I ran fitGAM, associationTest and conditionTest as follows:

batch <- colData(sce)$orig.ident U <- model.matrix(~batch) sce <- fitGAM(sce, verbose=TRUE, genes = PSEUDOTIME_GENES$gene , conditions = factor(colData(sce)$disease), U = U) rowData(sce)$assocRes <- associationTest(sce, lineages = TRUE) RES1 <- conditionTest(sce ,global = TRUE, pairwise = TRUE, l2fc = 0, eigenThresh = 0.01, lineages = TRUE)

The conditionTest output gave the results of three comparisons per lineage as expected: pvalue_lineage1_conds1vs2, pvalue_lineage1_conds2vs3, pvalue_lineage1_conds1vs3 etc. but when I visualise the smoothed expression patterns of genes with an adjusted p value <0.05 for conds2vs3 & >=0.05 for the other comparisons (middle heatmap), it seems that these genes are more different in conds1. The conds1vs2 and conds1vs3 significant genes seem to be OK (outer heatmaps). Shown are results for lineage 1, but the other lineages show a similar phenomenon.

image

koenvandenberge commented 1 year ago

Hi @EH0234

Could you please expand on what the heatmaps are showing? What are the cells colored by and what are the different panels representing? You may be seeing unexpected results when plotting heatmaps for each condition, while also scaling the expression for each condition separately, which would effectively eliminate any mean difference between conditions. Scaling the expression over all conditions could help, if you are not doing that. If unsure about the results for specific genes, you may also want to take a look at visualizing using plotSmoothers.

EH0234 commented 1 year ago

Hi @koenvandenberge ,

Thank you for the reply. I think I figured out the issue. It seems to be due to incorrect labelling of the smoothed expression values in the predictSmooth output when tidy = FALSE. When tidy = TRUE the same values are labelled differently. See below using one gene as an example:

predictSmooth <- predictSmooth(sce, gene = "ENSG00000170458" , nPoints = 10, tidy = FALSE) predictSmooth_TIDY <- predictSmooth(sce, gene = "ENSG00000170458" , nPoints = 10, tidy = TRUE) predictSmooth <- as.data.frame(t(predictSmooth))

predictSmooth ENSG00000170458 lineage1_conditionCD_point1 1.3269963 lineage1_conditionCD_point2 3.4646428 lineage1_conditionCD_point3 7.4538497 lineage1_conditionCD_point4 10.8885394 lineage1_conditionCD_point5 9.1457271 lineage1_conditionCD_point6 6.8768231 lineage1_conditionCD_point7 4.1079422 lineage1_conditionCD_point8 1.6439890 lineage1_conditionCD_point9 0.7330838 lineage1_conditionCD_point10 0.4328705 lineage2_conditionCD_point1 0.6720436 lineage2_conditionCD_point2 1.8392069 lineage2_conditionCD_point3 4.0874128 lineage2_conditionCD_point4 5.9901264 lineage2_conditionCD_point5 4.8021526 lineage2_conditionCD_point6 2.8344472 lineage2_conditionCD_point7 1.2260979 lineage2_conditionCD_point8 0.4065247 lineage2_conditionCD_point9 0.1815357 lineage2_conditionCD_point10 0.1246889

predictSmooth_TIDY lineage time condition gene yhat 1 1 0.000000 CD ENSG00000170458 1.3269963 2 1 2.028928 CD ENSG00000170458 3.4646428 3 1 4.057855 CD ENSG00000170458 7.4538497 4 1 6.086783 CD ENSG00000170458 10.8885394 5 1 8.115711 CD ENSG00000170458 9.1457271 6 1 10.144638 CD ENSG00000170458 6.8768231 7 1 12.173566 CD ENSG00000170458 4.1079422 8 1 14.202493 CD ENSG00000170458 1.6439890 9 1 16.231421 CD ENSG00000170458 0.7330838 10 1 18.260349 CD ENSG00000170458 0.4328705 11 1 0.000000 Healthy ENSG00000170458 0.6720436 12 1 2.029447 Healthy ENSG00000170458 1.8392069 13 1 4.058894 Healthy ENSG00000170458 4.0874128 14 1 6.088341 Healthy ENSG00000170458 5.9901264 15 1 8.117788 Healthy ENSG00000170458 4.8021526 16 1 10.147235 Healthy ENSG00000170458 2.8344472 17 1 12.176682 Healthy ENSG00000170458 1.2260979 18 1 14.206130 Healthy ENSG00000170458 0.4065247 19 1 16.235577 Healthy ENSG00000170458 0.1815357 20 1 18.265024 Healthy ENSG00000170458 0.1246889

In predictSmooth, the second 10 spline points are labelled as ‘lineage2_condition CD’ whereas these same values are labelled as healthy, lineage1 in predictSmooth_TIDY.

I think predictSmooth_TIDY has the correct labels.

The heatmaps above represent smoothed expression values derived from the predictSmooth output when tidy = FALSE and should represent conds1 (CD), conds2(Healthy), conds3 (UC) based upon the labels from this output, but I think those labels are incorrect, hence the unexpected patterns when visualizing the differentially expressed genes.

Do you agree?

koenvandenberge commented 1 year ago

Thank you for sharing this, this indeed makes sense based on your output. I will look into this ASAP and get back to you.

koenvandenberge commented 9 months ago

This should now be fixed. Thanks very much for reporting this, your efforts helped us in improving the package. Feel free to reopen if you believe something is still off.

klai001 commented 9 months ago

would it be possible to retain the original names of whatever condition1vscondition2 represents in the output of conditionTest? my code chunk is as follows: fit.mod=fitGAM(counts = sce, sce=T,condition=factor(colData(sce)$group), nknots = 6, verbose = T, parallel=TRUE, BPPARAM = BPPARAM) conditionTest(fit.mod ,pairwise = TRUE, l2fc = 2, eigenThresh = 0.01, lineages = TRUE), i do not see any argument in conditionTest() which allows retaining what condition1 etc. or which group each condition represents in the output.

EH0234 commented 8 months ago

Hi @koenvandenberge ,

My code is as follows:

sce_markers <- fitGAM(sce, verbose=TRUE, conditions = factor(colData(sce)$disease), genes = genes, U = U) Fitting lineages with multiple conditions. This method has been tested on a couple of datasets, but is still in an experimental phase. |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=30s
Warning messages: 1: In asMethod(object) : sparse->dense coercion: allocating vector of size 2.9 GiB 2: In .findKnots(nknots, pseudotime, wSamp) : Impossible to place a knot at all endpoints.Increase the number of knots to avoid this issue.

ConditionTest_markers <- conditionTest(sce_markers ,global = TRUE, pairwise = TRUE, l2fc = 0, eigenThresh = 0.01, lineages = TRUE)

levels(factor(colData(sce)$disease)) [1] "CD" "Healthy" "UC"

colnames(ConditionTest_markers) [1] "waldStat" "df" "pvalue" "waldStat_lineage1_conds1vs2" [5] "df_lineage1_conds1vs2" "pvalue_lineage1_conds1vs2" "waldStat_lineage1_conds1vs3" "df_lineage1_conds1vs3"
[9] "pvalue_lineage1_conds1vs3" "waldStat_lineage1_conds2vs3" "df_lineage1_conds2vs3" "pvalue_lineage1_conds2vs3"
[13] "waldStat_lineage2_conds1vs2" "df_lineage2_conds1vs2" "pvalue_lineage2_conds1vs2" "waldStat_lineage2_conds1vs3" [17] "df_lineage2_conds1vs3" "pvalue_lineage2_conds1vs3" "waldStat_lineage2_conds2vs3" "df_lineage2_conds2vs3"
[21] "pvalue_lineage2_conds2vs3" "waldStat_lineage3_conds1vs2" "df_lineage3_conds1vs2" "pvalue_lineage3_conds1vs2"
[25] "waldStat_lineage3_conds1vs3" "df_lineage3_conds1vs3" "pvalue_lineage3_conds1vs3" "waldStat_lineage3_conds2vs3" [29] "df_lineage3_conds2vs3" "pvalue_lineage3_conds2vs3"

Does that answer your question?