statOmics / tradeSeq

TRAjectory-based Differential Expression analysis for SEQuencing data
Other
238 stars 27 forks source link

associationTest gives NA for well-expressed genes #209

Open pavsol opened 2 years ago

pavsol commented 2 years ago

Hi, I am experiencing an issue with NA values coming from associationTest() similar as mentioned in #117. Only meanLogFC values are not NA. I am running an association test without any gene limitation, thus, there are also sparsely expressed genes and I can expect some NAs. However, I got NA values even for well-expressed genes that are obviously associated with pseudotime progression which we can also say based on prior knowledge. I get 13,749 NA genes out of 24,090. My dataset contains ~1000 cells and only a single lineage. See attached figures.

My pipeline in brief:

seurat <- readRDS("seurat.RDS")
sce <- as.SingleCellExperiment(seurat, assay = "RNA")
sce <- slingshot(sce, reducedDim = "UMAP", clusterLabels = colData(sce)$seurat_clusters, start.clus = 5, approx_points = 300)
sce <- fitGAM(sce, nknots = 6)
ATres <- associationTest(sce)

tradeSeq version: 1.7.07 R version: 4.1.3

Pseudotime (slingshot) pt

Expression of one of genes with NA values ex

I will greatly appreciate any help as those NA genes are unfortunately quite essential.

Thank you, Pavel

gomeznick86 commented 2 years ago

Hey @pavsol , I'm running into the same problem. Reducing the number of nPoints (2 * 6 knots =12 , to 10) seemed to significantly reduce the number of NAs generated. I'm not exactly sure why this is the case? Maybe sampling less is better? Hopefully one of the Devs has a much more statistically sound reason.

pavsol commented 2 years ago

Hi @gomeznick86 Thank you for sharing this observation. I re-fitted data using 5 knots which seem enough for my simple lineage. Then, the association test gave me only 3599 NAs with default 10 nPoints and only 237 NAs using 8 nPoints.

SierraN commented 1 year ago

Hello,

I'm having a similar issue with the fitGAM output contain exceedingly large amounts of NA's. I'm new to this package and I'm wondering if perhaps one of my arguments is incorrect. I start with slingshot, which identifies 5 lineages. I indicated a starting cluster, and UMAP dimension reduction. Then, for the fitGAM, I condition on case status (2 conditions) and use 8 knots. I've been parallelizing on a server over 30 workers, though it still takes an exceedingly long time to run. In the full data, I have a total of 27576 genes and 136641 cells, with 65908 cells in one group and 70733 cells in the other. I have run fitGAM with the full data and with a subset of genes that have at least 3 counts in more than 5% of the cells (1347). For the global p-value, only 2 of the 1347 high expressed genes are not NA and are recorded as 0. For the lineage/condition-specific p-values, most have about 650-700 non-NA values, though the 5th lineage has about 1000 non-nas and 1 lineage/condition has only 327 non-NAs. Overall, the Wald stats/p-values that are not NAs are then exceedingly high/low (respectively), and nearly all significant even after BH adjustment. These seem like questionable results. Do you have any suggestions on how to improve the estimation with fitGAM? And possibly run time too? I think the last run took about 8 hours.

dvaldes2 commented 1 year ago

Hi there, I am having the same issues discussed in the main question (associationTest gives NA for well-expressed genes). Do you have any advice as to why it is like this and if it will be resolved? Thanks!

Bberenicedug commented 1 year ago

Hi, I am also encoutering these issues. Interested if anyone has an advice :) Many thanks!

pavsol commented 1 year ago

Hi @dvaldes2 and @Bberenicedug as mentioned above, the solution for me was to decrease nknots for FitGAM function and also reduce Points in associationTest.

koenvandenberge commented 1 year ago

Hi @pavsol @gomeznick86 @SierraN @dvaldes2 @Bberenicedug ,

My apologies for being silent for so long, but I first wanted to look into this before responding. Thanks everyone for reporting this. This has been an issue for a while and we should have looked into this sooner. While reducing nPoints may be beneficial in some cases, there may also be other solutions.

The NA values returned by associationTest are a result of not being able to invert the variance-covariance matrix of the contrasts. With a disclaimer that, for now, I have mainly been looking at situations where l2fc=0, there are two ways we may find a work-around:

From my current exploration, it seems like contrastType="end" (comparing the end point of a lineage to points upstream in pseudotime) and using the inverse="Chol" (use Cholesky decomposition to invert the variance-covariance matrix) seems to work well and results in a lower amount of NA values. EDIT Also using inverse="eigen" seems to work well irrespective of contrastType.

For now, it would be really useful to me if you could try this (setting inverse="eigen" or contrastType="end" AND inverse="Chol") and let me know how that's working out for you, and which ones are (not). I have yet to look into situations where l2fc!=0 and am planning to do so soon.

DavoSam commented 1 year ago

@koenvandenberge I am currently experiencing this issue as well and want to test these recommendations. However im having a hard time finding where the inverse argument is. Thanks!

flde commented 1 year ago

Dear @koenvandenberge,

I had the same issue (one lineage, two condition) and tried the settings you suggested including default:

option_1: contrastType="start", inverse="Chol" option_2: contrastType="end", inverse="Chol" option_3: contrastType="start", inverse="eigen"

After removing NAs the settings yield option_1=3890, option_2=10463, and option_3=6606 genes. I am working on an erythroid lineage and checked Epor which should be associated with the PT in both condition. That is only the case for option_2. For option_1 it is not associated. and for option_3 only in one condition.

I attached an csv with the Epor association result for all three options. Many thanks for this thread and please let me know if I can do mor testing. I go with option_2 for now.

Best, Florian

epor.csv download

@DavoSam you probably need to install the github version of tradeseq the parameter are part of associationTest