statOmics / tradeSeq

TRAjectory-based Differential Expression analysis for SEQuencing data
Other
232 stars 27 forks source link

how is the end points defined for diffEndTest() #155

Open liuweihanty opened 3 years ago

liuweihanty commented 3 years ago

Hi there, I have a question which is still unclear after reading the 2020 nature communication paper. So for diffEndTest() function, how does it define the "end points" for each lineage to compare? For example, how many cells does it choose from the terminus of each lineage and how does it choose these cells?(eg just chose the cells with largest pseudotime values?) Thank you so much!

HectorRDB commented 3 years ago

Hi, The endpoints are the endpoints of the lineages, i.e., they do not represent any cells. You can see them as the average cell with the maximum pseudotime value in that lineage. If you look at the image below from the slinghshot paper

the end of the lineages in a or b are the end of the black curves.

Let me know if that is unclear

Aintzane99 commented 9 months ago

Hello @HectorRDB,

Just wanted to reopen this issue as I wondering the same thing as @liuweihanty. I ran fitGAM() on a subset of 3 trajectories (3,4 and 5) from the output sce from slingshot. Before running the diffEndTest() function I wanted to understand which "end points" would be taken for the comparison.

Considering your explanation "You can see them as the average cell with the maximum pseudotime value in that lineage." , I wanted to take a look at the pseudotime values for each cell in lineages 3, 4 and 5 in my fitGAM() output sce.

PT_ftigam <-  sceFitGAM$crv %>% as.data.frame() 
PT_ftigam

image

Then I wanted to find how many cells shared the highest pseudotime value for each lineage, as I understood them to be the "end points". But for all 3, 4 and 5 lineages I could only find one cell with the maximum pseudotime value (example for lineage 5 below).

PT_ftigam %>% filter(pseudotime.Lineage5==max(pseudotime.Lineage5, na.rm = T))

                        pseudotime.Lineage3 pseudotime.Lineage4 pseudotime.Lineage5 cellWeights.Lineage3 cellWeights.Lineage4 cellWeights.Lineage5
C_LM_AGGAAATTCAAGGAGC-1           0.7925906           0.8041555            1.724516                    0                    0                    1

Certainly, I must be missing something here. If you could elaborate a little bit more on how the "end points" are chosen It would be much appreciated.

Many thanks,

Aintzane

HectorRDB commented 9 months ago

Hi @Aintzane99 that is indeed unexpected. Could you share a picture of the low dim represention of your data with the slingshot curves? Best,

Aintzane99 commented 9 months ago

Hello @HectorRDB, thank you so much for the quick reply. I will share a little bit more of the code as well as the plots to try to understand better what might be the issue:

# run slingshot on the 1:49 top PCs  on a sce object with 32285 genes and  26851 cells
reducedDim(allClusters_slingSCE, "corrected1to49") <- reducedDim(allClusters_slingSCE, "corrected_integrated")[, 1:49]
allClusters_slingSCE<- slingshot(allClusters_slingSCE, clusterLabels = 'cluster.nn.80.corrected', reducedDim = 'corrected1to49', approx_points=100, omega= TRUE)

# plot lineage structure
plot(reducedDims(allClusters_slingSCE)$corrected1to49, 
     col =myColors,
     pch=16, asp = 1)
lines(SlingshotDataSet(allClusters_slingSCE), lwd=2, type = 'lineages', col = 'black')

image

# visualize the lineages on a different coordinate space (UMAP) than the one in which they were constructed (corrected1to49)
embedded <- embedCurves(allClusters_slingSCE, "UMAP_integrated")
embedded <- slingCurves(embedded)

gg <- plotReducedDim(allClusters_slingSCE, colour_by="cluster.nn.80.corrected", dimred="UMAP_integrated")
gg
for (traject in 1:length(embedded) ) { 

  path= embedded[[traject]]
  x <- data.frame(path$s[path$ord,])

  gg <- gg + geom_path(data=x, aes(x=Dim.1, y=Dim.2), size=1.2)
}
gg

image

Then, we wanted to run tradeseq to analyse the gene expression along trajectories, but we were only interested on specific trajectories: image

In order to do that, we subsetted the SCE object to only contain cells with a non-zero weight in at least one lineage regarding 3, 4 and 5 :

pseudotime <- slingPseudotime(allClusters_slingSCE, na = FALSE)
cellWeights <- slingCurveWeights(allClusters_slingSCE)

# remove the cell weights that are 0 from the pseudotime and cellWeights matrices
cWsubset= cellWeights[  rowSums(cellWeights[ , c(3:5)]) !=0, c(3:5)] 
table( rowSums(cWsubset) !=0 )
pTime_subset = pseudotime[ rownames(cWsubset) , c(3:5)]

# run evaluateK on the subsetted cells and the top 2000 most variable genes in the subset of cells
counts <- assay(allClusters_slingSCE, "counts")

# calculate which are the highly variable genes
top2000_hvg_sb <- scran::getTopHVGs(allClusters_slingSCE[ ,rownames(cWsubset)], n=2000)

# subset the counts matrix to only contain those 2000 genes to run evaluateK
counts_2000_hvg <- counts[ top2000_hvg_sb , ]

# run evaluateK on the subset of cells and the top 2000 HVG
library(doParallel)
registerDoParallel(2)
BPPARAM <- BiocParallel::DoparParam()
set.seed(5) 
icMat2 <- evaluateK(counts = counts_2000_hvg[, rownames(cWsubset)], pseudotime = pTime_subset, cellWeights = cWsubset,
                    k=3:15, nGenes = 500, verbose = TRUE, plot = TRUE, parallel=TRUE, BPPARAM = BPPARAM )

plot results from evaluateK. We decided to go with 8 knots: image

Then, we run fitGAM on the subset of cells and the top 2000 HVGs:

set.seed(6)
sce <- fitGAM(counts = counts[, rownames(cWsubset)] ,
              pseudotime = pTime_subset, cellWeights = cWsubset, nknots = 8,
              verbose = TRUE, parallel=TRUE, BPPARAM = BPPARAM,
              genes= rownames(counts[, rownames(cWsubset)])[rownames(counts[, rownames(cWsubset)]) %in% top2000_hvg_sb] )

I hope that it was clear enough. Thank you so much!

Aintzane

HectorRDB commented 9 months ago

Hi, I assume that lineage 5 is the lineage that goes through cluster 10 on the plot above. As you can see, at least in the umap space, that lineage is really hyperextended and really "overshoot" cluster 10. I would recommand using fewer pcs to build the lineages, my guess is it, given your dataset, 50 pcs is too many and you have a few cells from cluster 10 really separated from the rest. Best,

Aintzane99 commented 8 months ago

Hi,

Thank you so much for the suggestion. We have recomputed the trajectories using fewer PCs and the curves on the UMAP look much better. However, we are still encountering the same issue regarding the maximum pseudotime value regarding that there seems to be no shared maximum pseudotime value between cells:

slingPseudotime(slingDF) %>% as.data.frame() %>% pull(Lineage3) %>% as.character() %>% sort(decreasing=TRUE) %>% head()
# "1.44170042069813" "1.43886170190371" "1.42711708624991" "1.4240670867249"  "1.42275394715436" "1.42213034771702"

Does the diffEndTest() function round in some way the pseudotime values for the cells and then compares the cells with the maximum rounded pseudotime value from each trajectory?

Thanks so much in advance :)

Aintzane

HectorRDB commented 8 months ago

Hi, Thanks for the feedback. Can you try when removing the as.character() part?

Aintzane99 commented 8 months ago

Yes! Sure, the same line of code without the as.character():

slingPseudotime(slingDF) %>% as.data.frame() %>% pull(Lineage3) %>% sort(decreasing=TRUE) %>% head()
[1] 1.441700 1.438862 1.427117 1.424067 1.422754 1.422130

Thank you so much :)

Aintzane