Closed ankushs0128 closed 4 years ago
Hi @ankushs0128, do you mind sharing the code you are using, along with some details on your trajectory (e.g., the number of lineages, or a visualization)? Also note that tradeSeq does not use parallelization by default. You can activate it by setting parallel=TRUE and providing your parallelization details; please see the vignette on how to do this. If you did not do this, you will get little benefit from the high number of cores.
The messages you are sharing seem to actually come from Slingshot (cc @kstreet13), so being able to take a look at your code would be helpful.
The code is pretty much the same as shown in the vignette. The input is merged data from four time points (d0, d7, d13, and d20) of a differentiation protocol in cell lines. We don't know the lineages yet. Cells show differentiation as expected, progressing from d0 to d20.
R script:
experiment.merged <- readRDS("ctrlAllDays_RNA.SCT.integrated.rds")
# Generate slingshotdataset-object using umap embedding
# Start cluster: 3 and end clusters are either 6 or 9.
sample_sds <- slingshot(Embeddings(experiment.merged, "umap"), clusterLabels = experiment.merged$seurat_clusters, start.clus = 3, end.clus = 9 )
pdf("MergedALL_UMAP.pdf", width=12, height=14, paper='special')
DimPlot(experiment.merged, reduction = 'umap', group.by = 'seurat_clusters', label=T)
dev.off()
d20Ctrl <- experiment.merged
# SCE object and Count matrix
DefaultAssay(d20Ctrl) <- "SCT" # Try integration adj data later!
sce_d20Ctrl <- as.SingleCellExperiment(d20Ctrl)
counts <- as.matrix(counts(sce_d20Ctrl))
crv <- sce_d20Ctrl
cluster_labels <- d20Ctrl$seurat_clusters
pdf("MergedALL_Colored_by_clusters.pdf", width=12, height=14, paper='special')
plotGeneCount(curve = sample_sds, clusters = cluster_labels, title = "Colored by clusters")
dev.off()
# Evaluate-k section
# Evaluate K based on Slingshot object
# It is generally a good idea to evaluate this multiple times using different seeds (using the set.seed function),
# to check whether the results are reproducible across different gene subsets.
### Based on Slingshot object
set.seed(1234)
icMat <- evaluateK(counts = counts, sds = sample_sds, k = 3:20, nGenes = 200, verbose = FALSE, plot= TRUE)
print(icMat[1:2, ])
### Downstream of any trajectory inference method using pseudotime and cell weights
set.seed(4321)
pseudotime <- slingPseudotime(sample_sds, na=FALSE)
cellWeights <- slingCurveWeights(sample_sds)
icMat2 <- evaluateK(counts = counts, pseudotime = pseudotime, cellWeights = cellWeights,
k=10:30, nGenes = 200, verbose = FALSE, plot = TRUE)
BPPARAM <- BiocParallel::bpparam()
BPPARAM # lists current options
## class: MulticoreParam
## bpisup: FALSE; bpnworkers: 2; bptasks: 0; bpjobname: BPJOB
## bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
## bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
## bpexportglobals: TRUE
## bplogdir: NA
## bpresultdir: NA
## cluster type: FORK
BPPARAM$workers <- 64
# FitGAM
### Based on Slingshot object
DefaultAssay(d20Ctrl) <- 'integrated'
genes <- VariableFeatures(d20Ctrl)
pseudotime <- slingPseudotime(sample_sds, na = FALSE)
cellWeights <- slingCurveWeights(sample_sds)
sce <- fitGAM(counts = counts, pseudotime = pseudotime, cellWeights = cellWeights, nknots = 15, verbose = T)
sce_bckup <- sce
saveRDS(sce, file = "AFTER_FITGAM_STEP_MERGED.rds")
The RDS file is never saved, so the computation time seems to be spent in the fitGAM step.
sh file:
`
#
#
#
#
echo done`
Time since the script started running with this memory allocation: 15:08:18 hours
Thanks, Ankush
Hi @ankushs0128
Slingshot (the slingshot function) fits the trajectory, so you may want to check whether what is estimated there makes biological sense.
You can visualize your reduced dimension, Embeddings(experiment.merged, "umap"), using appropriate colors, and then add the trajectory using, for example, lines(sample_sds, lwd=3). Also, just printing sample_sds will tell you how many lineages were estimated.
In the evaluateK step you set k=3:20 and k=10:30. That's a lot of knots, and we do not recommend such high numbers. We typically recommend something in the range of 3 to 10, and most datasets we've looked at fall in the range of 5 to 8 knots.
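The check described above could be wrapped up roughly as follows. This is only a sketch: inspect_trajectory is a hypothetical helper name (not part of slingshot or tradeSeq), it assumes the slingshot package is installed, and the axis labels assume a UMAP embedding as in the script.

```r
# Sketch only: `inspect_trajectory` is a hypothetical helper, not a slingshot
# or tradeSeq function. Assumes the slingshot package is installed.
inspect_trajectory <- function(embedding, clusters, sds) {
  # One colour code per cluster label.
  cluster_cols <- as.integer(as.factor(clusters))
  plot(embedding[, 1:2], col = cluster_cols, pch = 16, cex = 0.5,
       xlab = "UMAP 1", ylab = "UMAP 2")
  # Overlay the fitted curves, as suggested in the thread.
  # SlingshotDataSet() is used so that newer slingshot outputs also work.
  lines(slingshot::SlingshotDataSet(sds), lwd = 3)
  # Each element of slingLineages() is one lineage (an ordered set of clusters).
  lineages <- slingshot::slingLineages(sds)
  message("Number of lineages: ", length(lineages))
  invisible(lineages)
}

# Usage with the objects from the script above (not run here):
# inspect_trajectory(Embeddings(experiment.merged, "umap"),
#                    experiment.merged$seurat_clusters, sample_sds)
```

If the curves do not follow the d0 to d20 progression you expect, it is probably worth revisiting the slingshot call before spending compute on fitGAM.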
As I said in my previous response, you should set parallel=TRUE in fitGAM and provide it with your parallelization options if you would like to run in parallel. So your code should be sce <- fitGAM(counts = counts, sds=sample_sds, nknots = 15, verbose = T, parallel=TRUE, BPPARAM=BPPARAM). Again, I urge you to consider whether this high number of knots is useful. Too many knots will result in a variable fit, and they partially explain why you have a long computation time.
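A minimal sketch of the parallel call, assuming tradeSeq and BiocParallel are installed. The wrapper fit_gam_parallel, the helper gene_indices, and the default of 4 workers are illustrative assumptions and not part of either package; fitGAM's parallel, BPPARAM, nknots and genes arguments are the real ones. Restricting the fit to a gene subset (for example the variable features the script computes but never passes on) is an optional extra that was not suggested in this thread.

```r
# Sketch only: `gene_indices` and `fit_gam_parallel` are hypothetical helpers.

# Map gene names to row indices of the count matrix, ignoring names that are absent.
gene_indices <- function(counts, gene_names) {
  idx <- which(rownames(counts) %in% gene_names)
  if (length(idx) == 0) stop("none of the requested genes are rows of `counts`")
  idx
}

# Run fitGAM in parallel on all genes, or on a subset when `gene_names` is given.
fit_gam_parallel <- function(counts, sds, nknots = 6, n_workers = 4,
                             gene_names = NULL) {
  genes <- if (is.null(gene_names)) {
    seq_len(nrow(counts))
  } else {
    gene_indices(counts, gene_names)
  }
  # Forked workers; on Windows a different BiocParallel backend would be needed.
  bpparam <- BiocParallel::MulticoreParam(workers = n_workers)
  tradeSeq::fitGAM(counts = counts, sds = sds, genes = genes, nknots = nknots,
                   verbose = TRUE, parallel = TRUE, BPPARAM = bpparam)
}

# Usage with the objects from the script above (not run here):
# sce <- fit_gam_parallel(counts, sample_sds, nknots = 6, n_workers = 12,
#                         gene_names = VariableFeatures(d20Ctrl))
```

Note that the number of workers should not exceed the cores actually allocated to the job, otherwise the workers just compete with each other.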
Hi @koenvandenberge
I chose 15 knots as computed by tradeSeq and as suggested in the vignette, since the optimum did not fall in the 3:10 range. In any case, I also tried 5 and 8 knots as you suggested. tradeSeq still takes a long time, even with parallelization. The dataset I'm using merges 4 different time points, and even the data for a single time point takes approximately 12-18 hours with your suggested parameters, which does not seem to be optimal for these datasets. The optimal number of knots for this dataset is 10.
Hi @ankushs0128
Please let us know how many genes, cells and lineages you are fitting. Without this information we cannot tell whether your runtime is abnormal.
Please also see #64 for a discussion on fitGAM runtime.
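For reference, those three numbers can be read straight off the inputs to fitGAM. The sketch below uses a hypothetical helper, summarise_fit_size, on made-up toy matrices; it assumes genes are the rows of the count matrix and that the pseudotime matrix has one row per cell and one column per lineage, as returned by slingPseudotime.

```r
# Sketch only: `summarise_fit_size` is a hypothetical helper, shown on toy data.
summarise_fit_size <- function(counts, pseudotime) {
  # Cells are columns of `counts` and rows of `pseudotime`.
  stopifnot(ncol(counts) == nrow(pseudotime))
  c(genes = nrow(counts), cells = ncol(counts), lineages = ncol(pseudotime))
}

# Toy inputs: 5 genes, 8 cells, 2 lineages.
toy_counts <- matrix(rpois(5 * 8, lambda = 3), nrow = 5, ncol = 8)
toy_pseudotime <- matrix(runif(8 * 2), nrow = 8, ncol = 2)
print(summarise_fit_size(toy_counts, toy_pseudotime))

# With the objects from the script above (not run here):
# summarise_fit_size(counts, slingPseudotime(sample_sds, na = FALSE))
```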
I am closing this due to inactivity. Feel free to reopen if you need further help.
I’m running #tradeseq on approx. 15k cells on a remote cluster with over 72 GB of memory and more than 12 cores. The fitGAM step seems to have been running for over 24 hours.
The computations seem to be running after this step:
Is it possible to reduce the computational time for trajectory analysis?