ankushs0128 commented 4 years ago

I’m running #tradeseq on approx 15k cells on remote clusters with over 72 GB of and more than 12 cores, It seems fitgam step is running over for 24 hours

The computations seem to be running after this step :

Using full covariance matrix Warning message: n if (class(X) == "dist") X <- as.matrix(X) : the condition has length > 1 and only the first element will be used Warning message: Usingas.character()on a quosure is deprecated as of rlang 0.3.0. Please useas_label()oras_name()` instead. This warning is displayed once per session. null device 1 null device 1

Is it possible to reduce the computational time for trajectory analysis

koenvandenberge commented 4 years ago

Hi @ankushs0128 do you mind sharing the code you are using, and some details on your trajectory (e.g., number of lineages, or a visualization)? Also note that tradeSeq does not use parallelization by default, which can be activated by setting parallel=TRUE, and setting your parallelization details, please see the vignette on how to do this. If you did not do this, then you will have little benefit from having this high number of cores.

The messages you are sharing with us seem to actually come from Slingshot (cc @kstreet13), according to me, so being able to take a look at your code would be helpful.

ankushs0128 commented 4 years ago

The code is pretty much the same as it is shown in the vignette. The input data is merged data for 4-time points d0, d7, d13, and d20 for differentiation protocol at cell lines. We don´t know yet about the lineages. Cells are showing differentiation as expected per the time point progression from d0 to d20.

R script:

experiment.merged <- readRDS("ctrlAllDays_RNA.SCT.integrated.rds")

# Generate slingshotdataset-object using umap embedding
# Start cluster: 3 and end clusters are either 6 or 9. 
sample_sds <- slingshot(Embeddings(experiment.merged, "umap"), clusterLabels = experiment.merged$seurat_clusters, start.clus = 3, end.clus = 9 )
pdf("MergedALL_UMAP.pdf", width=12, height=14, paper='special')
DimPlot(experiment.merged, reduction = 'umap', group.by = 'seurat_clusters', label=T)
dev.off() 
 d20Ctrl <- experiment.merged
# SCE object and Count matrix
DefaultAssay(d20Ctrl) <- "SCT" # Try integration adj data later!
sce_d20Ctrl <- as.SingleCellExperiment(d20Ctrl)
counts <- as.matrix(counts(sce_d20Ctrl))
crv <- sce_d20Ctrl
cluster_labels <- d20Ctrl$seurat_clusters

pdf("MergedALL_Colored by clusters.pdf.pdf", width=12, height=14, paper='special')
plotGeneCount(curve = sample_sds, clusters = cluster_labels,title = "Colored by clusters")
dev.off()
# Evaluate-k section
  # Evaluate K based on Slingshot object
  # It is generally a good idea to evaluate this multiple times using different seeds (using the set.seed function), 
  # to check whether the results are reproducible across different gene subsets.

  ### Based on Slingshot object
  set.seed(1234)
  icMat <- evaluateK(counts = counts, sds = sample_sds, k = 3:20, nGenes = 200, verbose = FALSE, plot= TRUE)
  print(icMat[1:2, ])

 ### Downstream of any trajectory inference method using pseudotime and cell weights
  set.seed(4321)
  pseudotime <- slingPseudotime(sample_sds, na=FALSE)
  cellWeights <- slingCurveWeights(sample_sds)
  icMat2 <- evaluateK(counts = counts, pseudotime = pseudotime, cellWeights = cellWeights,
                   k=10:30, nGenes = 200, verbose = FALSE, plot = TRUE)

BPPARAM <- BiocParallel::bpparam()
BPPARAM # lists current options
## class: MulticoreParam
##   bpisup: FALSE; bpnworkers: 2; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
##   bpexportglobals: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: FORK
BPPARAM$workers <- 64 

# FitGAM
  ### Based on Slingshot object
  DefaultAssay(d20Ctrl) <- 'integrated'
  genes <- VariableFeatures(d20Ctrl)
  pseudotime <- slingPseudotime(sample_sds, na = FALSE)
  cellWeights <- slingCurveWeights(sample_sds)
  sce <- fitGAM(counts = counts, pseudotime = pseudotime, cellWeights = cellWeights, nknots = 15, verbose = T)

sce_bckup <- sce
saveRDS(sce, file = "AFTER_FITGAM_STEP_MERGED.rds")

The RDS file is not saved so the computation time seems to be at this step

sh file:
`

Job name:

SBATCH --job-name=64core-tradeseq

#

Project:

SBATCH --account=Name of project

#

Wall clock limit:

SBATCH --time=42:00:00

#

Max memory usage: Size is a number plus M (megabyte) or G (gigabyte), e.g., 3M or 5G

SBATCH --mem-per-cpu=10G

#

Number of tasks (cores): this is added to make it easier for you to do exercises

SBATCH --ntasks=64

SBATCH --partition=bigmem

TASK STATUS EMAILS

SBATCH --mail-user=

SBATCH --mail-type=FAIL

SBATCH --mail-type=END

echo done`

TIme since script running with this memory allocation - 15:08:18 hours

Thanks Ankush

koenvandenberge commented 4 years ago

Hi @ankushs0128

Note that you mention that you don't know about the lineages, but your second line of actual code (the slingshot function) fits the trajectory, so you may want to check whether what's estimated there makes biological sense. You can visualize your reduced dimension Embeddings(experiment.merged, "umap") using appropriate colors, and then add the trajectory using, for example, lines(sample_sds, lwd=3). Also just printing sample_sds will tell you how many lineages there were estimated.

In the evaluateK step you set k=3:20 and k=10:30. That's a lot of knots, and we do not recommend these high numbers. We typically recommend something in the range of 3 to 10, and most datasets we've looked at fall in the range of 5 to 8 knots.
As I said in my previous response, you should set parallel=TRUE in fitGAM and provide it with your parallelization options, if you would like to do parallelization. So your code should be sce <- fitGAM(counts = counts, sds=sample_sds, nknots = 15, verbose = T, parallel=TRUE, BPPARAM=BPPARAM). Again, I urge you to consider whether this high number of knots are useful. Too many knots will result in a variable fit, and they will partially explain why you have a long computation time.

ankushs0128 commented 4 years ago

Hi @koenvandenberge

I chose the knots = 15 as computed by Tradeseq and as suggested in the vignette as it was not falling in the range at 3:10. In any case, I also did what you suggested with 5 and 8 knots. But still, Tradeseq is taking long computation time with parallelization: the data set I´m using is merged 4 different timepoints. 1 Time-point data is also taking approx 12-18hrs with your suggested parameters which seems not to be optimal for the datasets. Optimal knots for this dataset is 10 knots.

koenvandenberge commented 4 years ago

Hi @ankushs0128

Please let us know how many genes, cells and lineages you are fitting. Without knowing this we cannot know whether your time is abnormal or not. Please also see #64 for a discussion on fitGAM runtime.

koenvandenberge commented 4 years ago

I am closing this due to inactivity. Feel free to reopen if further help would be required.

statOmics / tradeSeq

tradeSeq #60

Job name:

SBATCH --job-name=64core-tradeseq

Project:

SBATCH --account=Name of project

Wall clock limit:

SBATCH --time=42:00:00

Max memory usage: Size is a number plus M (megabyte) or G (gigabyte), e.g., 3M or 5G

SBATCH --mem-per-cpu=10G

Number of tasks (cores): this is added to make it easier for you to do exercises

SBATCH --ntasks=64

SBATCH --partition=bigmem

TASK STATUS EMAILS

SBATCH --mail-user=

SBATCH --mail-type=FAIL

SBATCH --mail-type=END