statOmics / tradeSeq

TRAjectory-based Differential Expression analysis for SEQuencing data
Other
237 stars 27 forks source link

tradeSeq #60

Closed ankushs0128 closed 4 years ago

ankushs0128 commented 4 years ago

I’m running #tradeseq on approx 15k cells on remote clusters with over 72 GB of and more than 12 cores, It seems fitgam step is running over for 24 hours

The computations seem to be running after this step :

Using full covariance matrix Warning message: n if (class(X) == "dist") X <- as.matrix(X) : the condition has length > 1 and only the first element will be used Warning message: Usingas.character()on a quosure is deprecated as of rlang 0.3.0. Please useas_label()oras_name()` instead. This warning is displayed once per session. null device 1 null device 1

Is it possible to reduce the computational time for trajectory analysis

koenvandenberge commented 4 years ago

Hi @ankushs0128 do you mind sharing the code you are using, and some details on your trajectory (e.g., number of lineages, or a visualization)? Also note that tradeSeq does not use parallelization by default, which can be activated by setting parallel=TRUE, and setting your parallelization details, please see the vignette on how to do this. If you did not do this, then you will have little benefit from having this high number of cores.

The messages you are sharing with us seem to actually come from Slingshot (cc @kstreet13), according to me, so being able to take a look at your code would be helpful.

ankushs0128 commented 4 years ago

The code is pretty much the same as it is shown in the vignette. The input data is merged data for 4-time points d0, d7, d13, and d20 for differentiation protocol at cell lines. We don´t know yet about the lineages. Cells are showing differentiation as expected per the time point progression from d0 to d20.

R script:

experiment.merged <- readRDS("ctrlAllDays_RNA.SCT.integrated.rds")

# Generate slingshotdataset-object using umap embedding
# Start cluster: 3 and end clusters are either 6 or 9. 
sample_sds <- slingshot(Embeddings(experiment.merged, "umap"), clusterLabels = experiment.merged$seurat_clusters, start.clus = 3, end.clus = 9 )
pdf("MergedALL_UMAP.pdf", width=12, height=14, paper='special')
DimPlot(experiment.merged, reduction = 'umap', group.by = 'seurat_clusters', label=T)
dev.off() 
 d20Ctrl <- experiment.merged
# SCE object and Count matrix
DefaultAssay(d20Ctrl) <- "SCT" # Try integration adj data later!
sce_d20Ctrl <- as.SingleCellExperiment(d20Ctrl)
counts <- as.matrix(counts(sce_d20Ctrl))
crv <- sce_d20Ctrl
cluster_labels <- d20Ctrl$seurat_clusters

pdf("MergedALL_Colored by clusters.pdf.pdf", width=12, height=14, paper='special')
plotGeneCount(curve = sample_sds, clusters = cluster_labels,title = "Colored by clusters")
dev.off()
# Evaluate-k section
  # Evaluate K based on Slingshot object
  # It is generally a good idea to evaluate this multiple times using different seeds (using the set.seed function), 
  # to check whether the results are reproducible across different gene subsets.

  ### Based on Slingshot object
  set.seed(1234)
  icMat <- evaluateK(counts = counts, sds = sample_sds, k = 3:20, nGenes = 200, verbose = FALSE, plot= TRUE)
  print(icMat[1:2, ])

 ### Downstream of any trajectory inference method using pseudotime and cell weights
  set.seed(4321)
  pseudotime <- slingPseudotime(sample_sds, na=FALSE)
  cellWeights <- slingCurveWeights(sample_sds)
  icMat2 <- evaluateK(counts = counts, pseudotime = pseudotime, cellWeights = cellWeights,
                   k=10:30, nGenes = 200, verbose = FALSE, plot = TRUE)

BPPARAM <- BiocParallel::bpparam()
BPPARAM # lists current options
## class: MulticoreParam
##   bpisup: FALSE; bpnworkers: 2; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
##   bpexportglobals: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: FORK
BPPARAM$workers <- 64 

# FitGAM
  ### Based on Slingshot object
  DefaultAssay(d20Ctrl) <- 'integrated'
  genes <- VariableFeatures(d20Ctrl)
  pseudotime <- slingPseudotime(sample_sds, na = FALSE)
  cellWeights <- slingCurveWeights(sample_sds)
  sce <- fitGAM(counts = counts, pseudotime = pseudotime, cellWeights = cellWeights, nknots = 15, verbose = T)

sce_bckup <- sce
saveRDS(sce, file = "AFTER_FITGAM_STEP_MERGED.rds")

The RDS file is not saved so the computation time seems to be at this step

sh file:
`

Job name:

SBATCH --job-name=64core-tradeseq

#

Project:

SBATCH --account=Name of project

#

Wall clock limit:

SBATCH --time=42:00:00

#

Max memory usage: Size is a number plus M (megabyte) or G (gigabyte), e.g., 3M or 5G

SBATCH --mem-per-cpu=10G

#

Number of tasks (cores): this is added to make it easier for you to do exercises

SBATCH --ntasks=64

SBATCH --partition=bigmem

TASK STATUS EMAILS

SBATCH --mail-user=

SBATCH --mail-type=FAIL

SBATCH --mail-type=END

echo done`

TIme since script running with this memory allocation - 15:08:18 hours

Thanks Ankush

koenvandenberge commented 4 years ago

Hi @ankushs0128

ankushs0128 commented 4 years ago

Hi @koenvandenberge

I chose the knots = 15 as computed by Tradeseq and as suggested in the vignette as it was not falling in the range at 3:10. In any case, I also did what you suggested with 5 and 8 knots. But still, Tradeseq is taking long computation time with parallelization: the data set I´m using is merged 4 different timepoints. 1 Time-point data is also taking approx 12-18hrs with your suggested parameters which seems not to be optimal for the datasets. Optimal knots for this dataset is 10 knots.

koenvandenberge commented 4 years ago

Hi @ankushs0128

Please let us know how many genes, cells and lineages you are fitting. Without knowing this we cannot know whether your time is abnormal or not. Please also see #64 for a discussion on fitGAM runtime.

koenvandenberge commented 4 years ago

I am closing this due to inactivity. Feel free to reopen if further help would be required.