nick-youngblut opened 5 days ago
I should note that including `library(future); plan("sequential")` in the function called by `Q` does not solve the stalling problem.
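Concretely, this is roughly what I tried (a sketch; the `plan()` call sits inside the function body that `Q` executes on the worker):

```r
fx = function(seurat_obj) {
    library(future)
    plan("sequential")     # force single-process execution on the worker
    library(Seurat)
    ScaleData(seurat_obj)  # still stalls despite the sequential plan
}
```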
This works on my system. If I run `Q` with `log_worker=TRUE`, I get the following error in the console (and no additional error in the worker log):

```
Error: 1/1 jobs failed (0 warnings). Stopping. (Error #1) could not find function "%>%"
```
If I then load the `dplyr` package in addition to Seurat, this runs without errors (on `multiprocess`):

```r
Q(fx, seurat_obj=c(pbmc), pkgs=c("Seurat", "dplyr"), n_jobs=1, job_size=1, memory=12 * 1024)
```
Thanks for the quick feedback, and thanks for pointing out the issue with my reprex. While stripping down my code, I accidentally removed `dplyr`.
I've done some more testing, and it appears that the issue is with loading Seurat or SeuratData. If I simply run:
```r
library(clustermq)
options(clustermq.scheduler = "slurm")

fx = function(x) {
    library(Seurat)
    return(x * 2)
}

Q(fx, x=1:3, n_jobs=3, pkgs=c("Seurat"), job_size=4, memory=8192)
```
The job takes 5 minutes. The same goes if I swap out `Seurat` for `SeuratData`. However, the job takes only 15-20 seconds if I swap out `Seurat` for `dplyr`.
Any ideas why the jobs take so long just to load R packages? Maybe this simply reflects poor performance in general, and loading Seurat requires more compute (and I/O) than loading dplyr? I/O is often a bottleneck on my HPC, so I wonder if that is causing the problem.
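One way I can try to narrow this down is to time the `library()` call on the worker itself (a rough sketch; `pkgs` is omitted here so the load happens inside the function, where it can be timed):

```r
fx = function(x) {
    t = system.time(library(Seurat))  # time the package load on the worker
    list(result = x * 2, load_seconds = unname(t["elapsed"]))
}
Q(fx, x=1:3, n_jobs=3, job_size=4, memory=8192)
```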
I'm also running clustermq from a Jupyter notebook running an R kernel. The conda env variables seem to be transferred to the clustermq jobs correctly, but maybe this is causing the issue.
The following does work, but it takes ~5.5 minutes:
```r
seurat_preprocess = function(seurat_obj, dims=1:30) {
    # load libraries
    library(dplyr)
    library(Seurat)

    # pre-process seurat object
    seurat_obj %>%
        SCTransform() %>%
        RunPCA() %>%
        FindNeighbors(dims=dims) %>%
        FindClusters() %>%
        RunUMAP(dims=dims)
}

pbmc = Q(
    seurat_preprocess,
    seurat_obj=c(pbmc),
    const=list(dims=1:30),
    pkgs=c("dplyr", "Seurat"),
    n_jobs=1,
    job_size=1,
    log_worker=TRUE,  # note: the argument is log_worker, not log_workers
    template = list(
        memory = 8 * 1024,
        log_file = "clustermq.log"
    ))[[1]]
pbmc
```
Based on the logs, it appears that almost all of that time was spent loading the Seurat package.
How long does loading the package in an interactive job take?
In any case, we're just executing the R function code, so I don't think `clustermq` can do anything to make the package loading faster.
> How long does loading the package in an interactive job take?
Just a couple of seconds.
There's clearly an issue with the clustermq jobs: loading Seurat takes only a couple of seconds in an interactive session on a cluster node, but many minutes when submitted via clustermq as an sbatch job.
It's likely something odd about our Slurm setup. I'll work with the cluster admin. I'd appreciate any thoughts on the matter, if you have them.
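One difference I can check is where each context loads packages from; if the batch jobs resolve to a different library path (e.g., on a slower networked filesystem), that alone could explain the gap. A quick sketch to compare:

```r
# library paths as seen by a clustermq worker
fx = function(x) .libPaths()
Q(fx, x=1, n_jobs=1)

# compare against the interactive session
.libPaths()
```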
It appears that various multi-processing Seurat commands that use the future R package (e.g., `ScaleData`) cause clustermq to "stall", and the jobs never complete. A reprex:

The SLURM cluster job stalls, but quickly completes successfully if `seurat_obj %>% ScaleData()` is replaced with simply `return(seurat_obj)`. If `pbmc %>% ScaleData()` is used directly instead of `Q(fx, ...)` (no SLURM job), then the `ScaleData` process completes in just a couple of seconds. I'm guessing this issue is due to how `ScaleData` is parallelized.

The following does work:

...so it's likely due to how `ScaleData` parallelizes data processing. The same "stalling" occurs with `SCTransform`, which I believe still uses `ScaleData` under the hood. Any ideas?
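Since `ScaleData`'s parallelism goes through `future`, one diagnostic I can run is to check which plan the clustermq worker actually ends up with (a sketch; calling `plan()` with no arguments returns the current plan):

```r
fx = function(x) {
    library(future)
    class(plan())  # e.g. includes "sequential", "multicore", or "multisession"
}
Q(fx, x=1, n_jobs=1)
```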
sessionInfo