Open sid5427 opened 3 months ago
Regarding the first issue of running SCARlink on the entire data: was it run on a single core? SCARlink parallelizes over multiple cores and saves the output and log files separately for the runs on each core.
For the second problem, I would need to look at the data to determine the problem. I would imagine that the correlation would be None if the validation set of the gene had all 0s. Do you have the output for the same gene from the larger data set?
For the first part - no, I am giving it 48 cores. My command looks like this - scarlink -o multiome_out -g mm10 -np 48 -c R7
I am also launching the job manager with 48 cores. I only see one log file being generated each time I run the tool.
I'll have to check the results of the larger dataset - as I mentioned it's still running, and I am now worried it's running on just one core, instead of the allotted 48 cores...
Is there any way to check how many cores the tool sees when we run it?
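Not SCARlink-specific, but one way to sanity-check this is a generic Linux test run inside the submitted job (not on the login node), comparing what the scheduler granted with what a process actually sees:

```shell
# Affinity/cgroup-aware count of CPUs this job can actually use
visible=$(nproc)
echo "CPUs visible: $visible"
# A Python process (like scarlink) sees the same affinity-limited count:
python3 -c "import os; print(len(os.sched_getaffinity(0)))"
```

If this prints 1 inside a 48-core allocation, the scheduler pinned the job to a single core regardless of what was requested.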
Not an author, but if you submit the job as an array job (i.e., for the SGE scheduler, using the -t option for qsub and $SGE_TASK_ID for scarlink -p), you can split the individual regression tasks across multiple cores, since the tasks run independently of each other. I just ran 2000 variable genes across 25k cells and 10 cell-type clusters, with a decent number of genes dropped due to sparsity, in just under 12 hours. The parallel environment option (-pe) would just run the code on one core; I had this exact issue and thought it had to do with some GPU/CUDA problem, but allocating 1 slot or 20 slots did not change the runtime, so I ran each task on a single core across an array of jobs.
My HPC account allows 9 processes to run at a time, and I set scarlink -np to 1000, so my output has 1000 logs and 1000 coefficient .h5 files, with 2 genes evaluated in each. scarlink_tiles ran in a few minutes and generated a ~1 GB csv file.
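The 2-genes-per-task figure follows from simple chunking arithmetic (a sketch consistent with the numbers reported here; the exact split SCARlink uses internally may differ):

```shell
# With G genes kept and -np N array tasks, each task gets roughly ceil(G/N) genes.
genes=2000
np=1000
per_task=$(( (genes + np - 1) / np ))   # ceiling division
echo "$per_task genes per task"         # 2 per task, matching the 1000 logs above
```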
Here's the script I submitted to the scheduler using the Docker image:
. /u/local/Modules/default/init/modules.sh
module load apptainer
SIF_PATH="SCARlink/Docker/scarlink_latest.sif"
CMD="scarlink -o multiome_out -g mm10 -c cell_type -p $SGE_TASK_ID -np 1000"
apptainer exec $SIF_PATH $CMD
@yahbish - Thanks for the info. Unfortunately, our institute's cluster uses a different scheduling system: we have an LSF-based cluster, and I assume both the authors' cluster and yours are Grid Engine (SGE) based. I am not sure the commands are transferable.
Try this for LSF, which it looks like the authors also use:
#!/bin/bash
#BSUB -J "scarlink_job[1-100]"
#BSUB -o output_%J_%I.log
#BSUB -e error_%J_%I.log
#BSUB -R "rusage[mem=2000]"
#BSUB -R "select[gpu]"
#BSUB -n 1
and for the -p argument, use $LSB_JOBINDEX.
The GPU may be optional, but I ran mine with it requested.
@yahbish Hmm I tried this ... but it just launched one job across 24 cores....
My bsub script looks like this ...
#!/bin/bash
#BSUB -J "scarlink_job[1-24]"
#BSUB -W 24:00
#BSUB -R "rusage[mem=2000]"
#BSUB -n 1
#BSUB -e logs_2/scarlink_s2_marker_genes_%J_%I.err
#BSUB -o logs_2/scarlink_s2_marker_genes_%J_%I.out
module load anaconda3
#module load cuda/11.7 ##disabled GPU for now
source activate scarlink-env
echo "running scarlink"
scarlink -o multiome_out_MARKER_GENES_second -g mm10 -np 24 -c R7 -p $LSB_JOBINDEX
Should I be running the job with cat << EOF ?
Did you get 24 log files and scarlink_output files?
Try this for LSF, which it looks like the authors also use:
#!/bin/bash
#BSUB -J "scarlink_job[1-100]"
#BSUB -o output_%J_%I.log
#BSUB -e error_%J_%I.log
#BSUB -R "rusage[mem=2000]"
#BSUB -R "select[gpu]"
#BSUB -n 1
and for the -p argument, use $LSB_JOBINDEX
The GPU may be optional, but I ran mine with it requested
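A quick way to answer the log-count question for any array run is to count the per-task files in the output directory; the paths and file patterns below are assumptions based on this thread, so adjust them to your actual -o directory. The demo uses a scratch directory standing in for the real output:

```shell
# An N-task array should leave N logs and N coefficient files in the output dir.
outdir=$(mktemp -d)                         # stand-in for the real -o directory
touch "$outdir"/task_{1..24}.log            # simulate 24 per-task logs
touch "$outdir"/coefficients_{1..24}.h5     # simulate 24 per-task .h5 files
nlogs=$(find "$outdir" -name '*.log' | wc -l)
nh5=$(find "$outdir" -name '*.h5' | wc -l)
echo "logs=$nlogs h5=$nh5"                  # logs=24 h5=24
```

If the counts come back as 1 instead of the array size, only one task actually ran.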
@yahbish yes, we also use LSF. This version might run faster on the CPU.
Hi all,
Yes, I was able to run the array job. There was some issue with the array job settings, which our IT support fixed.
The program produced 100 logs and 100 .h5 files in the output directory, and the next step in SCARlink was also able to complete.
For future reference, this is the script I ran.
#!/bin/bash
#BSUB -J "scarlink_job[1-100]"
#BSUB -W 24:00
#BSUB -R "rusage[mem=16G]"
#BSUB -M 16000
#BSUB -n 1
#BSUB -e scarlink__PARALLEL_log_%J_%I.err
#BSUB -o scarlink__PARALLEL_log_%J_%I.out
module load anaconda3
source activate scarlink-env
scarlink -o multiome_out -g mm10 -np 100 -c cell_cluster -p $LSB_JOBINDEX
Hi Authors,
I am trying to run SCARlink on some of our own data. I was able to run and finish the tutorial successfully. All packages are installed as per the instructions.
To elaborate: we have some multiome data. I generated the Seurat object, which includes the entire GEX matrix, from which the top 5000 variable genes were calculated. For the ArchR object, as suggested, I built it with the 'tilematrix' matrix within it and peaks called with macs2. I did not, however, generate clusters with Seurat's function; instead we have a curated barcode-to-cell-cluster annotation, which was added to both the Seurat and ArchR objects. I also made sure that most of the barcodes are present in both objects.
When I run with this larger dataset, the tool seems to be running fine, with some genes being dropped as too sparse while those that are not sparse proceed to the regression computation - for example -
However, this is taking a while ... One job was started with 48 threads and 256 GB RAM on our cluster, and it has worked through about 400 genes in 3 days.
So we decided to just run the tool on a small set of genes of interest. Some were reported as sparse, which is fine. However, the ones that are not sparse simply do not finish the regression computation, and I get this error - Note: I simply filtered the main Seurat object for these genes and created one which I call subset_seurat_object.
Any clues as to what might be the issue? Some of these genes are very highly expressed - checked by summing expression values across all cells.
Any help would be appreciated. Sid