snehamitra / SCARlink


issue running SCARlink on filtered genes #7

Open sid5427 opened 3 months ago

sid5427 commented 3 months ago

Hi Authors,

I am trying to run SCARlink on some of our own data. I was able to run and finish the tutorial successfully. All packages are installed as per the instructions.

To elaborate - we have some multiome data. I generated the Seurat object, which includes the entire GEX matrix from which the top 5000 variable genes were calculated. The ArchR object was built as suggested, with the TileMatrix in it and peaks called with MACS2. I did not, however, generate clusters with Seurat's function; instead we have a curated barcode-to-cell-cluster annotation, which was added to both the Seurat and ArchR objects. I also made sure that most of the barcodes are present in both objects.

When I run on this larger dataset, the tool seems to be running fine: some genes are dropped because they are too sparse, while the ones that are not sparse proceed to the regression computation - for example:

2024-04-01 12:36:36 INFO     Training regression model on Trpa1
2024-04-01 12:36:36 INFO     Trpa1 expression too sparse with sparsity = 0.9995555423864411. Maximum number of zeros allowed is 0.9. Returning...
2024-04-01 12:36:36 INFO     Training regression model on Terf1
2024-04-01 12:36:36 INFO     Performing 5-fold cross validation...
2024-04-01 13:03:05 INFO     Avg. cross-validation spearman correlation: 0.0295485016512955
2024-04-01 13:03:05 INFO     Chosen regularization parameter: 20
2024-04-01 13:03:06 INFO     Spearman corr on test set: 0.0446074921323468

However, this is taking a while... One job was started with 48 threads and 256 GB RAM on our cluster, and it has worked through about 400 genes in 3 days.

So we decided to run the tool on just a small set of genes of interest - some were reported to be sparse, which is fine. However, the ones that are not sparse simply do not finish the regression computation, and I get this error. Note: I simply filtered the main Seurat object for these genes and created one which I call subset_seurat_object.

2024-04-02 23:16:41 INFO     Training regression model on Rasal2
2024-04-02 23:16:41 INFO     Performing 5-fold cross validation...
2024-04-03 00:01:46 INFO     Avg. cross-validation spearman correlation: None
2024-04-03 00:01:46 INFO     Chosen regularization parameter: None
2024-04-03 00:01:46 INFO     ERROR: Regression could not be estimated. Returning...
2024-04-03 00:01:46 INFO     Training regression model on Cd247
2024-04-03 00:01:46 INFO     Cd247 expression too sparse with sparsity = 0.9405019407982459. Maximum number of zeros allowed is 0.9. Returning... 

Any clues as to what might be the issue? Some of these genes are very highly expressed - I checked by summing the expression values across all cells.

Any help would be appreciated. Sid

snehamitra commented 3 months ago

Regarding the first issue of running SCARlink on the entire data: was it run on a single core? SCARlink parallelizes over multiple cores and saves the output and log files separately for the runs on each core.
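To make that concrete - a minimal sketch, not an official recipe, assuming (as the commands posted later in this thread suggest) that -np is the total number of chunks and -p selects which chunk a given process handles - the same parallelization can be driven by hand from a shell by starting one process per chunk:

#!/bin/bash
# Hypothetical manual launch: one scarlink process per chunk, all in the background.
# Each process should write its own log and output file, per the description above.
# Arguments are copied from the command sid5427 posts below.
NP=48
for p in $(seq 1 "$NP"); do
    scarlink -o multiome_out -g mm10 -np "$NP" -c R7 -p "$p" &
done
wait    # block until every chunk has finished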

For the second problem, I would need to look at the data to determine the problem. I would imagine that the correlation would be None if the validation set of the gene had all 0s. Do you have the output for the same gene from the larger data set?
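(Just to illustrate the mechanism described above, not SCARlink's actual code path: Spearman correlation is undefined when one of the vectors is constant, e.g. an all-zero validation fold, which would be consistent with the None values in the log.)

# Spearman correlation against an all-zero vector comes back NaN (undefined):
python -c "from scipy.stats import spearmanr; print(spearmanr([1, 2, 3, 4], [0, 0, 0, 0]))"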

sid5427 commented 3 months ago

For the first part - no, I am giving it 48 cores. My command looks like this - scarlink -o multiome_out -g mm10 -np 48 -c R7

I am also launching the job manager with 48 cores. I only see one log file being generated each time I run the tool.

I'll have to check the results of the larger dataset - as I mentioned it's still running, and I am now worried it's running on just one core, instead of the allotted 48 cores...

Is there any way to check how many cores the tool sees when we run it?
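(Not SCARlink-specific, but a quick sanity check, run from inside the same allocation/shell the tool runs in, is to ask the OS how many CPUs the process is actually allowed to use:)

nproc              # CPUs the current process may use (respects scheduler/cgroup limits)
taskset -cp $$     # CPU affinity list of the current shell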

yahbish commented 3 months ago

Not an author, but if you submit the job as an array job (i.e. for the SGE scheduler, using the -t option for qsub and $SGE_TASK_ID for scarlink -p), you can split the individual regressions across multiple cores, since they run independently of each other. I just ran 2000 variable genes across 25k cells and 10 cell-type clusters, with a decent number of genes dropped due to sparsity, in just under 12 hours. The parallel environment option (-pe) would still run the code on one core - I had this exact issue and thought it had to do with some GPU/CUDA problem, but allocating 1 slot or 20 slots did not change the runtime, so I ran each task on a single core across an array of jobs.

My HPC account allowed 9 processes to run at a time, and I set scarlink -np to 1000, so my output has 1000 logs and 1000 coefficient .h5 files with 2 genes evaluated each. scarlink_tiles ran in a few minutes and generated a ~1 GB csv file.

Here's the script I submitted to the scheduler using the Docker image:

#!/bin/bash
#$ -cwd
#$ -N scarlink_job
#$ -o logs/$JOB_NAME.o$JOB_ID.$TASK_ID
#$ -e logs/$JOB_NAME.e$JOB_ID.$TASK_ID
#$ -t 1-1000
#$ -l h_data=2G
#$ -l gpu
#$ -l h_rt=24:00:00

. /u/local/Modules/default/init/modules.sh
module load apptainer

SIF_PATH="SCARlink/Docker/scarlink_latest.sif"
CMD="scarlink -o multiome_out -g mm10 -c cell_type -p $SGE_TASK_ID -np 1000"

apptainer exec $SIF_PATH $CMD
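(For anyone copying this: assuming the script above is saved as, say, run_scarlink_array.sge - the file name is made up - it is submitted with qsub, which reads the #$ directives, including the -t 1-1000 array range, from the file itself:)

qsub run_scarlink_array.sge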

sid5427 commented 3 months ago

@yahbish - thanks for the info. Unfortunately, our institute's cluster uses a different scheduling system: we have an LSF-based cluster, and I assume both the authors' and your clusters are Grid Engine (SGE) based - I am not sure the commands are transferable.

yahbish commented 3 months ago

Try this for LSF, which it looks like the authors also use:

#!/bin/bash
#BSUB -J "scarlink_job[1-100]"
#BSUB -o output_%J_%I.log
#BSUB -e error_%J_%I.log
#BSUB -R "rusage[mem=2000]"
#BSUB -R "select[gpu]"
#BSUB -n 1

and for the -p argument, use $LSB_JOBINDEX

The GPU may be optional, but I ran mine with it requested.
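One LSF detail worth flagging (a guess at the failure mode, but a common one): bsub only parses the #BSUB directives when the script is fed on standard input, so submit it with redirection rather than as a plain argument - e.g., assuming the lines above are saved in a file called run_scarlink_array.lsf (name made up):

bsub < run_scarlink_array.lsf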

sid5427 commented 3 months ago

@yahbish Hmm, I tried this... but it just launched one job across 24 cores...

My bsub script looks like this:

#!/bin/bash

#BSUB -J "scarlink_job[1-24]"
#BSUB -W 24:00
#BSUB -R "rusage[mem=2000]"
#BSUB -n 1
#BSUB -e logs_2/scarlink_s2_marker_genes_%J_%I.err
#BSUB -o logs_2/scarlink_s2_marker_genes_%J_%I.out

module load anaconda3
#module load cuda/11.7 ##disabled GPU for now

source activate scarlink-env

echo "running scarlink"
scarlink -o multiome_out_MARKER_GENES_second -g mm10 -np 24 -c R7 -p $LSB_JOBINDEX

Should I be running the job with cat << EOF ?

snehamitra commented 3 months ago

Did you get 24 log files and scarlink_output files?

snehamitra commented 3 months ago

Try this for LSF, which it looks like the authors also use:

#!/bin/bash
#BSUB -J "scarlink_job[1-100]"
#BSUB -o output_%J_%I.log
#BSUB -e error_%J_%I.log
#BSUB -R "rusage[mem=2000]"
#BSUB -R "select[gpu]"
#BSUB -n 1

and for the -p argument, use $LSB_JOBINDEX

The GPU may be optional, but I ran mine with it requested

@yahbish yes, we also use LSF. This version might run faster on the CPU.

sid5427 commented 3 months ago

Hi all,

Yes, I was able to run the array job. There was some issue with the array job settings, which our IT support fixed.

The program produced 100 logs and 100 h5ad files in the output directory. The next step in scarlink was also able to complete.

For future reference, this is the script I ran:

#!/bin/bash

#BSUB -J "scarlink_job[1-100]" 
#BSUB -W 24:00
#BSUB -R "rusage[mem=16G]"
#BSUB -M 16000
#BSUB -n 1
#BSUB -e scarlink__PARALLEL_log_%J_%I.err 
#BSUB -o scarlink__PARALLEL_log_%J_%I.out 

module load anaconda3
source activate scarlink-env

scarlink -o multiome_out -g mm10 -np 100 -c cell_cluster -p $LSB_JOBINDEX
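As a last sanity check before the next scarlink step (the glob patterns below are illustrative guesses, not SCARlink's documented output layout), it is easy to confirm that all 100 array tasks wrote their outputs:

# Both counts should equal the -np value (100 here); adjust the globs to the
# file names your run actually produced.
ls multiome_out/*.log 2>/dev/null | wc -l
ls multiome_out/*.h5* 2>/dev/null | wc -l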