martinjzhang / scDRS

Single-cell disease relevance score (scDRS)
https://martinjzhang.github.io/scDRS/
MIT License

trait=z_score: skipped due to small size (n_gene=5, sys_time=116.1s) #82

Closed by Lualululu 3 months ago

Lualululu commented 4 months ago

Hello,

I'm encountering an issue with using a custom .gs file for disease-related SNP analysis. My workflow involves generating a .gs file from disease-related SNP sites, and this particular file includes only 6 genes. However, when I attempt to compute scores using this file, I receive a message indicating that the gene set is too small.

    Call: scdrs compute-score \
    --h5ad-file scRNA_32.h5ad \
    --h5ad-species human \
    --cov-file None \
    --gs-file out_file.gs \
    --gs-species human \
    --ctrl-match-opt mean_var \
    --weight-opt vs \
    --adj-prop None \
    --flag-filter-data True \
    --flag-raw-count True \
    --n-ctrl 1000 \
    --flag-return-ctrl-raw-score False \
    --flag-return-ctrl-norm-score True \
    --out-folder out

    Loading data:
    --h5ad-file loaded: n_cell=184706, n_gene=23748 (sys_time=56.7s)
    First 3 cells: ['210203_A00268_0605_BHWCMWDSXY_AAACCCAAGAGGCCAT-1', '210203_A00268_0605_BHWCMWDSXY_AAACCCAAGATGCGAC-1', '210203_A00268_0605_BHWCMWDSXY_AAACCCAAGCTCTATG-1']
    First 5 genes: ['AL627309.1', 'AL627309.5', 'LINC01409', 'FAM87B', 'LINC01128']
    --gs-file loaded: n_trait=1 (sys_time=56.9s)
    Print info for first 3 traits:
    First 3 elements for 'z_score': ['HLA-B', 'ERAP1', 'KIFAP3'], [5.4383, 5.069, 4.8556]

    Preprocessing:

    Computing scDRS score:
    trait=z_score: skipped due to small size (n_gene=5, sys_time=116.1s)

I am wondering if there is a minimum gene set size requirement for the analysis to proceed? And if so, is there any workaround or recommendation for cases where the gene set naturally contains a small number of genes due to the specificity of the disease-related SNP sites being analyzed?

Any insights or suggestions on how to proceed with such small gene sets would be greatly appreciated.

Thank you for your time and assistance.

martinjzhang commented 4 months ago

Hi,

We recommend using scDRS with gene sets containing at least 10 genes. Applying scDRS to smaller gene sets is probably fine, but the results need to be interpreted with caution.
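Before running the scoring step, it may help to check how many genes from the .gs file are actually present in the dataset. The log reports n_gene=5 for a file said to contain 6 genes, which would be consistent with one gene missing from the data, though the thread does not confirm this. The sketch below is a minimal, standalone check and does not call scDRS. It assumes a .gs data row has the form `TRAIT<TAB>gene1:weight1,gene2:weight2,...`. The helper names `parse_gs_row` and `usable_genes`, the placeholder genes `GENE_D` to `GENE_F`, and their weights are all hypothetical.

```python
def parse_gs_row(row):
    """Split one .gs data row into (trait, genes, weights)."""
    trait, geneset = row.rstrip("\n").split("\t")
    genes, weights = [], []
    for entry in geneset.split(","):
        gene, sep, weight = entry.partition(":")
        genes.append(gene.strip())
        # Entries without an explicit weight default to 1.0 (assumption).
        weights.append(float(weight) if sep else 1.0)
    return trait, genes, weights


def usable_genes(genes, data_genes):
    """Keep only the genes that are present in the dataset."""
    present = set(data_genes)
    return [g for g in genes if g in present]


# Illustrative row: the first three genes and weights come from the log in
# this thread; GENE_D, GENE_E and GENE_F are hypothetical placeholders.
row = "z_score\tHLA-B:5.4383,ERAP1:5.069,KIFAP3:4.8556,GENE_D:4.5,GENE_E:4.2,GENE_F:4.0"
# Hypothetical dataset gene names; GENE_F is deliberately missing.
data_genes = ["HLA-B", "ERAP1", "KIFAP3", "GENE_D", "GENE_E", "LINC01409"]

trait, genes, weights = parse_gs_row(row)
kept = usable_genes(genes, data_genes)
print(f"trait={trait}: {len(genes)} genes in file, {len(kept)} found in data")
```

With real data, `data_genes` would be the gene names of the loaded .h5ad file (for an AnnData object, `adata.var_names`).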

There is a way to work around this check.

  1. Make sure that you installed scDRS from GitHub (not PyPI).
  2. Go to the file ./bin/scdrs in your local scDRS folder.
  3. Comment out the following lines (lines 257-262):
        if len(gene_list) < 10:
            print(
                "trait=%s: skipped due to small size (n_gene=%d, sys_time=%0.1fs)"
                % (trait, len(gene_list), time.time() - sys_start_time)
            )
            continue

Then scDRS will skip this geneset size check.
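If you would rather not remove the check entirely, one alternative is to lower the threshold instead of commenting the block out. The sketch below is a standalone illustration of that idea and is not scDRS code. It mirrors the `len(gene_list) < 10` condition with a configurable cutoff. The function `split_by_size` and the parameter `min_size` are hypothetical names, not scDRS options, and `GENE_D` and `GENE_E` are placeholder genes.

```python
def split_by_size(gene_sets, min_size=10):
    """Return (kept, skipped) dicts based on a minimum gene set size."""
    kept, skipped = {}, {}
    for trait, gene_list in gene_sets.items():
        if len(gene_list) < min_size:
            # Same condition as the original check, with 10 made configurable.
            skipped[trait] = gene_list
        else:
            kept[trait] = gene_list
    return kept, skipped


# A 5-gene set like the one in the log (last two names are placeholders).
gene_sets = {"z_score": ["HLA-B", "ERAP1", "KIFAP3", "GENE_D", "GENE_E"]}

kept_default, skipped_default = split_by_size(gene_sets)              # cutoff of 10
kept_relaxed, skipped_relaxed = split_by_size(gene_sets, min_size=1)  # relaxed cutoff

print("skipped with default cutoff:", list(skipped_default))
print("kept with relaxed cutoff:", list(kept_relaxed))
```

Either way, scores from gene sets this small should be interpreted with caution, as noted above.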