martinjzhang / scDRS

Single-cell disease relevance score (scDRS)
https://martinjzhang.github.io/scDRS/
MIT License

Questions about reproducing results #100

Closed HelloWorldLTY closed 3 weeks ago

HelloWorldLTY commented 3 weeks ago

Hi, thanks for your great work. I downloaded the related files and tried to reproduce the score analysis for Alzheimer's disease with the provided score list. However, after unzipping the score file, the Alzheimer's GWAS score file in our output is named:

PASS_Alzheimers_Jansen2019.score.gz,

which is not a full_score file. Therefore, when I run the scDRS analysis in this step:

for trait in ["PASS_Alzheimers_Jansen2019"]:
    !scdrs perform-downstream \
        --h5ad-file integrated_data.h5ad \
        --score-file data/{trait}.score.gz \
        --out-folder data/ \
        --group-analysis celltype \
        --flag-filter-data True \
        --flag-raw-count True

I receive this error:

AssertionError: Expect scDRS .full_score.gz files for score_file

The .gz file was generated with:

!scdrs compute-score \
    --h5ad-file "./integrated_data.h5ad" \
    --h5ad-species human \
    --gs-file data/gs_file/magma_10kb_top1000_zscore.74_traits.rv1.gs \
    --gs-species human \
    --cov-file None \
    --flag-filter-data True \
    --flag-raw-count True \
    --flag-return-ctrl-raw-score False \
    --flag-return-ctrl-norm-score True \
    --out-folder data/

Did I miss anything? Thanks.

martinjzhang commented 3 weeks ago

Please use the PASS_Alzheimers_Jansen2019.full_score.gz file instead of the PASS_Alzheimers_Jansen2019.score.gz file as input for perform-downstream
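
A minimal sketch for checking which score files actually exist before calling perform-downstream. It assumes the file naming mentioned in this thread (`<trait>.score.gz` and `<trait>.full_score.gz` in the output folder); the helper name `find_score_files` is hypothetical and not part of scDRS.

```python
from pathlib import Path


def find_score_files(out_folder, trait):
    """Report which score files exist for a trait.

    Hypothetical helper, not part of scDRS. The file names follow the
    ones mentioned in this thread.
    """
    folder = Path(out_folder)
    candidates = {
        "score": folder / f"{trait}.score.gz",
        "full_score": folder / f"{trait}.full_score.gz",
    }
    # Map each kind of file to True/False depending on whether it is on disk.
    return {kind: path.exists() for kind, path in candidates.items()}
```

For example, `find_score_files("data/", "PASS_Alzheimers_Jansen2019")` returning `False` for `"full_score"` would indicate that the file perform-downstream asks for is not there yet.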

HelloWorldLTY commented 3 weeks ago

After running compute-score with "data/gs_file/magma_10kb_top1000_zscore.74_traits.rv1.gs", no "PASS_Alzheimers_Jansen2019.full_score.gz" file was generated. Did I miss anything?

HelloWorldLTY commented 3 weeks ago

If I subset only the Alzheimer's-related entry, this is the log:

******************************************************************************
* Single-cell disease relevance score (scDRS)
* Version 1.0.2
* Martin Jinye Zhang and Kangcheng Hou
* HSPH / Broad Institute / UCLA
* MIT License
******************************************************************************
Call: scdrs compute-score \
--h5ad-file ./integrated_data.h5ad \
--h5ad-species human \
--cov-file None \
--gs-file data/gs_file/alzeh.gs \
--gs-species human \
--ctrl-match-opt mean_var \
--weight-opt vs \
--adj-prop None \
--flag-filter-data True \
--flag-raw-count True \
--n-ctrl 1000 \
--flag-return-ctrl-raw-score False \
--flag-return-ctrl-norm-score True \
--out-folder data/

Loading data:
--h5ad-file loaded: n_cell=92596, n_gene=15890 (sys_time=17.1s)
First 3 cells: ['AAACAGCCAACATAAG-1-0-0-0-0-0-0-0', 'AAACAGCCAACTAACT-1-0-0-0-0-0-0-0', 'AAACAGCCAAGCCAGA-1-0-0-0-0-0-0-0']
First 5 genes: ['A1bg', 'A1bg-as1', 'A2m', 'A2ml1', 'A2ml1-as1']
--gs-file loaded: n_trait=0 (sys_time=17.1s)
Print info for first 3 traits:

Preprocessing:

Computing scDRS score:

martinjzhang commented 3 weeks ago

Your run didn't read in the gs file (the log shows n_trait=0). Can you double-check?

HelloWorldLTY commented 3 weeks ago

df_gs = pd.read_csv("data/gs_file/magma_10kb_top1000_zscore.74_traits.rv1.gs", sep="\t", index_col=0)

df_gs.loc['PASS_Alzheimers_Jansen2019'].to_csv('data/gs_file/alzeh.gs', index=False, header=False, sep='\t')

Is this the correct code to subset the GWAS .gs file? Thanks. If so, I don't think there is a problem with my data input.

martinjzhang commented 3 weeks ago

Check whether your data/gs_file/alzeh.gs has the same format as described at https://martinjzhang.github.io/scDRS/file_format.html#gs
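
A minimal sketch of subsetting a .gs file while keeping its layout intact. It assumes the format described on the linked page: a tab-separated file whose first line is a header (`TRAIT` and `GENESET` columns) followed by one row per trait. The function name `subset_gs` is hypothetical and not part of scDRS. Copying the header and the matching rows verbatim avoids dropping the header or the trait column, which may be why a run reports n_trait=0.

```python
def subset_gs(src_path, dst_path, traits):
    """Copy selected traits from a tab-separated .gs file.

    Hypothetical helper, not part of scDRS. Assumes the first line is a
    header and the first column of each row is the trait name.
    Returns the number of trait rows written.
    """
    wanted = set(traits)
    kept = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        # Keep the header line exactly as it appears in the source file.
        dst.write(src.readline())
        for line in src:
            trait = line.split("\t", 1)[0]
            if trait in wanted:
                dst.write(line)  # copy the row unchanged, trait name included
                kept += 1
    return kept
```

A call such as `subset_gs("data/gs_file/magma_10kb_top1000_zscore.74_traits.rv1.gs", "data/gs_file/alzeh.gs", ["PASS_Alzheimers_Jansen2019"])` should return 1 if the trait is present in the source file.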

HelloWorldLTY commented 3 weeks ago

Hi, the format looks good to me:

<img width="790" alt="image" src="https://github.com/user-attachments/assets/9dba44d6-c3fb-42ab-bf5f-81ba3d1e5a45">

I will try both the subset and the full set of traits again and get back to you later, thanks.

HelloWorldLTY commented 3 weeks ago

Hi, I tried the method again, and I still cannot find the full_score file. Only the score file exists:

PASS_Alzheimers_Jansen2019.score.gz

Is it caused by this part of the log:

Preprocessing:

Computing scDRS score:
trait=PASS_Alzheimers_Jansen2019: skipped due to small size (n_gene=7, sys_time=22.5s)

Does this mean my sample size is too small? Thanks.

martinjzhang commented 3 weeks ago

Your gene set is too small (scDRS requires >10 genes), so I don't think scDRS has output anything. PASS_Alzheimers_Jansen2019.score.gz is a file that already existed.
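
A minimal sketch for checking how many gene-set genes are actually present in the dataset before running compute-score. It assumes a GENESET string of comma-separated genes with optional `:weight` suffixes, as in the file format page linked above. The helper names are hypothetical, and the threshold simply restates the ">10 genes" requirement from this comment. One possible cause of a small overlap, not confirmed in this thread, is a gene-symbol mismatch: the log above lists title-case names such as 'A1bg', so comparing exact and case-insensitive overlap may be informative.

```python
MIN_GENES = 10  # from the ">10 genes" requirement stated in this thread


def parse_geneset(geneset):
    """Split 'GENE:weight,GENE2:weight' (weights optional) into gene names."""
    genes = []
    for item in geneset.split(","):
        item = item.strip()
        if item:
            genes.append(item.split(":", 1)[0])
    return genes


def count_overlap(gs_genes, data_genes, ignore_case=False):
    """Count gene-set genes that also appear in the dataset's gene names."""
    norm = (lambda g: g.upper()) if ignore_case else (lambda g: g)
    return len({norm(g) for g in gs_genes} & {norm(g) for g in data_genes})


def large_enough(n_overlap):
    """True if the overlap exceeds the threshold quoted in this thread."""
    return n_overlap > MIN_GENES


# Toy values for illustration only; the weights are made up.
gs = parse_geneset("A1BG:3.2,A2M:2.9,A2ML1:2.1")
data_genes = ["A1bg", "A2m", "A2ml1"]
print(count_overlap(gs, data_genes))                    # exact match: 0
print(count_overlap(gs, data_genes, ignore_case=True))  # ignoring case: 3
```

If the case-insensitive count is much larger than the exact count, the species or gene-symbol convention of the h5ad file would be worth double-checking.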

HelloWorldLTY commented 3 weeks ago

Thanks, that makes sense.