Data sparsity is a critical problem for the inference stability (ex. 10x data)

yyoshiaki commented 1 year ago

Hi, Thank you for the wonderful tool. I noticed that the detection power for sparse data such as 10x is very low. Imputation improved the stability, but I want to avoid using imputation if possible because it can distort the data.

I'm attaching an example using the pbmc3k dataset. Regarding attaching examples, autoimmune diseases are not still significant... Is there any way to deal with these?

Disease scores for raw expression value notebook

Disease scores for imputed expression value notebook

codes for the results

martinjzhang commented 1 year ago

Hi,

Thank you for the question. The imputation results look great!

Another possibility for the low power is that the background cells are also moderately relevant to the disease (PBMCs) and may kill the signal. We are working on a feature that could potentially address this issue (see https://github.com/martinjzhang/scDRS/issues/30#issuecomment-1249971502).

Have you tried adjusting for cell type proportions using scdrs compute-score --adj-prop? Monocyte count seems to have a signal in the raw data results. So maybe the cell type proportions are not very balanced and the disease signals associated to the major cell types are killed because scDRS uses all cells from the data set as the background (which are mostly from the major cell types).

yyoshiaki commented 1 year ago

Thank you very much for your quick response. Regarding #30, I'm looking forward to seeing the new feature or tutorials.

I also tried --adj-prop, but the following error occurred on the pbmc3k dataset... I'm attaching commands and codes for reproducing the issue.

scdrs compute-score --h5ad-file ../scanpy/pbmc/pbmc.h5ad --h5ad-species human --gs-file ../data/gs_file/magma_10kb_top1000_zscore.75_traits.batch/batch0.gs --gs-species human --cov-file ../scanpy/pbmc/pbmc.cov.tsv --flag-filter-data False --flag-raw-count True --flag-return-ctrl-raw-score False --flag-return-ctrl-norm-score True --adj-prop cell_type --out-folder ../scanpy/pbmc/scDRS_nofilter_adj-prop
******************************************************************************
* Single-cell disease relevance score (scDRS)
* Version 1.0.2
* Martin Jinye Zhang and Kangcheng Hou
* HSPH / Broad Institute / UCLA
* MIT License
******************************************************************************
Call: scdrs compute-score \
--h5ad-file ../scanpy/pbmc/pbmc.h5ad \
--h5ad-species human \
--cov-file ../scanpy/pbmc/pbmc.cov.tsv \
--gs-file ../data/gs_file/magma_10kb_top1000_zscore.75_traits.batch/batch0.gs \
--gs-species human \
--ctrl-match-opt mean_var \
--weight-opt vs \
--adj-prop cell_type \
--flag-filter-data False \
--flag-raw-count True \
--n-ctrl 1000 \
--flag-return-ctrl-raw-score False \
--flag-return-ctrl-norm-score True \
--out-folder ../scanpy/pbmc/scDRS_nofilter_adj-prop

Loading data:
--h5ad-file loaded: n_cell=2638, n_gene=32738 (sys_time=0.1s)
First 3 cells: ['AAACATACAACCAC-1', 'AAACATTGAGCTAC-1', 'AAACATTGATCAGC-1']
First 5 genes: ['MIR1302-10', 'FAM138A', 'OR4F5', 'RP11-34P13.7', 'RP11-34P13.8']
--cov-file loaded: covariates=['n_genes', 'const'] (sys_time=0.1s)
First 5 values for 'n_genes': [781, 1352, 1131, 960, 522]
First 5 values for 'const': [1, 1, 1, 1, 1]
--gs-file loaded: n_trait=5 (sys_time=0.1s)
Print info for first 3 traits:
First 3 elements for 'PASS_ADHD_Demontis2018': ['ST3GAL3', 'KDM4A', 'PTPRF'], [6.4588, 6.2164, 5.8681]
First 3 elements for 'PASS_Alzheimers_Jansen2019': ['BCAM', 'CEACAM16', 'PVR'], [8.2095, 7.7059, 7.3852]
First 3 elements for 'PASS_AtrialFibrillation_Nielsen2018': ['KCNN3', 'DCST1', 'CAV2'], [7.911, 7.8622, 7.8516]

Preprocessing:
/home/yyasumizu/Programs/scDRS/scdrs/pp.py:378: RuntimeWarning: invalid value encountered in log10
  x = np.log10(df_gene["ct_mean"].values[not_const])
Traceback (most recent call last):
  File "/home/yyasumizu/anaconda3/envs/scDRS/bin/scdrs", line 740, in <module>
    fire.Fire()
  File "/home/yyasumizu/anaconda3/envs/scDRS/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/yyasumizu/anaconda3/envs/scDRS/lib/python3.9/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/yyasumizu/anaconda3/envs/scDRS/lib/python3.9/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/yyasumizu/anaconda3/envs/scDRS/bin/scdrs", line 211, in compute_score
    scdrs.preprocess(
  File "/home/yyasumizu/Programs/scDRS/scdrs/pp.py", line 260, in preprocess
    df_gene, df_cell = compute_stats(
  File "/home/yyasumizu/Programs/scDRS/scdrs/pp.py", line 380, in compute_stats
    model.fit()
  File "_loess.pyx", line 899, in _loess.loess.fit
ValueError: b'Extrapolation not allowed with blending'

Installation of scDRS

conda create -n scDRS python=3.9
conda activate scDRS

git clone https://github.com/martinjzhang/scDRS.git
cd scDRS
git checkout -b v102 v1.0.2
pip install -e .

Preps of datasets

martinjzhang commented 1 year ago

Hi,

Could you check that adata.obs.columns contains the column cell_type (input to adj-prop) and that your data doesn't contain negative values? If so and the error persists, could you send us a minimal reproducible example?

yyoshiaki commented 1 year ago

Thank you, it contains the cell_type column.

In [8]: adata.obs
Out[8]: 
                        cell_type  n_genes
cell_id                                   
AAACATACAACCAC-1      CD4 T cells      781
AAACATTGAGCTAC-1          B cells     1352
AAACATTGATCAGC-1      CD4 T cells     1131
AAACCGTGCTTCCG-1  CD14+ Monocytes      960
AAACCGTGTATGCG-1         NK cells      522

Here is the notebook for the preparation https://github.com/yyoshiaki/scDRS_example/blob/master/notebook/prep_data.ipynb and h5ad https://github.com/yyoshiaki/scDRS_example/blob/master/scanpy/pbmc/pbmc.h5ad file.

martinjzhang commented 1 year ago

Hi @yyoshiaki ,

Thank you for the information. The error corresponds to a numerical issue due to the inclusion of many genes with 0 expression (16159/32738 of genes in the data). This is a boundary case we are not handling perfectly right now, so scDRS may or may not give errors when you try different things. This probably explains why the error didn't occur before. I will add it to my to-do list.

I suggest turning on the --flag-filter-data True flag to avoid this boundary case. This flag filters out genes expressed in <50 cells and cells expressing <250 genes.

martinjzhang commented 1 year ago

Hi @yyoshiaki ,

Thank you again for reporting the issue. We just pushed a new version v1.0.3 to fix this issue. Now you can run scDRS without turning on --flag-filter-data.

To use this version, please pull info from the latest git repo and update using pip install -e .. We will upload this version to PyPI after we have made a good number of changes to the software.

yyoshiaki commented 1 year ago

Hi Martin,

Thank you very much for the quick response and fixation. I could confirm --adj-prop without --flag-filter-data used v1.0.3. I've actually turned off the filter option because for genes only expressed in small clusters, 50 cells seems a bit stringent. It resulted in generating the boundary case.

Anyway, adj-prop slightly changed the result. I'm not sure whether it is an improvement, but I will play around using several datasets.

Thank you again for making such a nice tool and maintaining it.

Yoshi

left : no --adj-prop, right : --adj-prop

cf. # cells for each cluster

CD4 T cells          1144
CD14+ Monocytes       480
B cells               342
CD8 T cells           316
NK cells              154
FCGR3A+ Monocytes     150
Dendritic cells        37
Megakaryocytes         15

martinjzhang commented 1 year ago

Thank you @yyoshiaki for sharing the results w/ and w/o --adj-prop. These results are helpful for informing our future investigation!

Best, Martin

maryellenlynall commented 1 year ago

Helpful to see this exchange - thank you. I too am having issues with not seeing an expected signal in 10X data so keen to try imputing - can I ask if either of you have a tool / settings you recommend for imputation given you got good improvement with imputation here. Thanks!

martinjzhang commented 1 year ago

Hi @maryellenlynall

Try MAGIC van Dijk Cell 2018. Our preliminary results show that it improves power and provides conservative estimates.

maryellenlynall commented 1 year ago

Super, thanks!

martinjzhang / scDRS

Data sparsity is a critical problem for the inference stability (ex. 10x data) #32

Disease scores for raw expression value notebook

Disease scores for imputed expression value notebook