martinjzhang / scDRS

Single-cell disease relevance score (scDRS)
https://martinjzhang.github.io/scDRS/
MIT License
105 stars 13 forks source link

Loess error in compute-score after subsampling #25

Closed twoneu closed 2 years ago

twoneu commented 2 years ago

Hi scDRS team,

I am encountering a loess error when trying to run a subsampled, lognormalized h5ad file through scdrds compute-score:

Call: scdrs compute-score \ --h5ad-file ordovas_immune_downsampled.h5ad\ --h5ad-species human\ --cov-file None\ --gs-file all.gs\ --gs-species human\ --ctrl-match-opt mean_var\ --weight-opt vs\ --adj-prop None\ --flag-filter-data False\ --flag-raw-count False\ --n-ctrl 1000\ --flag-return-ctrl-raw-score False\ --flag-return-ctrl-norm-score True\ --out-folder scDRS/immune

Loading data: --h5ad-file loaded: n_cell=10727, n_gene=5000 (sys_time=0.2s) First 3 cells: ['N51.LPB.TCAGCAACATACGCTA-0', 'N111.LPA2.CTTCTCTGTGGCAAAC-0', 'N20.LPA.TGGATGTGCACTTT-0'] First 5 genes: ['A2M', 'A2M-AS1', 'A2ML1', 'AAED1', 'AAK1'] --gs-file loaded: n_trait=6 (sys_time=0.2s) Print info for first 3 traits: First 3 elements for 'ukbb_CRC': ['CLTCL1', 'DAB2IP', 'NBPF1'], [4.0019, 3.5014, 3.4717] First 3 elements for 'ukbb_IBD': ['ZNF236', 'C3orf38', 'LAMB2'], [3.17, 3.1623, 3.1358] First 3 elements for 'alkes_UC': ['FCGR2A', 'IL23R', 'DLD'], [7.964, 7.0309, 6.897]

Preprocessing: /Users/user/opt/anaconda3/lib/python3.9/site-packages/scdrs/pp.py:323: RuntimeWarning: invalid value encountered in log10 x = np.log10(df_gene["ct_mean"].values[not_const]) Traceback (most recent call last): File "/Users/user/opt/anaconda3/bin/scdrs", line 723, in fire.Fire() File "/Users/user/opt/anaconda3/lib/python3.9/site-packages/fire/core.py", line 141, in Fire component_trace = _Fire(component, args, parsed_flag_args, context, name) File "/Users/user/opt/anaconda3/lib/python3.9/site-packages/fire/core.py", line 466, in _Fire component, remaining_args = _CallAndUpdateTrace( File "/Users/user/opt/anaconda3/lib/python3.9/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace component = fn(*varargs, **kwargs) File "/Users/user/opt/anaconda3/bin/scdrs", line 211, in compute_score scdrs.preprocess(adata, cov=df_cov, adj_prop=ADJ_PROP, n_mean_bin=20, n_var_bin=20, copy=False) File "/Users/user/opt/anaconda3/lib/python3.9/site-packages/scdrs/pp.py", line 205, in preprocess df_gene, df_cell = compute_stats( File "/Users/user/opt/anaconda3/lib/python3.9/site-packages/scdrs/pp.py", line 325, in compute_stats model.fit() File "_loess.pyx", line 899, in _loess.loess.fit ValueError: b'Extrapolation not allowed with blending'

I used the following code to generate the subsampled dataset from the original data:

crc_immune_adata = sc.read_h5ad("ordovas_immune_processed.h5ad")
target_cells = 250

crc_immune_adata_ds = [crc_immune_adata[crc_immune_adata.obs.Cluster == clust] for clust in crc_immune_adata.obs.Cluster.cat.categories]

for dat in crc_immune_adata_ds:
    if dat.n_obs > target_cells:
         sc.pp.subsample(dat, n_obs=target_cells)

crc_immune_adata_downsampled = crc_immune_adata_ds[0].concatenate(*crc_immune_adata_ds[1:])
crc_immune_adata_downsampled.write("ordovas_immune_downsampled_250.h5ad")

The subsampled dataset is attached here: ordovas_immune_downsampled_250.h5ad.zip

Thanks for all your time and help!

KangchengHou commented 2 years ago
/Users/user/opt/anaconda3/lib/python3.9/site-packages/scdrs/pp.py:323: RuntimeWarning: invalid value encountered in log10
x = np.log10(df_gene["ct_mean"].values[not_const])

looks like there is a log(0) problem (which is quite strange). Somehow I am not able to read h5ad file. Could you send a raw text file (perhaps .csv)?

twoneu commented 2 years ago

ordovas_immune_downsampled.zip Hopefully this works! I'm very sorry if it ends up being a data format issue - I really appreciate your time and patience.

KangchengHou commented 2 years ago

@twoneu I was being confusing. I was referring to the expression matrix in .csv format (rows are cell, columns are genes). I can not open the h5ad matrix somehow.

Using df = pd.DataFrame(data=adata.X, index=adata.obs_names, column=adata.var_names) should work.

PS. within scdrs, there is a step for calculating the ct_mean (gene-wise count matrix average) where scdrs transform your data in log-transformed space to count space using np.expm1. One thing to check is to make sure your log-transformed expression values are all >= 0.

twoneu commented 2 years ago

Thank you so much - it turns out there was one NaN within my matrix that was throwing the error. This package is great and so are you!

KangchengHou commented 2 years ago

that's great! we will do some more checking on the input matrix in next version (including NaN, expression values >= 0 for log-transformed matrix).