martinjzhang / scDRS

Single-cell disease relevance score (scDRS)
https://martinjzhang.github.io/scDRS/
MIT License

compute-score can not complete #46

Closed Dan-121 closed 1 year ago

Dan-121 commented 1 year ago

Hi, thanks for developing such a helpful tool. I have run into a problem recently: when I run compute-score, the command never finishes. Could you please help me with it? Here are the command and its output.


Call: scdrs compute-score \
--h5ad-file /data4/scDRS/data/cere/expr.h5ad \
--h5ad-species human \
--cov-file /data4/scDRS/data/cere/cov.tsv \
--gs-file /data4/scDRS/data/cere/processed_geneset.gs \
--gs-species human \
--ctrl-match-opt mean_var \
--weight-opt vs \
--adj-prop None \
--flag-filter-data True \
--flag-raw-count True \
--n-ctrl 1000 \
--flag-return-ctrl-raw-score False \
--flag-return-ctrl-norm-score True \
--out-folder /data4/scDRS/data/cere/out

Loading data:
--h5ad-file loaded: n_cell=62247, n_gene=23202 (sys_time=7.0s)
First 3 cells: ['E083_AAACCCAAGGGCTGAT-1', 'E083_AAACCCACAGGCAATG-1', 'E083_AAACCCACAGTATACC-1']
First 5 genes: ['AL627309.1', 'AL627309.5', 'LINC01409', 'FAM87B', 'LINC01128']
--cov-file loaded: covariates=['const', 'n_genes', 'timepoint'] (sys_time=7.0s)
First 5 values for 'const': [1, 1, 1, 1, 1]
First 5 values for 'n_genes': [3861, 4883, 5453, 2459, 5002]
First 5 values for 'timepoint': ['E083', 'E083', 'E083', 'E083', 'E083']
--gs-file loaded: n_trait=3 (sys_time=7.0s)
Print info for first 3 traits:
First 3 elements for 'SCZ': ['NRGN', 'DPYD', 'RBFOX1'], [7.6558, 7.6519, 7.3247]
First 3 elements for 'CEREV': ['RNF11', 'CDKN2C', 'TRRAP'], [6.4221, 6.1533, 6.1347]
First 3 elements for 'Height': ['WWOX', 'BNC2', 'GMDS'], [10.0, 10.0, 10.0]

Preprocessing:
scdrs.pp.category2dummy: Detected categorical columns: timepoint.
Added dummy columns: timepoint_E093,timepoint_E101,timepoint_E102,timepoint_E108,timepoint_E117.
Dropped columns: timepoint.

Computing scDRS score:
Trait=SCZ, n_gene=898: 165/62247 FDR<0.1 cells, 469/62247 FDR<0.2 cells (sys_time=839.1s)
Trait=CEREV, n_gene=819: 0/62247 FDR<0.1 cells, 0/62247 FDR<0.2 cells (sys_time=1529.8s)


The process is still running two days later. Could you please help with the problem? Looking forward to your reply, thanks.

martinjzhang commented 1 year ago

Hi @dandata123-tech , it seems scDRS completed for the first two traits (SCZ & CEREV), each taking around 800 seconds. If this is true, the software should have output the .score.gz and .full_score.gz files for the first two traits. Could you confirm it? It is indeed weird that the software got stuck when processing the third trait, which should take around the same time to complete (~800s). We can look into it if you can provide a minimal reproducible example.
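For anyone wanting to confirm this quickly, here is a minimal Python sketch that reports which traits are missing either output file. It assumes the outputs are named `<trait>.score.gz` and `<trait>.full_score.gz` inside the `--out-folder`; that naming and the helper name `traits_missing_outputs` are assumptions for illustration, not something stated in this thread. The file list below is made up for the example.

```python
def traits_missing_outputs(traits, filenames):
    """Return the traits that lack one or both expected output files.

    Assumes (hypothetically) that each trait produces
    '<trait>.score.gz' and '<trait>.full_score.gz'.
    """
    present = set(filenames)
    suffixes = (".score.gz", ".full_score.gz")
    return [
        trait
        for trait in traits
        if not all(f"{trait}{suffix}" in present for suffix in suffixes)
    ]


# Illustrative listing only; in practice pass os.listdir(<your out-folder>).
example_files = [
    "SCZ.score.gz",
    "SCZ.full_score.gz",
    "CEREV.score.gz",
    "CEREV.full_score.gz",
]
missing = traits_missing_outputs(["SCZ", "CEREV", "Height"], example_files)
print(missing)
```

To run it against a real folder, replace `example_files` with `os.listdir("/data4/scDRS/data/cere/out")` (the `--out-folder` from the call above).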

Dan-121 commented 1 year ago

Hi, thanks for the timely reply. I do get the .score.gz and .full_score.gz files for the first two traits, but the run gets stuck on the third trait. If I change the order of the traits in the .gs file, I still get output for the first two traits and the run gets stuck on the third. It works fine when I run the example with our data.

martinjzhang commented 1 year ago

Hi @dandata123-tech ,

I suspect that your .gs file contains illegal values (such as NA or negative values for the gene weights). Please refer to https://martinjzhang.github.io/scDRS/file_format.html#gs for an example of the .gs file.
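A rough Python sketch of such a check follows. It assumes the .gs layout described on the linked page: a tab-separated file with a header row, a trait name in the first column, and a comma-separated gene list in the second, where each gene may carry a weight as `GENE:WEIGHT`. The function name `find_bad_gs_entries` is hypothetical and this is not part of scDRS.

```python
import math


def find_bad_gs_entries(gs_text):
    """Return {trait_or_line: [problem descriptions]} for the text of a .gs file.

    Assumes a header line followed by 'TRAIT<TAB>GENE:WEIGHT,GENE:WEIGHT,...' rows.
    Unweighted entries (no ':') are accepted as-is.
    """
    problems = {}
    for line_no, line in enumerate(gs_text.splitlines()[1:], start=2):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            problems.setdefault(f"line {line_no}", []).append(
                "expected exactly 2 tab-separated fields"
            )
            continue
        trait, geneset = fields
        for entry in geneset.split(","):
            gene, sep, weight = entry.partition(":")
            if not gene.strip():
                problems.setdefault(trait, []).append(f"empty gene name in {entry!r}")
            if not sep:
                continue  # no weight given for this gene
            try:
                value = float(weight)
            except ValueError:
                # Catches strings such as 'NA' that are not numbers at all.
                problems.setdefault(trait, []).append(f"non-numeric weight in {entry!r}")
                continue
            if math.isnan(value) or math.isinf(value) or value < 0:
                problems.setdefault(trait, []).append(f"illegal weight in {entry!r}")
    return problems


# Made-up sample: the 'BAD' row is deliberately broken for illustration.
sample = (
    "TRAIT\tGENESET\n"
    "SCZ\tNRGN:7.6558,DPYD:7.6519\n"
    "BAD\tWWOX:NA,BNC2:-1.0\n"
)
report = find_bad_gs_entries(sample)
print(report)
```

An empty result only means that none of these particular problems were found; it does not prove the file is valid for scDRS.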

As diagnostics, you can create 3 separate .gs files for the 3 traits to see which one gives you the error. scDRS processes each trait independently, so running scDRS on the 3 separate .gs files should not change the results.
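The split itself can be sketched in Python as below. It assumes the .gs file has one header line followed by one tab-separated row per trait; the helper names `split_gs_text` and `write_split_gs` are hypothetical and not part of scDRS.

```python
from pathlib import Path


def split_gs_text(gs_text):
    """Map '<trait>.gs' to single-trait file contents (header + one row)."""
    lines = [line for line in gs_text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    per_trait = {}
    for row in rows:
        trait = row.split("\t", 1)[0]  # first column is the trait name
        per_trait[f"{trait}.gs"] = f"{header}\n{row}\n"
    return per_trait


def write_split_gs(gs_path, out_dir):
    """Write one .gs file per trait into out_dir and return the new paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, text in split_gs_text(Path(gs_path).read_text()).items():
        target = out_dir / name
        target.write_text(text)
        written.append(target)
    return written


# Small in-memory demonstration with two shortened rows.
sample = (
    "TRAIT\tGENESET\n"
    "SCZ\tNRGN:7.6558,DPYD:7.6519\n"
    "Height\tWWOX:10.0,BNC2:10.0\n"
)
parts = split_gs_text(sample)
print(sorted(parts))
```

Each resulting file can then be passed to `--gs-file` in a separate `scdrs compute-score` run, keeping all other options unchanged.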

Dan-121 commented 1 year ago

Hi, thanks for your timely reply. I checked the .gs file and found no illegal values, and I also tried your sample .gs file. However, something goes wrong when I run each trait independently. Here is the error.


Task exception was never retrieved
future: <Task finished name='Task-13' coro=<ScriptMagics.shebang.<locals>._handle_stream() done, defined at /home/user/anaconda3/envs/dictys/lib/python3.10/site-packages/IPython/core/magics/script.py:211> exception=ValueError('Separator is not found, and chunk exceed the limit')>
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/dictys/lib/python3.10/asyncio/streams.py", line 525, in readline
    line = await self.readuntil(sep)
  File "/home/user/anaconda3/envs/dictys/lib/python3.10/asyncio/streams.py", line 603, in readuntil
    raise exceptions.LimitOverrunError(
asyncio.exceptions.LimitOverrunError: Separator is not found, and chunk exceed the limit

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/anaconda3/envs/dictys/lib/python3.10/site-packages/IPython/core/magics/script.py", line 213, in _handle_stream
    line = (await stream.readline()).decode("utf8")
  File "/home/user/anaconda3/envs/dictys/lib/python3.10/asyncio/streams.py", line 534, in readline
    raise ValueError(e.args[0])
ValueError: Separator is not found, and chunk exceed the limit


Could you please help with the problem? Looking forward to your reply, thanks.

martinjzhang commented 1 year ago

Hi @dandata123-tech

Thank you for following up. I am unable to identify the issue; the best way forward is to provide a minimal reproducible example. In the meantime, here is my guess: the error "ValueError: Separator is not found, and chunk exceed the limit" seems to indicate that scDRS couldn't parse the delimiters in your .gs file (\t or comma). Maybe it contains some non-English characters?
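One further observation, offered as a hint rather than a diagnosis: every frame in the traceback above is in IPython's script magic (`IPython/core/magics/script.py`) or in Python's `asyncio/streams.py`, and none is in scDRS. The same message is what `asyncio.StreamReader.readline()` raises when it receives more bytes than its buffer limit without seeing a newline. The sketch below reproduces only that message in isolation; whether an overlong output line is what happened in the run above is an assumption, and `read_long_line` is a hypothetical helper name.

```python
import asyncio


async def read_long_line(limit=1024):
    """Feed a newline-free chunk larger than the limit and report the error."""
    reader = asyncio.StreamReader(limit=limit)
    reader.feed_data(b"x" * (limit * 2))  # twice the limit, no b"\n" anywhere
    reader.feed_eof()
    try:
        await reader.readline()
    except ValueError as exc:
        return str(exc)
    return "no error"


message = asyncio.run(read_long_line())
print(message)
```

If this is the mechanism at play, running the same `scdrs compute-score` command in a plain terminal instead of a notebook cell would be one way to check.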

martinjzhang commented 1 year ago

Hi @dandata123-tech

Thank you for following up. Great that you have identified the issue.

Your procedures look about right. You can refer to this post for using MAGMA.

Dan-121 commented 1 year ago

Thank you.