martinjzhang / scDRS

Single-cell disease relevance score (scDRS)
https://martinjzhang.github.io/scDRS/
MIT License

GWAS traits and a control trait #49

Closed Li-mengjie closed 1 year ago

Li-mengjie commented 1 year ago

Hello! I have a problem and hope you can help. I want to include multiple GWAS traits plus a control trait (height) in my study, but I can't figure out how to produce the .gs file. The tutorial lists one trait and height. If I add more traits, do I need to list all the traits and height together and use the intersection of their genes as the columns? The problem with this is that as the number of included traits increases, the number of shared genes decreases, until there are not even enough genes to meet the analysis requirements. I wonder whether I have misunderstood something. I hope to get your reply.

martinjzhang commented 1 year ago

Hi @Li-mengjie , thank you for the question.

There is no need to take the intersection. scDRS processes each trait (one row in the .gs file) separately. In other words, putting all traits in the same file or using a separate file for each trait gives the same results. So you can either put all traits in the same .gs file (one row per trait) or process the traits separately using different .gs files.
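As a rough illustration of the "one row per trait" layout, the sketch below builds the text of a multi-trait .gs file. It assumes the tab-separated two-column layout (TRAIT, GENESET, with comma-separated gene:weight entries) described in the scDRS documentation; `format_gs` is a hypothetical helper, not part of scDRS, and the genes and weights are made up.

```python
def format_gs(trait_to_genes):
    """Build .gs text with one row per trait.

    trait_to_genes: {trait: {gene: weight}}. Each trait keeps its own
    gene list; no intersection across traits is taken.
    """
    lines = ["TRAIT\tGENESET"]
    for trait, gene_weights in trait_to_genes.items():
        geneset = ",".join(f"{gene}:{weight}" for gene, weight in gene_weights.items())
        lines.append(f"{trait}\t{geneset}")
    return "\n".join(lines) + "\n"

# Made-up example: two traits with different (partly overlapping) genes
traits = {
    "ALS": {"GENE_A": 3.2, "GENE_B": 2.9},
    "Height": {"GENE_B": 5.1, "GENE_C": 4.4, "GENE_D": 4.0},
}
gs_text = format_gs(traits)
print(gs_text)
```

Writing `gs_text` to a single file, or writing each trait's row (plus the header) to its own file, should be equivalent according to the reply above.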

Li-mengjie commented 1 year ago

Thank you for your reply. I understand what you mean, but I still have a doubt. When working with multiple study traits and a control trait, do I need to take the gene intersection across them before producing the .gs file? If not, what should be done with the corresponding missing values?

Li-mengjie commented 1 year ago

For example, what should I do with this missing value?

[Screenshot 2022-12-05 121853]

martinjzhang commented 1 year ago

Hi @Li-mengjie, thank you for the follow-up. It depends on the pattern of the missing values. If they are caused by some artifacts in MAGMA and there are not many missing values for each trait (e.g., <10%), I suggest just setting P=1 for each missing gene-trait pair. If you determine that the set of non-missing genes is very different between two traits (e.g., 10% overlap of non-missing genes between ALS and height), then the results of the two traits will not be comparable.
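A minimal sketch of these checks, using plain dictionaries of gene p-values. The helper names and the example values are hypothetical, and the overlap is computed as shared genes divided by the union, which is only one possible way to define "overlap".

```python
def missing_fraction(gene_p, all_genes):
    """Fraction of genes in `all_genes` that have no p-value for one trait."""
    return sum(g not in gene_p for g in all_genes) / len(all_genes)

def fill_missing_with_one(gene_p, all_genes):
    """Return a copy in which every missing gene gets P=1."""
    return {g: gene_p.get(g, 1.0) for g in all_genes}

def nonmissing_overlap(gene_p_a, gene_p_b):
    """Shared non-missing genes divided by the union of non-missing genes."""
    a, b = set(gene_p_a), set(gene_p_b)
    return len(a & b) / len(a | b)

# Made-up MAGMA-style gene p-values for two traits
als = {"GENE_A": 0.01, "GENE_B": 0.20, "GENE_C": 0.50}
height = {"GENE_B": 0.001, "GENE_C": 0.03, "GENE_D": 0.40}
all_genes = sorted(set(als) | set(height))

als_filled = fill_missing_with_one(als, all_genes)
print(missing_fraction(als, all_genes), nonmissing_overlap(als, height))
```

Checking `missing_fraction` per trait before filling helps decide whether the P=1 shortcut is reasonable, and a low `nonmissing_overlap` between two traits is a warning that their results may not be comparable.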

Here is what I would do if I were you. For the main analysis, consider all genes (or genes that are non-missing in 50%, or some other fraction, of the traits). Then add a secondary analysis: take two traits (e.g., ALS and height) and show that the scDRS results using only the shared non-missing genes are highly correlated with the main analysis results.
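The two steps could be sketched roughly as follows. Both helpers are hypothetical, the p-values and per-cell scores are made up, and Pearson correlation is just one reasonable way to compare the two sets of scores.

```python
import math

def genes_nonmissing_in(pvals, min_frac=0.5):
    """Genes that have a p-value in at least `min_frac` of the traits.

    pvals: {trait: {gene: p-value}}; a gene absent from a trait is missing.
    """
    n_traits = len(pvals)
    counts = {}
    for gene_p in pvals.values():
        for gene in gene_p:
            counts[gene] = counts.get(gene, 0) + 1
    return sorted(g for g, c in counts.items() if c / n_traits >= min_frac)

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up p-values for three traits
pvals = {
    "ALS": {"g1": 0.1, "g2": 0.2},
    "Height": {"g2": 0.3},
    "Trait3": {"g2": 0.1, "g3": 0.5},
}
main_genes = genes_nonmissing_in(pvals, min_frac=0.5)

# Made-up per-cell scores from the main and the shared-gene analysis
main_scores = [0.1, 0.8, 1.5, 2.2]
shared_scores = [0.2, 0.9, 1.4, 2.5]
print(main_genes, pearson(main_scores, shared_scores))
```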

Li-mengjie commented 1 year ago

Thanks for your reply. I will have a try.

Li-mengjie commented 1 year ago

Hello! Thank you for your patient answer. I ran into a problem at the next step. [WeChat image 20221205175721]

In the next step, I used an adata with dimensions 133186x14151, but it did not run successfully and no errors were reported. [WeChat image 20221205175734]

Then I used the same data without pre-processing, with dimensions 133186x50939. Again, it did not run successfully and no errors were reported. [WeChat image 20221205175743]

Then I switched to a processed adata with dimensions 41435x26675 and it ran successfully. What requirements should adata fulfill? [WeChat image 20221205175729]

martinjzhang commented 1 year ago

Hi @Li-mengjie , thank you for reporting the issue. Could you provide more details on "didn't run successfully"? Did the software just hang there? Were there any outputs in the .log file?

I am unsure about the cause. It might be a memory issue. Have you allocated enough computing resources? For a data set of size 133186x50939, I think scDRS needs around 32-48 GB of RAM. Filtering out low-quality genes and cells will substantially (and linearly) reduce the computational requirement.
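To get a feel for the scale and for the linear effect of filtering, here is a back-of-envelope calculation. It assumes, purely for illustration, that the matrix is held as a dense array of 4-byte values; the actual memory use of scDRS depends on its internals and on how sparse the data is, so this is not a substitute for the estimate above.

```python
def dense_matrix_gb(n_cells, n_genes, bytes_per_value=4):
    """Size in GB (10**9 bytes) of a dense n_cells x n_genes matrix."""
    return n_cells * n_genes * bytes_per_value / 1e9

full = dense_matrix_gb(133186, 50939)      # unfiltered data from this thread
filtered = dense_matrix_gb(133186, 14151)  # same cells, fewer genes
print(f"full: {full:.1f} GB, filtered: {filtered:.1f} GB, ratio: {filtered / full:.2f}")
```

Under this assumption the size scales with the product of cells and genes, so removing a given fraction of genes (or cells) removes the same fraction of the footprint.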

I suggest using the raw count data for scDRS, but removing low-quality genes and cells.

Li-mengjie commented 1 year ago

There was enough memory. Normally this step would take a long time, but in my case it ended in less than 2 minutes, with no output or error. [WeChat image 20221205182917]

Li-mengjie commented 1 year ago

Hello! I have encountered this situation before, so I suspect it is a data-processing problem. What requirements should the data processing meet?

martinjzhang commented 1 year ago

Hi @Li-mengjie , the data needs to be in .h5ad format, with the adata.X entry storing the raw count matrix. The file is parsed by the anndata read_h5ad function in scDRS. I notice that your input file is in .h5 format instead of .h5ad format. Maybe that caused the issue?

Li-mengjie commented 1 year ago

Hello! Sorry for the mistake. I corrected it later but still did not get a result, and the situation is the same as above. I then noticed that in the tutorial adata.X is unscaled, whereas in my data adata.X was already scaled. I don't know if that's the problem, but I'm trying it now. I hope to get your advice. The following two images show the tutorial data and my data, respectively. [WeChat image 20221205205700] [WeChat image 20221205205705]

martinjzhang commented 1 year ago

Aha! This will indeed cause the issue. Please use the raw count matrix with nonnegative integer values.
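A quick sanity check along these lines can catch scaled or normalized input before a long run. The function below is a hypothetical helper, not part of scDRS; `values` can be any iterable of numbers, for example the entries of adata.X (how those are extracted depends on whether the matrix is dense or sparse and is left out here).

```python
def looks_like_raw_counts(values):
    """Return True if every value is a nonnegative integer (e.g. 0, 3, 17.0)."""
    return all(v >= 0 and float(v).is_integer() for v in values)

raw_like = [0, 1, 5.0, 12]          # made-up raw counts
scaled_like = [-0.37, 0.0, 1.82]    # made-up scaled values (negative, fractional)

print(looks_like_raw_counts(raw_like))
print(looks_like_raw_counts(scaled_like))
```

Scaled data typically contains negative or fractional values, so it fails this check, while a raw count matrix passes.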