martinjzhang / scDRS

Single-cell disease relevance score (scDRS)
https://martinjzhang.github.io/scDRS/
MIT License
105 stars 13 forks source link

Problems with input data #29

Closed Mia1037 closed 1 year ago

Mia1037 commented 1 year ago

Hi,

I have two questions that I would like to ask for your guidance.

  1. The examples I saw in the article were all scRNA-seq datas from multiple tissues associated with disease. Because I only want to focus on one disease, and I only have scRNA-seq data from one tissue, I wanted to ask if I could use just one scRNA-seq data from one tissue.
  2. Is there a more detailed tutorial on how to get from public scRNA-seq data to .h5ad file required for scDRS?

Any advice would be greatly appreciated!

martinjzhang commented 1 year ago

Hi,

Q1: Use data from a single tissue?

This is fine. We analyzed several such data sets in the paper (e.g., from blood, brain, or liver). However, you may have less detection power because scDRS implicitly compares a cell to other cells in the data set, and cells from the right tissue are more likely to be causal. Please see more details in discussion point 2 in the paper, copied below.

In the paper, we had sufficient power in the brain and liver data sets, but less power in the T cell data sets. For the former, we can do both individual cell-level and aggregate analyses. For the latter, we only performed aggregate analyses (associating diseases with fine-grained cell subtype annotations).

Second, the fact that scDRS assesses the statistical significance of an individual cell’s association to disease by implicitly comparing it to other cells via matched control genes may reduce power if most cells in the data are truly causal. For example, association with IBD in a data set containing only Treg cells (one of the causal cell types for IBD) will likely yield largely nonsignificant results. This limitation did not impact our main analyses, because the TMS data include a broad set of cell types; in more specialized data sets (which may be preferred in some settings due to the more comprehensive profiling of the focal cell population), this limitation can potentially be addressed by selecting matched control genes based on a broad cell atlas (for example, the TMS or TS data).

Best, Martin

martinjzhang commented 1 year ago

Hi,

Q2: How to get the .h5ad file?

I used python via the following steps:

  1. Download the scRNA-seq raw count matrix (.csv or .tsv format) along with the gene-level and cell-level covariate files (.csv or .tsv format). Since the file formats may vary, you may need extra preprocessing steps to get the .csv/.tsv files.
  2. Read the .csv/.tsv files into python using pandas.read_csv
  3. Create an anndata object using anndata.AnnData. Count matrix goes to the X argument, cell-level covariates go to the obs argument, and gene-level covariates go to the var argument.
  4. Write the anndata object in .h5ad format using anndata.AnnData.write

Seurat in R also offers functionalities for conversion between Seurat and AnnData .h5ad files. However, I haven't personally tested these functions.

We want to provide the best support for our users. May I ask if these pointers are sufficient? Would it be helpful to add a Jupyter notebook tutorial? Would it be helpful to add an scDRS CLI function, e.g., scdrs get-h5ad, to convert .csv/.tsv files to .h5ad format? Are there other functions you would like to have?

Best, Martin

Mia1037 commented 1 year ago

Thank you very much for your detailed explanation !

Sometimes, it is difficult to replicate the results for some large scRNA-seq data, but they provide meta data of cells and the tSNE coordinate of each cell (e.g., (-6.250967, 38.407021), so can I just use the raw count matrix (X) and the meta data of the cells (obs) and gene-level covariates (var) and the tSNE coordinate of each cell (obsm)**** to generate .h5ad file instead of normalization, dimensional reduction and other standardized single-cell analysis processes.

If the above method is feasible, it is embarrassing that I am not sure how to add tSNE to the.h5ad file.

martinjzhang commented 1 year ago

Hi,

This is indeed feasible. When you use anndata.AnnData to construct the anndata object, just add the argument obsm={'X_tnse':mat_tsne}, where mat_tsne is a numpy.ndarray of shape (n_cell, 2) containing the two TSNE coordinates of all cells. Since that's exactly what scanpy.tl.tsne outputs to the anndata object, you should be able to perform the analysis as if you computed the tSNE embeddings on your own.

Of course, after you have the anndata object, you can then use anndata.AnnData.write to get the .h5ad file.

Let me know if you have more questions.

Best, Martin

Mia1037 commented 1 year ago

Hi ,

I use the raw count matrix (X) and the meta data of the cells (obs) and gene-level covariates (var) and the tSNE coordinate of each cell (obsm) to generate .h5ad instead of normalization, dimensional reduction and other standardized single-cell analysis processes, so my annData Object has no obsp('distances', 'connectivities'). When I run scdrs perform-downstream to obtain the group-level statistics, the following information is displayed:

Performing scDRS group-analysis connectivities not found in adata.obsp; run sc.pp.neighbors first Finish group-analysis for SCZ (sys_time=32.6s)

When I ran the sample data(Zeisel & Muñoz-Manchado), I found that whether the anndata object has obsp('distances ', 'connectivities') or not has no effect on the cell score and assoc_mcp, but hetero_mcp results are different.

For my data , I would like to ask whether the effect of obsp('distances', 'connectivities') on hetero_mcp can be ignored?

KangchengHou commented 1 year ago

Calculation of hetero_mcp directly depend on the connectivities information. The displayed information indicated that scDRS found that there are no connectivities information in your adata, and is computing that from scratch.

So I think results of hetero_mcp should also make sense. And it is correct that cell score and assoc_mcp do not depend on connectivities