morris-lab / CellOracle

This is the alpha version of the CellOracle package

about the gene expression data used in the linear model fitting #31

Closed: tingchenlrx closed this issue 4 years ago

tingchenlrx commented 4 years ago

Hello, thank you so much for the great tool! In your tutorial on GRN reconstruction, you mention the following:

In this notebook, we use raw mRNA count as an input of Oracle object.

adata.X = adata.raw.X.copy()

The scRNA-seq data associated with adata was processed using scanpy. By "raw mRNA count" in adata.raw.X, you actually mean the normalized data before log transformation, while adata.X keeps the log-transformed and scaled data. Am I correct?
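For reference, the preprocessing order I am describing would look roughly like this (my sketch, not your tutorial's exact code):

```python
import scanpy as sc

# adata.raw is frozen after per-cell normalization but BEFORE log
# transformation, so adata.raw.X holds normalized, non-logged counts.
sc.pp.normalize_total(adata, target_sum=1e4)
adata.raw = adata                 # snapshot: normalized, not log-transformed
sc.pp.log1p(adata)
sc.pp.scale(adata)                # adata.X now holds log-transformed, scaled values
```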

In the paper, you mention that normalized data without log transformation should be used for model fitting. Is adata.X (along with the TF info) used to fit the linear model describing the regulatory relationships between target genes and TFs during the GRN reconstruction procedure?

I have a scRNA-seq dataset processed using Seurat. I converted the Seurat object to AnnData using the data_conversion module in CellOracle. Let's say adata_from_seurat is the AnnData from the Seurat object. Here, adata_from_seurat.raw.X keeps the raw counts (integers from the CellRanger pipeline), not normalized data, while adata_from_seurat.X keeps the log-normalized data from Seurat's NormalizeData function. Before building the GRN, should I make adata_from_seurat.X hold the normalized data without log transformation (i.e., the counts for a feature in a cell, divided by the cell's total counts, times a scale factor)? Your kind reply is much appreciated.
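In code, the normalization I have in mind would be roughly this (a sketch; normalize_counts is a hypothetical helper, not a CellOracle function):

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical helper: counts for each feature in a cell, divided by
# the cell's total counts, times a scale factor (no log transformation).
def normalize_counts(raw_counts, scale_factor=1e4):
    X = sp.csr_matrix(raw_counts, dtype=np.float64)
    totals = np.asarray(X.sum(axis=1)).ravel()      # total counts per cell
    return sp.diags(scale_factor / totals) @ X      # assumes no empty cells

adata_from_seurat.X = normalize_counts(adata_from_seurat.raw.X)
```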

KenjiKamimoto-ac commented 4 years ago

Hi, TC.

Thank you for using celloracle.

The answer to your question below is yes.

Before building the GRN, should I make adata_from_seurat.X hold the normalized data without log transformation (i.e., the counts for a feature in a cell, divided by the cell's total counts, times a scale factor)?

We made a notebook for scRNA-seq data processed with Seurat. Please try it: https://github.com/morris-lab/CellOracle/blob/master/docs/notebooks/04_Network_analysis/Network_analysis_with_Seurat_data.ipynb

tingchenlrx commented 4 years ago

Thanks, Kenji, for the quick reply. I went through your notebook on network analysis with Seurat data. In this notebook, you wrote adata.X = adata.raw.X.copy(). But adata.raw.X here keeps the raw integer counts, not normalized data. I checked your R and Python code in the data_conversion module and confirmed that adata.raw.X (after converting the Seurat object to AnnData) does keep the integer counts (i.e., obj.seurat[["RNA"]]@counts). I am wondering whether the tutorial should instead set adata.X to the normalized (non-logged) data rather than copying adata.raw.X into it.

Also, why do you use normalized data without log transformation for the linear modeling? Do you assume the normalized data is more likely to be normally distributed?

I tried running the network construction for a dataset containing 30,000 features on the cluster, and the run time was tolerable. In that case, do you still recommend subsetting the AnnData to the most variable features only (the 2,000-3,000 features you recommend)? You mentioned that subsetting the AnnData improves the accuracy of network construction, but I am not clear on this. Could you explain more about the accuracy? To my understanding, choosing fewer features (2,000-3,000) excludes possibly more TFs for a target gene, or more target genes, from the linear modeling, which would definitely save computation time.

Thanks so much for your time!

KenjiKamimoto-ac commented 4 years ago

Let me answer your questions one by one.

(1) If you normalized with Seurat::SCTransform, your adata.raw.X was derived from seurat_object@assays$SCT@counts. These data are already normalized and do not need to be normalized again.

If you used Seurat::NormalizeData, you do need normalization, as you said. Please normalize adata.raw.X, but do not log-transform it. I'm sorry that I forgot to mention this case; we usually use SCTransform.
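For example, with scanpy it would look something like this (a sketch; adjust target_sum to your scale factor):

```python
import scanpy as sc

# Restore the raw integer counts, then normalize per cell
# WITHOUT log transformation (no sc.pp.log1p call).
adata_from_seurat.X = adata_from_seurat.raw.X.copy()
sc.pp.normalize_total(adata_from_seurat, target_sum=1e4)
```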

(2) Before the linear modeling, we DO perform log normalization inside the Oracle object. When you load scRNA-seq data into the Oracle object, log normalization is applied automatically. The reason we use non-log-transformed data as input is simply to avoid applying the log transformation more than once.
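For example, data loading follows the pattern in the tutorial notebooks (a sketch; the cluster and embedding names are placeholders for your own metadata):

```python
import celloracle as co

# The Oracle object performs log normalization internally,
# so adata.X should hold normalized but NON-log-transformed counts.
oracle = co.Oracle()
oracle.import_anndata_as_raw_count(adata=adata,
                                   cluster_column_name="louvain_annot",  # placeholder
                                   embedding_name="X_draw_graph_fa")     # placeholder
```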

(3) One of the important reasons we select the top 2,000-3,000 genes is to avoid false network edges. As you know, scRNA-seq data generally contain a lot of noise and dropouts. Even when a scRNA-seq dataset includes many more than 2-3k genes, the gene expression matrix is very sparse in most cases. In the worst case, many genes consist mostly of zeros, and it is really difficult, almost impossible, to build a good model for such genes. In general, only a relatively small number of genes can be used for modeling. This is why we select features for modeling instead of using all of them. In summary, even if it might be practically possible to use all features for GRN inference, we don't think meaningful models can be built for all genes.

We can also explain the choice of the top 2,000-3,000 genes in a different way. Remember that most scRNA-seq packages, such as Seurat and Scanpy, select the top 2,000-3,000 genes before downstream analyses such as clustering and dimensionality reduction. In general, such gene selection improves the quality of the analysis.

Of course, the optimal number of genes depends on many conditions. You can use a different number, for example 1,000 or 4,000, depending on your scRNA-seq data. We think 2,000-3,000 is a good number to start with.
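In scanpy, for example, the gene selection step looks roughly like this (a sketch; the default flavor expects log-transformed data in adata.X):

```python
import scanpy as sc

# Keep only the top ~3000 highly variable genes before GRN construction.
sc.pp.highly_variable_genes(adata, n_top_genes=3000)
adata = adata[:, adata.var.highly_variable].copy()
```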

tingchenlrx commented 4 years ago

Thanks so much, Kenji, for the detailed explanation. Yes, I used Seurat::NormalizeData() to normalize the scRNA-seq data. All is clear to me now. Thanks again!