martinjzhang / scDRS

Single-cell disease relevance score (scDRS)
https://martinjzhang.github.io/scDRS/
MIT License
98 stars 11 forks source link

Seurat to h5ad #44

Closed Dzhan4 closed 1 year ago

Dzhan4 commented 1 year ago

Hello,

Thank you for developing such a wonderful tool! I'm trying to use an existing scRNA dataset and converting from Seurat to h5ad. The dataset has raw data and normalized data through sctransform. When I convert the object to h5ad and run compute_score.py, I get an error:

python3 compute_score.py \
    --h5ad_file ${h5ad_file}\
    --h5ad_species human\
    --gs_file ${gs_file}\
    --gs_species human\
    --flag_filter True\
    --flag_raw_count True\
    --n_ctrl 1000\
    --flag_return_ctrl_raw_score False\
    --flag_return_ctrl_norm_score True\
    --out_folder ${out_dir}
 RuntimeWarning: invalid value encountered in log1p np.log1p(X, out=X)

I think this is because sctransform gives positive and negative values. My question is, should I just strip the Seurat object down to the raw RNA counts and make h5ad files from that? Is there a way to preserve sctransform normalized data in Seurat for downstream use? And lastly, should relevant information like cell type identity, umap embeddings, etc go into the covariates file separately?

Thanks for your help!

martinjzhang commented 1 year ago

Hi,

Thank you for the questions, and sorry for the late reply. Please see the answers below.

My question is, should I just strip the Seurat object down to the raw RNA counts and make h5ad files from that? Is there a way to preserve sctransform normalized data in Seurat for downstream use?

Our normalization procedure is very similar to sctransform so I think both are fine. If you use normalized data (e.g., by sctransform), you can set the --flag_raw_count False to prevent scDRS from normalizing the data again.

And lastly, should relevant information like cell type identity, umap embeddings, etc go into the covariates file separately?

.cov file should contain covariates that you wish to regress out, such as batch id or sequencing depth. Information like cell type identity, umap embeddings can go to adata.obs as columns. scDRS will not use the umap information. scDRS may use the cell type information in compute-score when the flag --adj_prop is specified or in perform-downstream for cell type-level association analysis.

Please let us know if you have more questions.

Dzhan4 commented 1 year ago

Thanks, this is very helpful!

Dzhan4 commented 1 year ago

Sorry to re-open this and thanks for all the responses. Setting --flag_raw_count False got rid of that initial error, but now I have a new error:

ValueError: h5ad expression matrix should not contain negative values. This is because in the preprocessing step, scDRS models the gene-level log mean-variance relationship. See scdrs.pp.compute_stats for details.

My code:

python3 ~/scDRS/bin/scdrs compute-score \
    --h5ad_file ${h5ad_file}\
    --h5ad_species human\
    --gs_file ${gs_file}\
    --gs_species human\
    --flag_filter True\
    --flag_raw_count False\
    --n_ctrl 1000\
    --flag_return_ctrl_raw_score False\
    --flag_return_ctrl_norm_score True\
    --out_folder ${out_dir}

Any ideas? This is a Seurat object converted to h5ad format which includes raw RNA counts as well as sctransform normalized data. I'm not familiar with h5ad format, unfortunately, but I used standard Seurat to h5ad conversion scripts from the creators of Seurat.

martinjzhang commented 1 year ago

Hi,

Thank you for reporting the issue. Could you provide a minimal reproducible example so that I can look further into it? For example, a smaller version of your .h5ad file along with the script that can reproduce the issue.

KangchengHou commented 1 year ago

@martinjzhang We recently add the line of https://github.com/martinjzhang/scDRS/blob/5e490d62c0f169bb9e2893e0e91c90557fd9c86e/scdrs/util.py#L74 So that no negative values is allowed in raw matrix (even if flag-raw-count is False). This is because in estimating technical variance, we first convert back to count scale using exp(X) - 1 https://github.com/martinjzhang/scDRS/blob/5e490d62c0f169bb9e2893e0e91c90557fd9c86e/scdrs/pp.py#L358

probably there is someway to avoid this step?

martinjzhang commented 1 year ago

Hi @KangchengHou , thank you for the comment. It indeed seems the issue is because @Dzhan4's normalized data contains negative values. size factor + log(X+1) transformation in Seurat should not incur negative values, but covariate adjustment will.

probably there is someway to avoid this step?

I am not sure if we can skip it entirely, because we need to convert the data back-and-forth between the original and log space to perform covariate adjustment (log space) and estimate the mean-variance relationship. However, the code can be improved. Specifically, negative values (due to covariate adjustment in the log space for lowly-expressed genes) may go into np.expm1 in the code below, which is currently handled internally in the numpy function. https://github.com/martinjzhang/scDRS/blob/7d935e70e6a9604b3ebb3b51e8835f0d53848f7c/scdrs/pp.py#L367-L373

martinjzhang commented 1 year ago

Hi @Dzhan4 ,

In light of @KangchengHou 's comment, I now change my recommendation to the following:

My question is, should I just strip the Seurat object down to the raw RNA counts and make h5ad files from that? Is there a way to preserve sctransform normalized data in Seurat for downstream use?

Our normalization procedure is very similar to sctransform. I suggest creating the .h5ad file using the raw RNA counts.

Sorry for the trouble this change may have caused you. Please let us know if you have more questions.