martinjzhang / scDRS

Single-cell disease relevance score (scDRS)
https://martinjzhang.github.io/scDRS/
MIT License
98 stars 11 forks source link

Convert Seurat object to h5ad file #59

Open uqzqiao opened 1 year ago

uqzqiao commented 1 year ago

Hi!

Our scRNA-seq data was processed using the Seurat package. In order to run scDRS with our scRNA-seq data, I first converted the Seurat object saved as an RDS file to an h5ad file using the following script,


seurat_counts <- seurat_obj[["RNA"]]@counts
  seurat_metadata <- seurat_obj@meta.data
  rownames(seurat_metadata) <- colnames(seurat_counts)

  adata <- anndata$AnnData(
    X = as.matrix(t(as.matrix(seurat_counts))),
    obs = as.data.frame(seurat_metadata)
  )
 return(adata)
}

adata <- seurat_to_anndata(df)
write_h5ad(adata, "raw.h5ad")

However, when I tried to use this h5ad file for the compute-score step, I encountered the following error,

Traceback (most recent call last):
  File "$HOME/miniconda3/envs/scRNA-seq/bin/scdrs", line 740, in <module>
    fire.Fire()
  File "$HOME/miniconda3/envs/scRNA-seq/lib/python3.11/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "$HOME/miniconda3/envs/scRNA-seq/lib/python3.11/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "$HOME/miniconda3/envs/scRNA-seq/lib/python3.11/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "$HOME/miniconda3/envs/scRNA-seq/bin/scdrs", line 211, in compute_score
    scdrs.preprocess(
  File "~/softwares/scDRS/scdrs/pp.py", line 200, in preprocess
    v_resid = reg_out(np.ones(n_cell), df_cov.values)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/softwares/scDRS/scdrs/pp.py", line 453, in reg_out
    mat_coef = np.linalg.solve(mat_xtx, mat_xty)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<__array_function__ internals>", line 200, in solve
  File "$HOME/miniconda3/envs/scRNA-seq/lib/python3.11/site-packages/numpy/linalg/linalg.py", line 386, in solve
    r = gufunc(a, b, signature=signature, extobj=extobj)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
numpy.core._exceptions._UFuncInputCastingError: Cannot cast ufunc 'solve' input 0 from dtype('O') to dtype('float64') with casting rule 'same_kind'

I'm more experienced with R than Python, so I was hoping you could help me suggest a better way to convert the Seurat object to an h5ad file. Thanks so much!

uqzqiao commented 1 year ago

Update:

I've attempted to transform the Seurat object to an h5ad file using another R package, but unfortunately encountered the same error. This leads me to believe that the problem may be due to the existence of missing values (NA) in the expression data. However, after checking the data, I found that there are no NAs. Can you please provide some suggestions on other potential issues that could be causing this problem? Thank you very much!

# Load packages
library(Seurat)
library(SeuratData)
library(SeuratDisk)

# df is the seurat object
test=UpdateSeuratObject(df)
SaveH5Seurat(test, filename="raw.h5Seurat")
Convert("raw.h5Seurat", dest="h5ad")
martinjzhang commented 1 year ago

Hi @uqzqiao

Thank you for reporting the issue. It seems your .h5ad file is fine and the error is with the preprocessing step within scdrs. The problem might be that your --cov file has a different data type, hence raising the Cannot cast ufunc 'solve' error.

I couldn't locate the exact issue. Being more careful with input data types on your side or forcing the data to have the same type on the software side may solve the problem.

If you can provide us with a small set of input files that can reproduce this error, I can look into the issue and get back to you in a few days.

Best, Martin

uqzqiao commented 1 year ago

Hi @martinjzhang,

Thank you so much for your response! I really appreciate your help! I noticed that the error I encountered was due to NAs in the covariate file, which was an oversight in my preparation process. I am sorry for any inconvenience this may have caused.

I do have a couple more questions regarding the software (I apologize for posting my additional questions in this thread, as they may be better suited for a separate discussion),

Thank you again for your time and help!

Best regards, Zhen

uqzqiao commented 1 year ago

Hi Martin,

I apologize for asking so many questions, but I'm looking for some advice on how to proceed if no cells are significant at certain FDR thresholds (0.1 or 0.2). I'm aware that FDR correction may not effectively retain true positives if there are a large number of tests needed to be corrected for (e.g., I got 1.26m cells). In this case, would it be appropriate to compute a score for each cell type to identify any cells associated with disease? Additionally, if no cells are significant at certain FDR thresholds, would it still be worthwhile to proceed with downstream analysis? Thank you again for your help!

Best wishes, Zhen

martinjzhang commented 1 year ago

Hi @uqzqiao

Great that you can successfully run the software. Please see my suggestions below.

Firstly, I came across a closed issue where you mentioned that you removed the MHC region (CHR6 25-34MB) and restricted the summary statistics to the HAPMAP3 SNPs for the manuscript. I am wondering about the impact of this step on the results, since the disease I am studying may have disease-associated genes in the HLA region. Additionally, I would like to know if restricting to HapMap3 SNPs is for computational benefits or if it affects the accuracy of generating disease-specific gene sets.

The MHC region has a complex LD structure. That's why we excluded this region. scDRS is compatible with any disease gene set. So if you can curate a set of disease genes using other software, that would work too.

If you focus on signals of common SNPs (e.g., >5%), HapMap3 SNPs should capture most of the GWAS information. I am not sure about rare SNPs. It may be helpful to include other SNPs if you care about rare variant signals.

Secondly, our preliminary analysis suggests that adjusting for pool can better account for batch effect. Thus, I plan to include pool as a categorical variable in the covariate file. However, we have over 70 pools from 2 batches. Would you recommend adjusting for pool or batch in this case?

You can encode these covariates as dummy variables.

I apologize for asking so many questions, but I'm looking for some advice on how to proceed if no cells are significant at certain FDR thresholds (0.1 or 0.2). I'm aware that FDR correction may not effectively retain true positives if there are a large number of tests needed to be corrected for (e.g., I got 1.26m cells). In this case, would it be appropriate to compute a score for each cell type to identify any cells associated with disease? Additionally, if no cells are significant at certain FDR thresholds, would it still be worthwhile to proceed with downstream analysis? Thank you again for your help!

Even if no cells are significant individually, they may yield significant results in the group-level analysis. I suggest creating a UMAP visualization with color representing the normalized scDRS disease score. If you can notice any pattern, such as certain groups of cells having much greater scDRS disease scores, then it indicates that scDRS is likely capturing some disease signals.

Please feel free to let us know if you have more questions.

Best, Martin

uqzqiao commented 1 year ago

Hi Martin,

Thanks so much for your reply! It has been incredibly helpful!! I've got some interesting results and I'm currently working on understanding and interpreting them. I have a couple more questions:

Could you please explain the methods used to correct for multiple testing (regarding the individual cell disease relevance scores)? If it is BH, would it be appropriate for me to recalculate the number of significant cells with alternative methods, using the scDRS p-values (pval) from the score file?

In the downstream analyses (--corr-analysis) at the individual cell level, would it still be beneficial to perform this analysis if the number of significant cells is limited? Additionally, if I only have individual cell-level variables for a subset of all cells (with the rest being NA), will it still be possible to calculate the correlation parameter (na.rm=T)?

I am particularly interested in the prioritized disease-relevant genes obtained from the --gene-analysis. However, I only have data for one specific trait/disease, and it is challenging to curate a reference gene set. Do you have any suggestions on how to interpret the prioritized genes in this context?

Thank you very much for your help!

Best regards, Zhen

martinjzhang commented 1 year ago

Could you please explain the methods used to correct for multiple testing (regarding the individual cell disease relevance scores)? If it is BH, would it be appropriate for me to recalculate the number of significant cells with alternative methods, using the scDRS p-values (pval) from the score file?

Yes, we used BH. The pval in the score files are raw p-values. You can apply any multiple testing methods on them.

In the downstream analyses (--corr-analysis) at the individual cell level, would it still be beneficial to perform this analysis if the number of significant cells is limited?

Yes. Similar to the group-level analysis, the correlation analysis may be more powerful than the individual cell-level association analysis by pooling information across cells.

Additionally, if I only have individual cell-level variables for a subset of all cells (with the rest being NA), will it still be possible to calculate the correlation parameter (na.rm=T)?

scDRS may give you NA results because it computes the correlation using all overlapping cells between adata and the score file. One way to get around this is to create smaller score files restricted to only cells with the cell-level variable values.

I am particularly interested in the prioritized disease-relevant genes obtained from the --gene-analysis. However, I only have data for one specific trait/disease, and it is challenging to curate a reference gene set. Do you have any suggestions on how to interpret the prioritized genes in this context?

Have you looked at the drug targets in https://www.opentargets.org/

jjzixue commented 11 months ago

Hi,@martinjzhang I encountered difficulties while converting Seurat to the h5ad file format. Due to my limited familiarity with Python, I couldn't resolve the issue on my own. I would greatly appreciate your assistance.

example data fromzeisel_2015/expr.h5ad for the analysis. After splitting and reorganizing the expr.h5ad file into merged_data.h5ad using R and Python code, the subsequent scRDS analysis with the split file is not yielding the expected results.

##################
#R code:
##################
a1 <- ReadH5AD("expr.h5ad")

umap_coords <- Embeddings(a1, "umap")
write.csv(umap_coords, "umap_coords.csv")

seurat_object <- a1
# Convert Seurat object to SingleCellExperiment object
sce <- as.SingleCellExperiment(seurat_object)

# Export UMAP and PCA coordinates
umap_coords <- Embeddings(seurat_object, "umap")
pca_coords <- Embeddings(seurat_object, "pca")
write.csv(umap_coords, "umap_coords.csv")
write.csv(pca_coords, "pca_coords.csv")

# Export metadata
metadata <- seurat_object@meta.data
write.csv(metadata, "metadata.csv")

# Export assay data (replace "assay_name" with the name of the assay you want to export)
assay_name <- "RNA"
assay_data <- seurat_object@assays$RNA@counts  # Replace "RNA" with the correct assay name

write.csv(assay_data, paste0(assay_name, "_data.csv”))

##################
#python code:
##################
import pandas as pd
import anndata
import os
os.chdir('~/scDRS')
os.getcwd() 
# Load UMAP and PCA coordinates from CSV files
umap_coords = pd.read_csv("umap_coords.csv", index_col=0)
pca_coords = pd.read_csv("pca_coords.csv", index_col=0)
# Load metadata from CSV file
metadata = pd.read_csv("metadata.csv", index_col=0)

# Load assays data from CSV file (replace "assay_data.csv" with the actual file name)
assay_data = pd.read_csv("RNA_data.csv", index_col=0)

# Create an AnnData object
adata = anndata.AnnData(X=assay_data.values.T, obs=metadata)
adata.obsm["X_umap"] = umap_coords.values
adata.obsm["X_pca"] = pca_coords.values

# Save the AnnData object as h5ad file
output_h5ad_path = "merged_data.h5ad"
adata.write(output_h5ad_path)

print("AnnData object saved as h5ad format.”)

##################
#merged_data.h5ad
##################
scdrs compute-score \
  --h5ad-file ~/scDRS/data/merged_data.h5ad \
  --h5ad-species mouse \
  --gs-file ~/scDRS/data/processed_geneset.gs \
  --gs-species mouse \
  --cov-file ~/scDRS/data/cov.tsv \
  --flag-filter-data True \
  --flag-raw-count True \
  --flag-return-ctrl-raw-score False \
  --flag-return-ctrl-norm-score True \
  --out-folder ~/scDRS/data/
******************************************************************************
* Single-cell disease relevance score (scDRS)
* Version 1.0.3
* Martin Jinye Zhang and Kangcheng Hou
* HSPH / Broad Institute / UCLA
* MIT License
******************************************************************************
Call: scdrs compute-score \
--h5ad-file ~/scDRS/data/merged_data.h5ad \
--h5ad-species mouse \
--cov-file ~/scDRS/data/cov.tsv \
--gs-file ~/scDRS/data/processed_geneset.gs \
--gs-species mouse \
--ctrl-match-opt mean_var \
--weight-opt vs \
--adj-prop None \
--flag-filter-data True \
--flag-raw-count True \
--n-ctrl 1000 \
--min-genes 250 \
--min-cells 50 \
--flag-return-ctrl-raw-score False \
--flag-return-ctrl-norm-score True \
--out-folder ~/scDRS/data/

Loading data:
--h5ad-file loaded: n_cell=3005, n_gene=13572 (sys_time=1.7s)
First 3 cells: ['1772071015_C02', '1772071017_G12', '1772071017_A05']
First 5 genes: ['0', '1', '2', '3', '4']     ########################## different to expr.h5ad
--cov-file loaded: covariates=['const', 'n_genes'] (sys_time=1.7s)
n_cell=3005 (3005 in .h5ad)
First 3 cells: ['1772071015_C02', '1772071017_G12', '1772071017_A05']
First 5 values for 'const': [1, 1, 1, 1, 1]
First 5 values for 'n_genes': [4848, 4685, 6028, 5824, 4701]
--gs-file loaded: n_trait=3 (sys_time=1.7s)
Print info for first 3 traits:
First 3 elements for 'SCZ': [], []    ########################## different to expr.h5ad
First 3 elements for 'Dorsal': [], [] ########################## different to expr.h5ad
First 3 elements for 'Height': [], [] ########################## different to expr.h5ad

Preprocessing:

Computing scDRS score:
trait=SCZ: skipped due to small size (n_gene=0, sys_time=3.8s)
trait=Dorsal: skipped due to small size (n_gene=0, sys_time=3.8s)
trait=Height: skipped due to small size (n_gene=0, sys_time=3.8s)

However, during the scRDS analysis, only the expr.h5ad file is able to complete successfully.

##################
#expr.h5ad
##################
scdrs compute-score \
  --h5ad-file ~/scDRS/data/expr.h5ad \       
  --h5ad-species mouse \
  --gs-file ~/scDRS/data/processed_geneset.gs \
  --gs-species mouse \
  --cov-file ~/scDRS/data/cov.tsv \
  --flag-filter-data True \
  --flag-raw-count True \
  --flag-return-ctrl-raw-score False \
  --flag-return-ctrl-norm-score True \
  --out-folder ~/scDRS/data/

******************************************************************************
* Single-cell disease relevance score (scDRS)
* Version 1.0.3
* Martin Jinye Zhang and Kangcheng Hou
* HSPH / Broad Institute / UCLA
* MIT License
******************************************************************************
Call: scdrs compute-score \
--h5ad-file ~/scDRS/data/expr.h5ad \
--h5ad-species mouse \
--cov-file ~/scDRS/data/cov.tsv \
--gs-file ~/scDRS/data/processed_geneset.gs \
--gs-species mouse \
--ctrl-match-opt mean_var \
--weight-opt vs \
--adj-prop None \
--flag-filter-data True \
--flag-raw-count True \
--n-ctrl 1000 \
--min-genes 250 \
--min-cells 50 \
--flag-return-ctrl-raw-score False \
--flag-return-ctrl-norm-score True \
--out-folder ~/scDRS/data/

Loading data:
--h5ad-file loaded: n_cell=3005, n_gene=13572 (sys_time=1.9s)
First 3 cells: ['1772071015_C02', '1772071017_G12', '1772071017_A05']
First 5 genes: ['Tspan12', 'Tshz1', 'Fnbp1l', 'Adamts15', 'Cldn12']    ########################## run  normal
--cov-file loaded: covariates=['const', 'n_genes'] (sys_time=1.9s)
n_cell=3005 (3005 in .h5ad)
First 3 cells: ['1772071015_C02', '1772071017_G12', '1772071017_A05']
First 5 values for 'const': [1, 1, 1, 1, 1]
First 5 values for 'n_genes': [4848, 4685, 6028, 5824, 4701]
--gs-file loaded: n_trait=3 (sys_time=1.9s)
Print info for first 3 traits:
########################## run normal
First 3 elements for 'SCZ': ['Dpyd', 'Cacna1i', 'Rbfox1'], [7.6519, 7.4766, 7.3247]  
First 3 elements for 'Dorsal': ['Ckb', 'Fndc5', 'Lin7b'], [1.0, 1.0, 1.0]
First 3 elements for 'Height': ['Wwox', 'Gmds', 'Lpp'], [10.0, 10.0, 9.9916]

Preprocessing:

Computing scDRS score:
Computing control scores: 100%|█████████████████████████████████████| 1000/1000 [00:47<00:00, 20.95it/s]
Trait=SCZ, n_gene=756: 178/3005 FDR<0.1 cells, 320/3005 FDR<0.2 cells (sys_time=83.9s)
Computing control scores: 100%|█████████████████████████████████████| 1000/1000 [00:17<00:00, 58.77it/s]
Trait=Dorsal, n_gene=155: 549/3005 FDR<0.1 cells, 641/3005 FDR<0.2 cells (sys_time=123.1s)
Computing control scores: 100%|█████████████████████████████████████| 1000/1000 [00:37<00:00, 26.75it/s]
Trait=Height, n_gene=671: 0/3005 FDR<0.1 cells, 6/3005 FDR<0.2 cells (sys_time=184.7s)
martinjzhang commented 11 months ago

Hi @jjzixue ,

In your new data file, gene names become numbers. As a result, scDRS can not match genes in the .gs file with genes in the .h5ad file.

First 5 genes: ['0', '1', '2', '3', '4']     ########################## different to expr.h5ad

Let me know if you need further help.

jjzixue commented 11 months ago

Hi @martinjzhang , Thank you very much for your prompt response. After reading the files using R, I discovered that the gene names are missing. I'm unsure which part of the code during merging might be causing this issue. Could you help me review and identify any potential bugs in the code?

TakuroIwami commented 2 weeks ago

Hi @martinjzhang ,

Thank you for providing us with such a wonderful package. And I apologize for my lack of knowledge. I conducted snRNA-seq using multiple samples and analyzed the data using Seurat(since I am more familiar with R than python). I encountered a similar issue when converting the Seurat object file to a h5ad file. Although I would like to use the integrated assay data to avoid the batch effect, the integrated assay in Seurat didn't include count data.

So my question is whether it would be acceptable to use the raw count data for multiple samples under the same conditions? Adding batch information into a covariate file could be one of the solutions?

Sincerely, Takuro

martinjzhang commented 2 weeks ago

Hi @TakuroIwami , that's fine.

TakuroIwami commented 1 week ago

Hi @martinjzhang ,

Thank you for your response. In line with the previous question, I have one more question. I had two types of data ("data","scale.data") in my original Seurat object file. After integrating multiple samples datasets, both types of data contained negative values, which was not the case with the raw expression data. As mentioned in another thread, "scDRS compute_score" doesn't typically accept the h5ad file including negative values. Therefore, I manually added the same value to every "scale.data" matrix to ensure all data were not negative(eg; changing from mean0,var1→mean k, var1), and the analysis proceeded successfully. However, I am concerned about whether such a modification was appropriate. I would appreciate your guidance on this matter.

martinjzhang commented 6 days ago

Hi @TakuroIwami ,

The results are likely still interpretable but this is not scDRS's intended use. If you have multiple datasets, the following solution from your last msg is recommended:

So my question is whether it would be acceptable to use the raw count data for multiple samples under the same conditions? Adding batch information into a covariate file could be one of the solutions?

I suggest also performing this analysis and show the results are consistent.

Martin

TakuroIwami commented 6 days ago

Hi @martinjzhang ,

Thank you for your quick response. I performed analysis based on your comment, and the results appeared to be consistent. I appreciate your advice. I would like to confirm once again (sorry) whether it is acceptable to use integrated datasets after computing disease scores by "compute_score.py" script. This is because when creating a plot to show which cells from which clusters have high scores for example, I believe it would be better for the base UMAP plot to be calculated from integrated datasets, as batch effect correction has already been performed on them.

martinjzhang commented 4 days ago

yes, it's fine that the UMAP and scDRS disease scores are computed using different versions of datasets.

TakuroIwami commented 4 days ago

HI @martinjzhang ,

Thank you for your answer. The analysis seemed to be ok with your tremendous support. I greatly appreciate it.