theislab / scvelo

RNA Velocity generalized through dynamical modeling
https://scvelo.org
BSD 3-Clause "New" or "Revised" License

Question about combining AnnData objects for use in scVelo #1170

Closed jwalewski closed 6 months ago

jwalewski commented 8 months ago

Hello,

I am trying to combine two AnnData objects, one exported from Seurat (which contains the UMAP embedding and scRNA-seq clustering) and one from velocyto (which contains the count matrices), so that I can perform RNA velocity analysis on cells while also knowing their IDs.

I have been trying this for a little while and wanted some advice on how to proceed.

Here's the underlying structure of the Seurat AnnData:

This is adata_seurat:  AnnData object with n_obs × n_vars = 5355 × 2000
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'log10GenesPerUMI', 'mitoRatio', 'nUMI', 'nGene', 'S.Score', 'G2M.Score', 'Phase', 'mitoFr', 'RNA_snn_res.0.4', 'RNA_snn_res.0.6', 'RNA_snn_res.0.8', 'RNA_snn_res.1', 'RNA_snn_res.1.4', 'seurat_clusters'
    var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable'
    uns: 'neighbors'
    obsm: 'X_pca', 'X_tsne', 'X_umap'
    varm: 'PCs'
    obsp: 'distances'

And here is the underlying structure of the velocyto AnnData:

This is adata_velocyto:  AnnData object with n_obs × n_vars = 5400 × 60668
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced'

Lastly, here's the layout of the example Pancreas dataset for the RNA velocity tutorial:

AnnData object with n_obs × n_vars = 2531 × 27998
    obs: 'day', 'proliferation', 'G2M_score', 'S_score', 'phase', 'clusters_coarse', 'clusters', 'clusters_fine', 'louvain_Alpha', 'louvain_Beta', 'palantir_pseudotime'
    var: 'highly_variable_genes'
    uns: 'clusters_colors', 'clusters_fine_colors', 'day_colors', 'louvain_Alpha_colors', 'louvain_Beta_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    layers: 'spliced', 'unspliced'
    obsp: 'connectivities', 'distances'

To my understanding, it then seems that I need an object that contains:

Do I need other features?

Additionally, what would be the best way to merge the objects? 45 cells appear not to have passed our QC in Seurat, so I imagine I need to subset the velocyto object down to the cells that did.
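
A minimal sketch of that subsetting step, using pandas only. It assumes the two exports name the same cell differently (here a hypothetical `sample1:` prefix with a trailing `x` on the velocyto side and a `-1` suffix on the Seurat side). The barcodes and the `core_barcode` helper are made up for illustration, so check the real `obs_names` of both objects before adapting it.

```python
import pandas as pd

# Hypothetical cell names; the real formats depend on how each object was exported.
seurat_cells = pd.Index([
    "AAACCTGAGAAACCAT-1",
    "AAACCTGAGAAACGCC-1",
    "AAACCTGCAGTCGATT-1",
])
velocyto_cells = pd.Index([
    "sample1:AAACCTGAGAAACCATx",
    "sample1:AAACCTGAGAAACGCCx",
    "sample1:AAACCTGCAGTCGATTx",
    "sample1:AAACGGGTCTTTACGTx",  # stands in for a cell removed by Seurat QC
])


def core_barcode(name: str) -> str:
    """Strip an assumed 'sample:' prefix and an assumed 'x' or '-1' suffix."""
    barcode = name.split(":")[-1]
    if barcode.endswith("x"):
        barcode = barcode[:-1]
    return barcode.split("-")[0]


# Bring both sets of names to a common form before comparing them.
seurat_core = seurat_cells.map(core_barcode)
velocyto_core = velocyto_cells.map(core_barcode)

# Boolean mask over the velocyto cells: True where the cell passed Seurat QC.
keep_mask = velocyto_core.isin(seurat_core)
kept_velocyto_cells = velocyto_cells[keep_mask]
dropped_velocyto_cells = velocyto_cells[~keep_mask]

print(len(kept_velocyto_cells), "kept;", len(dropped_velocyto_cells), "dropped")
```

With real data, the same mask could be used to index the velocyto AnnData along its cell axis, and the number of dropped cells can be compared against the expected 45.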

Then, from the subsetted velocyto object, should I extract its features and copy them to the Seurat object, or vice versa? I imagine it may be a little easier to copy the contents of the smaller Seurat object onto the velocyto object, especially since I assume we are working on the data in the velocyto object's .X.

However, I still don't fully understand why the number of variables differs so much between the two objects. What should I do about the variables that have no data in the Seurat object? Or do I not need to worry about this, since I may only need to add observations (and observation matrices) to the velocyto object?
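
A likely explanation, stated as an assumption: the Seurat export seems to hold only the 2,000 variable features (the `vst.*` columns in `var` point that way), while the velocyto object holds every annotated gene. Because cluster labels and embeddings live on the cell axis, they can be attached regardless of how many genes each object has. The sketch below uses made-up gene names to show how the two gene sets could be compared if a restriction to shared genes were wanted.

```python
import pandas as pd

# Hypothetical gene names standing in for the two objects' var_names.
velocyto_genes = pd.Index(["Gene1", "Gene2", "Gene3", "Gene4", "Gene5", "Gene6"])
seurat_genes = pd.Index(["Gene5", "Gene2"])  # e.g. only the variable features

# Seurat genes with no counterpart in the velocyto object (ideally none).
missing_from_velocyto = seurat_genes.difference(velocyto_genes)

# Genes present in both objects, kept in the velocyto order.
shared_genes = velocyto_genes[velocyto_genes.isin(seurat_genes)]

print("shared:", list(shared_genes))
print("missing from velocyto:", list(missing_from_velocyto))
```

Indexing an AnnData object as `adata[:, shared_genes]` would then restrict it to those genes, though that step is optional when only per-cell annotations are being transferred.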

Thank you so much for your continued help in this process. scVelo seems like a very powerful and exciting tool to work with!