scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License
578 stars 154 forks source link

Bug: Contradictory information on the appropriate shape of data #1251

Closed jwalewski closed 11 months ago

jwalewski commented 11 months ago

Please make sure these conditions are met

Report

Hello,

My apologies if this is a trivial question, but I am a new anndata user attempting to combine two anndata objects (one generated in seurat, and another with velocyto). While doing this, I am attempting to copy UMAP coordinates from the Seurat object to the velocyto object, and I keep getting an error that seemingly contradicts itself on the shape of the data I should be inputting.

Also, I've listed the versions of everything. I tried to install the most recent versions, but I cannot install them on the cluster I am working on (pip can't find it).

Please let me know if you have any further questions.

All the best,

Joe

Code:

#!/bin/bash

#SBATCH -p scavenge --mem=240G
#SBATCH -c 20
#SBATCH -t 24:00:00
#SBATCH --mail-type=ALL
##SBATCH --mail-user=[my user]
ml Python/3.8.6-GCCcore-10.2.0
python3.8 -c '
# ------- 1. installing packages common to all vignettes -------
print("Step 2: cellrank preprocessing has begun")
import numpy as np
import sys
import pandas as pd
import cellrank as cr
import scvelo as scv
import scanpy as sc
import warnings
import re
#from cellrank.tl.kernels import Kernel
warnings.simplefilter("ignore", category=UserWarning)

## ------- 2. Configuring basic packages -------
scv.settings.verbosity = 3
sc.settings.set_figure_params("scvelo") #these may change as we wish
cr.settings.verbosity = 2
warnings.simplefilter("ignore", category=UserWarning)

# ------- 3. read input data from the h5ad file generated by seurat and the loom file generated from velocyto  -------
seurat_output_path="/gpfs/gibbs/pi/hafler/jw2894/Data/RNA_Velocity_Pipeline/H5AD_Files/"        
seurat_output_name="BLOOD_TEST.h5ad"                                       #Aagin, this will soon change to be a shell argument (so I can pass it in via $files from find)
seurat_output_path+=seurat_output_name
velocyto_output_path="/gpfs/gibbs/project/hafler/jw2894/200427_Blood_GEX/velocyto/"
velocyto_output_name="200427_Blood_GEX.loom"
velocyto_output_path+=velocyto_output_name
adata_seurat = scv.read(seurat_output_path, cache=True) #This is the path to the Seurat clustering output
adata_velocyto = scv.read(velocyto_output_path, cache=True) #This is the path to the Count matricies generated by Velocyto
# Loading in test data & confirming
#scv.pl.proportions(adata)                      #Commented out for now (12/5) since it is causing a crash
print("This is adata_seurat: ", adata_seurat)
print("This is adata_velocyto: ", adata_velocyto)

# -------- 4. Merging together the anndata objects --------

#Part one: merging together the obseration layers

#This step is necessary as the seurat anndata object has already had QC measures to filter out low quality cells, while the velocyto anndata object has not. 
#Therefore, the first step in this process is subsetting out the cells in common (IE the ones that have passed seurats quality test and have count matrix data)
#Example name instance for seurat cell: AAACCTGAGCAGGCTA-1
#Example name instance for velocyto cell: 200427_Blood_GEX:AACCGCGTCAAACCGTx

#Match this via regex on seurat name to identify only A-Z characters and then parse velocyto name for matching substring
seurat_obs_names = adata_seurat.obs_names

# Define a regex pattern to match letters
letter_pattern = r"[A-Z]+"

# Use regex to extract letters
seurat_letters = [re.search(letter_pattern, name).group(0) for name in seurat_obs_names]

# #Subsetting the seurat dataset
# #Next, we need to take the ideal UMAP resolution (agreed to be .4 for now)
 umap_clusters = adata_seurat.obs["RNA_snn_res.0.4"]            

# #Its possible we may need to update UMAP coordinates
# #Create entry to store UMAP coordinates in
adata_velocyto.obsm["X_umap"] = np.array(adata_velocyto.obsm.get("X_umap", []))

# # Create a dictionary mapping observation names to UMAP coordinates
umap_dict = dict(zip(adata_seurat.obs_names, adata_seurat.obsm["X_umap"]))
# # Update UMAP coordinates in adata_velocyto
# # Choose a specific name from the Velocyto dataset
for name in matching_velocyto_obs_names:
     modified_name = re.sub(r".*:", "", name) #Remove the colon and everything before
     modified_name = re.sub(r"x", "", modified_name) #remove the lowercase x at the end
     modified_name += "-1"
     adata_velocyto.obsm["X_umap"].append(umap_dict[modified_name]) #Error is here

...

Traceback:

  File "<string>", line 122, in <module>
  File "/home/jw2894/.local/lib/python3.8/site-packages/anndata/_core/aligned_mapping.py", line 181, in __setitem__
    value = self._validate_value(value, key)
  File "/home/jw2894/.local/lib/python3.8/site-packages/anndata/_core/aligned_mapping.py", line 245, in _validate_value
    return super()._validate_value(val, key)
  File "/home/jw2894/.local/lib/python3.8/site-packages/anndata/_core/aligned_mapping.py", line 77, in _validate_value
    raise ValueError(
ValueError: Value passed for key 'X_umap' is of incorrect shape. Values of obsm must match dimensions (0,) of parent. Value had shape (0,) while it should have had (5400,).

Versions

>>> import anndata, session_info; session_info.show(html=False, dependencies=True)
-----
anndata             0.9.2
pip                 20.2.3
session_info        1.0.0
-----
cython_runtime      NA
dateutil            2.8.2
h5py                3.10.0
mpl_toolkits        NA
natsort             8.4.0
numpy               1.24.4
packaging           20.4
pandas              2.0.3
pkg_resources       NA
pytz                2020.1
scipy               1.10.1
setuptools_scm      NA
sitecustomize       NA
six                 1.15.0
sphinxcontrib       NA
-----
Python 3.8.6 (default, Apr 21 2023, 14:59:47) [GCC 10.2.0]
Linux-4.18.0-477.27.1.el8_8.x86_64-x86_64-with-glibc2.2.5
-----
Session information updated at 2023-12-07 17:05
flying-sheep commented 11 months ago

Right, this could be phrased better. “must match dimensions (0,)” refers to the axis with index 0, i.e. shape[0].

jwalewski commented 11 months ago

Hello,

I really appreciate you taking the time to update and test everything so rigorously. I just installed the main branch (via pip install git+https://github.com/scverse/anndata.git@main#egg=anndata) and I am still running into the same issue. Is this intended for now? Is there some way I could rewrite my code to make it compatible?

Many thanks for your help!