scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.93k stars 604 forks source link

Ingest won't integrate datasets of different lengths #2085

Open lstankiewicz15 opened 2 years ago

lstankiewicz15 commented 2 years ago

I am trying to ingest a CITEseq dataset into another clustered dataset. These datasets have different numbers of cells but I ran neighbors(n_neighbors=30) for both prior to running umap. I have confirmed that both datasets have the same variable names and the same number of variable names (38). Both objects look identical when a call adata.var.

I receive the error: "all input arrays must have the same shape".

sc.pp.neighbors(CODEX_sub, n_neighbors=30) 
sc.tl.umap(CODEX_sub)
sc.pp.neighbors(adata_sub, n_neighbors = 30)
sc.tl.umap(adata_sub)
sc.tl.ingest(CODEX_sub, adata_sub, obs='leiden', embedding_method='umap')
ValueError                                Traceback (most recent call last)
<ipython-input-214-01a03312d3df> in <module>
----> 1 sc.tl.ingest(CODEX_sub, adata_sub, obs='leiden', embedding_method='umap')

~\anaconda3\envs\scenv\lib\site-packages\scanpy\tools\_ingest.py in ingest(adata, adata_ref, obs, embedding_method, labeling_method, neighbors_key, inplace, **kwargs)
    124         labeling_method = labeling_method * len(obs)
    125 
--> 126     ing = Ingest(adata_ref, neighbors_key)
    127     ing.fit(adata)
    128 

~\anaconda3\envs\scenv\lib\site-packages\scanpy\tools\_ingest.py in __init__(self, adata, neighbors_key)
    383 
    384         if neighbors_key in adata.uns:
--> 385             self._init_neighbors(adata, neighbors_key)
    386         else:
    387             raise ValueError(

~\anaconda3\envs\scenv\lib\site-packages\scanpy\tools\_ingest.py in _init_neighbors(self, adata, neighbors_key)
    349         else:
    350             self._neigh_random_state = neighbors['params'].get('random_state', 0)
--> 351             self._init_pynndescent(neighbors['distances'])
    352 
    353     def _init_pca(self, adata):

~\anaconda3\envs\scenv\lib\site-packages\scanpy\tools\_ingest.py in _init_pynndescent(self, distances)
    284 
    285         first_col = np.arange(distances.shape[0])[:, None]
--> 286         init_indices = np.hstack((first_col, np.stack(distances.tolil().rows)))
    287 
    288         self._nnd_idx = NNDescent(

<__array_function__ internals> in stack(*args, **kwargs)

~\anaconda3\envs\scenv\lib\site-packages\numpy\core\shape_base.py in stack(arrays, axis, out)
    424     shapes = {arr.shape for arr in arrays}
    425     if len(shapes) != 1:
--> 426         raise ValueError('all input arrays must have the same shape')
    427 
    428     result_ndim = arrays[0].ndim + 1

ValueError: all input arrays must have the same shape

Versions

scanpy==1.7.0 anndata==0.7.6 umap==0.5.1 numpy==1.21.1 scipy==1.7.0 pandas==1.2.5 scikit-learn==0.24.2 statsmodels==0.12.2 python-igraph==0.9.8

Koncopd commented 2 years ago

Hi, i will check.

ViriatoII commented 2 years ago

Any update? I have the same issue with scanpy==1.8.1

The tutorial data has different row lengths and it works, so that can't be the problem

ViriatoII commented 2 years ago

To try to understand what's going on, I'm sharing here a comparison of the inputs being provided to the ingest function:

In the Tutorial, this is adata_ref:

And this is adata:

Both tutorial adatas after a successful ingest:


(AnnData object with n_obs × n_vars = 700 × 208
     obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain'
     var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
     uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
     obsm: 'X_pca', 'X_umap', 'rep'
     varm: 'PCs'
     obsp: 'distances', 'connectivities',
 AnnData object with n_obs × n_vars = 2638 × 208
     obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
     var: 'n_cells'
     uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
     obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
     varm: 'PCs'
     obsp: 'distances', 'connectivities')

Now my data, adata_ref:

and my adata that I wish to ingest:

  | louvain -- | -- cell1 | 0 cell2 | 0 ...

Both my adata files have the same 40 variables and pca/umaps, they look like this:

(AnnData object with n_obs × n_vars = 8989 × 40
     obs: 'louvain'
     uns: 'pca', 'neighbors', 'umap', 'louvain', 'louvain_colors'
     obsm: 'X_pca', 'X_umap'
     varm: 'PCs'
     obsp: 'distances', 'connectivities',
 AnnData object with n_obs × n_vars = 8439 × 40
     obs: 'celltype', 'louvain'
     uns: 'pca', 'neighbors', 'umap', 'louvain', 'louvain_colors'
     obsm: 'X_pca', 'X_umap'
     varm: 'PCs'
     obsp: 'distances', 'connectivities')

I suspect the error stems from the Nearest Neighbours. Or maybe my number of variables (40) is too small?

ViriatoII commented 2 years ago

This is the origin of the error:

Some of the cells don't have neighbours at the variable adata_ref.obsp['distances'].tolil().rows:

array([list([223, 280, 316, 5791]), list([3877, 5899, 7766, 7807]),
       list([165, 304, 423, 713]), ..., list([]),
       list([94, 865, 7077, 7666]), list([])], dtype=object)
## (the maximum 4 elements of each list comes from having run sc.pp.neighbors(adata_ref, n_neighbors = 5)) ##

The above array is impossible to stack with np.stack due to the sublists having different lengths

A potential solution might be filtering out those cells without neighbours, though this is suboptimal. I have tried it and new rows remain empty. Only after repeating it a second time, it works:

DEFINED_NEIGHB_NUM =5
sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref, n_neighbors = DEFINED_NEIGHB_NUM )
sc.tl.umap(adata_ref)

b = np.array(list(map(len,adata_ref.obsp['distances'].tolil().rows))) == DEFINED_NEIGHB_NUM -1
adata_ref = adata_ref[b]

A better solution would be correcting the Nearest Neighbour assignment so that it doesn't create empty distance lists. Did I understand this correctly?

ViriatoII commented 2 years ago

Ok, I identified the root cause of the neighbor error and reported it as a bug:

https://github.com/scverse/scanpy/issues/2244

This probably happens because you have duplicated rows or cells with almost identical gene counts

zgyaru commented 2 years ago

Hi, I have the same issue with python 3.8.8, scanpy==1.8.1. I have fixed it by using the adata_ref object directly after the sc.pp.neighbors, and there were no filtration procedures after sc.pp.neighbors. Wish it will help!

I found that if I extract a subset of cells from adata_ref, the adata_ref.obsp['distances'] will be changed.

When I used the object that through filtering after sc.pp.neighbors, the length of items in adata_ref.obsp['distances'].tolil().rows were range from 34 to 41.

adata_ref = sc.read_h5ad('file_filteration_after_neighbors')    ## number of cells: 11262
pd.Series(map(len,adata_ref.obsp['distances'].tolil().rows)).value_counts()

41 104921 40 5188 39 801 38 229 37 88 36 23 35 11 34 1 dtype: int64`

When I used the object that directly after sc.pp.neighbors, the length of items in adata_ref.obsp['distances'].tolil().rows were all same as 41.

adata_ref = sc.read_h5ad('file_no_filteration_after_neighbors')  ## number of cells: 11646
pd.Series(map(len,adata_ref.obsp['distances'].tolil().rows)).value_counts()

41 111646 dtype: int64`

aribenjamin commented 1 year ago

This error message was hard to debug! Indeed it is due to the behavior of sc.pp.neighbors. Cells are sometimes given different numbers of neighbors. Sometimes that the errant cells have a number of neighbors greater than zero (unlike as seen in #2244, where the # of neighbors of some cells was 0).

My fix builds on the one above and was:

b = np.array(list(map(len, adata_ref.obsp['distances'].tolil().rows))) # number of neighbors of each cell
adata_ref_subset = adata_ref[np.where(b == DEFINED_NEIGHB_NUM-1)[0]] # select only those with the right number
sc.pp.neighbors(adata_ref_subset, DEFINED_NEIGHB_NUM) # rebuild the neighbor graph

@ViriatoII Your comment helped me fix things – but your fix itself didn't work for me.

First, when I use adata_ref = adata_ref[b], with b defined as above, it interprets b not as a boolean mask but as an index, returning a single cell duplicated over and over. I'm not sure what the intended behavior is here. My solution is to use adata_ref[np.where(b == n_neigh-1)[0]].

However, this subselection changes the number of neighbors that other cells have. For example, for my data,

b = np.array(list(map(len,adata_ref.obsp['distances'].tolil().rows)))
print("Before subselecting", np.unique(b, return_counts = True))

adata_ref_subset = adata_ref[np.where(b == DEFINED_NEIGHB_NUM-1)[0]]
c =  np.array(list(map(len,adata_ref_subset.obsp['distances'].tolil().rows)))
print("After subselecting", np.unique(c, return_counts = True))

yields

Before subselecting, (array([13, 14]), array([     28, 1161359]))
After subselecting, (array([10, 11, 12, 13, 14]),
  array([     15,       1,     633,      46, 1160664])))

To solve both problems I needed to rebuild the neighbor graph.

HelloWorldLTY commented 1 year ago

Hi all, did anyone find version with this bug fixed? Thanks.