lstankiewicz15 opened this issue 2 years ago (status: Open)
Hi, I will check.
Any update? I have the same issue with scanpy==1.8.1
The tutorial datasets also have different numbers of rows and ingest works there, so that can't be the problem.
To try to understand what's going on, I'm sharing here a comparison of the inputs being provided to the ingest function:
In the Tutorial, this is adata_ref:
2638 rows × 4 columns
And this is adata:
700 rows × 8 columns
Both tutorial adatas after a successful ingest:
(AnnData object with n_obs × n_vars = 700 × 208
obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain'
var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
obsm: 'X_pca', 'X_umap', 'rep'
varm: 'PCs'
obsp: 'distances', 'connectivities',
AnnData object with n_obs × n_vars = 2638 × 208
obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
var: 'n_cells'
uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
varm: 'PCs'
obsp: 'distances', 'connectivities')
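For reference, the successful tutorial result above comes from an ingest call roughly like this (a sketch of the tutorial workflow, not the exact notebook code):

```python
import scanpy as sc

# Tutorial-style ingest: transfer the reference 'louvain' labels and embeddings
# onto the query object. Assumes adata and adata_ref share var_names and that
# PCA/neighbors/UMAP were already computed on adata_ref.
sc.tl.ingest(adata, adata_ref, obs='louvain')
```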
Now my data, adata_ref:
8439 rows × 2 columns
and my adata that I wish to ingest:
|       | louvain |
| ----- | ------- |
| cell1 | 0       |
| cell2 | 0       |
| ...   | ...     |

Both my adata files have the same 40 variables and PCA/UMAP embeddings; they look like this:
(AnnData object with n_obs × n_vars = 8989 × 40
obs: 'louvain'
uns: 'pca', 'neighbors', 'umap', 'louvain', 'louvain_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'distances', 'connectivities',
AnnData object with n_obs × n_vars = 8439 × 40
obs: 'celltype', 'louvain'
uns: 'pca', 'neighbors', 'umap', 'louvain', 'louvain_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'distances', 'connectivities')
I suspect the error stems from the nearest-neighbours computation. Or maybe my number of variables (40) is too small?
This is the origin of the error:
Some of the cells have no neighbours in adata_ref.obsp['distances'].tolil().rows:
array([list([223, 280, 316, 5791]), list([3877, 5899, 7766, 7807]),
list([165, 304, 423, 713]), ..., list([]),
list([94, 865, 7077, 7666]), list([])], dtype=object)
(The maximum of 4 elements per list comes from having run sc.pp.neighbors(adata_ref, n_neighbors=5).)
The above array cannot be stacked with np.stack because the sublists have different lengths, which is exactly what triggers the "all input arrays must have the same shape" error.
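A quick way to see how widespread the problem is before deciding on a fix (a diagnostic sketch, assuming adata_ref already carries its neighbors graph):

```python
import numpy as np

# Inspect the kNN distance graph: how many neighbours did each cell get?
rows = adata_ref.obsp['distances'].tolil().rows
lengths = np.array([len(r) for r in rows])

# With sc.pp.neighbors(n_neighbors=5), every row should contain 4 entries
print(np.unique(lengths, return_counts=True))
print("cells with no neighbours:", np.where(lengths == 0)[0])
```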
A potential solution might be to filter out the cells without neighbours, though this is suboptimal. I tried it, but after filtering other rows end up empty; only after repeating the filtering a second time does it work:
import numpy as np
import scanpy as sc

DEFINED_NEIGHB_NUM = 5
sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref, n_neighbors=DEFINED_NEIGHB_NUM)
sc.tl.umap(adata_ref)
# boolean mask: True for cells whose neighbour list has the expected length
b = np.array(list(map(len, adata_ref.obsp['distances'].tolil().rows))) == DEFINED_NEIGHB_NUM - 1
adata_ref = adata_ref[b]
A better solution would be to correct the nearest-neighbour assignment so that it doesn't create empty distance lists. Did I understand this correctly?
Ok, I identified the root cause of the neighbor error and reported it as a bug:
https://github.com/scverse/scanpy/issues/2244
This probably happens because you have duplicated rows or cells with almost identical gene counts.
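If duplicated profiles are indeed the cause, something like the following can flag them before sc.pp.neighbors is run (a sketch; the densification and the optional deduplication line are assumptions, not scanpy functionality):

```python
import numpy as np
import scipy.sparse as sp

# Densify only for this check; fine for a few thousand cells and 40 variables
X = adata_ref.X.toarray() if sp.issparse(adata_ref.X) else np.asarray(adata_ref.X)

# Index of the first occurrence of each unique expression profile
_, unique_idx = np.unique(X, axis=0, return_index=True)
print(f"{adata_ref.n_obs - unique_idx.size} cells duplicate another cell's profile")

# Optionally keep one representative per profile before building the graph
# adata_ref = adata_ref[np.sort(unique_idx)].copy()
```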
Hi, I have the same issue with Python 3.8.8 and scanpy==1.8.1. I fixed it by using the adata_ref object directly after sc.pp.neighbors, with no filtering steps between sc.pp.neighbors and ingest. I hope this helps!
I found that if I extract a subset of cells from adata_ref, adata_ref.obsp['distances'] changes. When I used an object that had been filtered after sc.pp.neighbors, the lengths of the items in adata_ref.obsp['distances'].tolil().rows ranged from 34 to 41.
adata_ref = sc.read_h5ad('file_filteration_after_neighbors')  # number of cells: 11262
pd.Series(map(len, adata_ref.obsp['distances'].tolil().rows)).value_counts()

41    104921
40      5188
39       801
38       229
37        88
36        23
35        11
34         1
dtype: int64
When I used the object directly after sc.pp.neighbors, the lengths of the items in adata_ref.obsp['distances'].tolil().rows were all 41.
adata_ref = sc.read_h5ad('file_no_filteration_after_neighbors')  # number of cells: 11646
pd.Series(map(len, adata_ref.obsp['distances'].tolil().rows)).value_counts()

41    111646
dtype: int64
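In other words, the safe ordering seems to be: do all filtering first, then build the graph that ingest will rely on, and don't subset the reference afterwards. A minimal sketch of that ordering (the filter itself is only a placeholder example):

```python
import scanpy as sc

# 1. Any subsetting of the reference happens BEFORE the neighbors graph
adata_ref = adata_ref[adata_ref.obs['louvain'].notna()].copy()  # placeholder filter

# 2. Build PCA, neighbors and UMAP on the final reference object
sc.pp.pca(adata_ref)
sc.pp.neighbors(adata_ref, n_neighbors=5)
sc.tl.umap(adata_ref)

# 3. Ingest the query without modifying adata_ref in between
sc.tl.ingest(adata, adata_ref, obs='louvain')
```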
This error message was hard to debug! Indeed it is due to the behavior of sc.pp.neighbors: cells are sometimes given different numbers of neighbors. Sometimes the errant cells have a number of neighbors greater than zero (unlike in #2244, where some cells had zero neighbors).
My fix builds on the one above and was:
b = np.array(list(map(len, adata_ref.obsp['distances'].tolil().rows)))  # number of neighbors of each cell
adata_ref_subset = adata_ref[np.where(b == DEFINED_NEIGHB_NUM - 1)[0]]  # select only cells with the right number
sc.pp.neighbors(adata_ref_subset, n_neighbors=DEFINED_NEIGHB_NUM)  # rebuild the neighbor graph
@ViriatoII Your comment helped me fix things, but your fix itself didn't work for me.
First, when I use adata_ref = adata_ref[b], with b defined as above, it interprets b not as a boolean mask but as an index, returning a single cell duplicated over and over. I'm not sure what the intended behavior is here. My solution is to use adata_ref[np.where(b == n_neigh-1)[0]].
However, this subselection changes the number of neighbors that other cells have. For example, for my data,
b = np.array(list(map(len, adata_ref.obsp['distances'].tolil().rows)))
print("Before subselecting", np.unique(b, return_counts=True))
adata_ref_subset = adata_ref[np.where(b == DEFINED_NEIGHB_NUM - 1)[0]]
c = np.array(list(map(len, adata_ref_subset.obsp['distances'].tolil().rows)))
print("After subselecting", np.unique(c, return_counts=True))
yields
Before subselecting (array([13, 14]), array([     28, 1161359]))
After subselecting (array([10, 11, 12, 13, 14]), array([     15,       1,     633,      46, 1160664]))
To solve both problems I needed to rebuild the neighbor graph.
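Put together, the workaround ends up looking roughly like this (a sketch combining the snippets above; the .copy() is added so the graph is rebuilt on a full object rather than a view, and 'louvain' stands in for whatever label column is being transferred):

```python
import numpy as np
import scanpy as sc

DEFINED_NEIGHB_NUM = 5

# Keep only cells whose neighbour list has the expected length
b = np.array(list(map(len, adata_ref.obsp['distances'].tolil().rows)))
adata_ref_subset = adata_ref[np.where(b == DEFINED_NEIGHB_NUM - 1)[0]].copy()

# Rebuild the graph and embedding on the subset, since subsetting invalidates them
sc.pp.neighbors(adata_ref_subset, n_neighbors=DEFINED_NEIGHB_NUM)
sc.tl.umap(adata_ref_subset)

# Ingest against the rebuilt reference
sc.tl.ingest(adata, adata_ref_subset, obs='louvain')
```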
Hi all, did anyone find a version with this bug fixed? Thanks.
I am trying to ingest a CITE-seq dataset into another clustered dataset. These datasets have different numbers of cells, but I ran neighbors(n_neighbors=30) for both prior to running umap. I have confirmed that both datasets have the same variable names and the same number of variables (38). Both objects look identical when I call adata.var.
I receive the error: "all input arrays must have the same shape".
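Given the discussion above, it may be worth checking whether either object's neighbour graph is ragged before calling ingest (a diagnostic sketch, not part of scanpy; object names are placeholders):

```python
import numpy as np

# Check both objects for cells with unexpected neighbour-list lengths
for name, ad in [("adata_ref", adata_ref), ("adata", adata)]:
    lengths = np.array([len(r) for r in ad.obsp['distances'].tolil().rows])
    print(name, np.unique(lengths, return_counts=True))
```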
Versions
scanpy==1.7.0 anndata==0.7.6 umap==0.5.1 numpy==1.21.1 scipy==1.7.0 pandas==1.2.5 scikit-learn==0.24.2 statsmodels==0.12.2 python-igraph==0.9.8