adata.concatenate() won't work properly

SuhanG17 commented 4 years ago

[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of scanpy.
[ ] (optional) I have confirmed this bug exists on the master branch of scanpy.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Comment

I was just going over the GSE92332 case study. On Section 2, the AnnData objects were created fine, but the concatenation method for AnnData object did not work out for me. I browsed through issues raised, there were issues on how to concatenate objects, but they did not seem to answer my question here.

Minimal code sample (that we can copy&paste without having any data)

# Concatenate to main adata object
  adata = adata.concatenate(adata_tmp, batch_key='sample_id')

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Versions

WARNING: If you miss a compact list, please try `print_header`! ----- anndata 0.7.4 scanpy 1.6.0 sinfo 0.3.1 ----- PIL 7.2.0 anndata 0.7.4 anndata2ri 1.0.4 attr 20.2.0 backcall 0.2.0 cairo 1.19.1 certifi 2020.06.20 cffi 1.14.1 chardet 3.0.4 cycler 0.10.0 cython_runtime NA dateutil 2.8.1 decorator 4.4.2 get_version 2.1 gprofiler 1.0.0 h5py 2.10.0 idna 2.10 igraph 0.8.2 ipykernel 5.3.4 ipython_genutils 0.2.0 ipywidgets 7.5.1 jedi 0.17.2 jinja2 2.11.2 joblib 0.16.0 jsonschema 3.2.0 kiwisolver 1.2.0 legacy_api_wrap 1.2 llvmlite 0.34.0 louvain 0.6.1 markupsafe 1.1.1 matplotlib 3.3.1 mpl_toolkits NA natsort 7.0.1 nbformat 5.0.7 numba 0.51.2 numexpr 2.7.1 numpy 1.19.1 packaging 20.4 pandas 1.1.1 parso 0.7.1 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA prometheus_client NA prompt_toolkit 3.0.7 ptyprocess 0.6.0 pvectorc NA pygments 2.6.1 pyparsing 2.4.7 pyrsistent NA pytz 2020.1 requests 2.24.0 rpy2 3.3.5 scanpy 1.6.0 scipy 1.5.2 seaborn 0.10.1 send2trash NA setuptools_scm NA sinfo 0.3.1 six 1.15.0 sklearn 0.23.2 statsmodels 0.12.0 storemagic NA tables 3.6.1 terminado 0.8.3 texttable 1.6.3 tornado 6.0.4 traitlets 4.3.3 tzlocal NA urllib3 1.25.10 wcwidth 0.2.5 yaml 5.3.1 zmq 19.0.2 ----- IPython 7.18.1 jupyter_client 6.1.7 jupyter_core 4.6.3 notebook 6.1.3 ----- Python 3.8.5 | packaged by conda-forge | (default, Aug 29 2020, 01:22:49) [GCC 7.5.0] Linux-4.13.0-36-generic-x86_64-with-glibc2.10 4 logical CPU cores, x86_64 ----- Session information updated at 2020-09-08 15:06

xie186 commented 4 years ago

Did you get an idea how to solve this problem? I encountered the same problem.

Here is the version info:

WARNING: If you miss a compact list, please try `print_header`!
-----
anndata     0.7.4
scanpy      1.6.0
sinfo       0.3.1
-----
OpenSSL             19.1.0
PIL                 7.2.0
anndata             0.7.4
appdirs             1.4.4
attr                20.2.0
backcall            0.2.0
bioservices         1.7.9
brotli              NA
bs4                 4.9.1
certifi             2020.06.20
cffi                1.14.1
chardet             3.0.4
colorama            0.4.3
colorlog            NA
cryptography        3.1
cycler              0.10.0
cython_runtime      NA
dateutil            2.8.1
decorator           4.4.2
easydev             0.10.1
get_version         2.1
gseapy              0.10.1
h5py                2.10.0
idna                2.10
ipykernel           5.3.4
ipython_genutils    0.2.0
jedi                0.15.2
jinja2              2.11.2
joblib              0.16.0
jsonschema          3.2.0
kiwisolver          1.2.0
legacy_api_wrap     1.2
llvmlite            0.34.0
lxml                4.5.2
markupsafe          1.1.1
matplotlib          3.3.2
mpl_toolkits        NA
natsort             7.0.1
nbformat            5.0.7
networkx            2.5
numba               0.51.2
numexpr             2.7.1
numpy               1.19.2
packaging           20.4
pandas              1.0.1
parso               0.5.2
pexpect             4.8.0
pickleshare         0.7.5
pkg_resources       NA
prometheus_client   NA
prompt_toolkit      3.0.7
ptyprocess          0.6.0
pvectorc            NA
pygments            2.7.0
pylab               NA
pyparsing           2.4.7
pyrsistent          NA
pytz                2020.1
requests            2.23.0
requests_cache      0.5.2
scanpy              1.6.0
scipy               1.5.2
seaborn             0.11.0
send2trash          NA
setuptools_scm      NA
sinfo               0.3.1
six                 1.15.0
sklearn             0.23.2
socks               1.7.1
soupsieve           2.0.1
statsmodels         0.12.0
storemagic          NA
tables              3.6.1
terminado           0.8.3
tornado             6.0.4
traitlets           5.0.4
urllib3             1.25.10
wcwidth             0.2.5
wrapt               1.12.1
xlsxwriter          1.3.3
zmq                 19.0.2
-----
IPython             7.18.1
jupyter_client      6.1.7
jupyter_core        4.6.3
jupyterlab          2.2.8
notebook            6.1.4
-----
Python 3.8.5 | packaged by conda-forge | (default, Aug 29 2020, 01:22:49) [GCC 7.5.0]
Linux-4.4.0-142-generic-x86_64-with-glibc2.10
64 logical CPU cores, x86_64
-----
Session information updated at 2020-09-16 11:03

Here is the error message:

---------------------------------------------------------------------------
InvalidIndexError                         Traceback (most recent call last)
<ipython-input-37-b22ada65a1cd> in <module>
      1 # Create Concatenated anndata object for all timepoints
      2 #alldays = e125.concatenate(e135, e145, e155, uns_merge="unique")
----> 3 alldays = e125.concatenate(e135)

~/miniconda3/envs/env4sc_velo_scannpy/lib/python3.8/site-packages/anndata/_core/anndata.py in concatenate(self, join, batch_key, batch_categories, uns_merge, index_unique, fill_value, *adatas)
   1696         all_adatas = (self,) + tuple(adatas)
   1697 
-> 1698         out = concat(
   1699             all_adatas,
   1700             axis=0,

~/miniconda3/envs/env4sc_velo_scannpy/lib/python3.8/site-packages/anndata/_core/merge.py in concat(adatas, axis, join, merge, uns_merge, label, keys, index_unique, fill_value, pairwise)
    799         [dim_indices(a, axis=1 - axis) for a in adatas], join=join
    800     )
--> 801     reindexers = [
    802         gen_reindexer(alt_indices, dim_indices(a, axis=1 - axis)) for a in adatas
    803     ]

~/miniconda3/envs/env4sc_velo_scannpy/lib/python3.8/site-packages/anndata/_core/merge.py in <listcomp>(.0)
    800     )
    801     reindexers = [
--> 802         gen_reindexer(alt_indices, dim_indices(a, axis=1 - axis)) for a in adatas
    803     ]
    804 

~/miniconda3/envs/env4sc_velo_scannpy/lib/python3.8/site-packages/anndata/_core/merge.py in gen_reindexer(new_var, cur_var)
    393            [1., 0., 0.]], dtype=float32)
    394     """
--> 395     return Reindexer(cur_var, new_var)
    396 
    397 

~/miniconda3/envs/env4sc_velo_scannpy/lib/python3.8/site-packages/anndata/_core/merge.py in __init__(self, old_idx, new_idx)
    265         self.no_change = new_idx.equals(old_idx)
    266 
--> 267         new_pos = new_idx.get_indexer(old_idx)
    268         old_pos = np.arange(len(new_pos))
    269 

~/miniconda3/envs/env4sc_velo_scannpy/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_indexer(self, target, method, limit, tolerance)
   2731 
   2732         if not self.is_unique:
-> 2733             raise InvalidIndexError(
   2734                 "Reindexing only valid with uniquely valued Index objects"
   2735             )

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Thanks.

Koncopd commented 4 years ago

@xie186 updating pandas should fix the problem.

xie186 commented 4 years ago

@Koncopd Thanks for your response. Updating pandas to 1.1.2 doesn't work in my case. I'm wondering whether there is a docker image or Dockerfile for the latest version from your lab.

xie186 commented 4 years ago

To make you reproduce my error, here is what I did with a Docker image:

singularity -d build sc_velocyto0.17.17_scannpy1.6.0_scvelo0.2.2.img docker://xie186/scrna_tutorial:0.1.0
sudo ufw allow 8689
singularity run --nv sc_velocyto0.17.17_scannpy1.6.0_scvelo0.2.2.img jupyter-lab --ip=<your_ip> --port=8689 --no-browser

Then in browser, open ':8689' and run the code below in jupyterlab.

The data and code I ran is shown as below:

Go to: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE132188
Search "GSE132188_RAW.tar"
Download the file. Uncompress the files:

tar xvf GSE132188_RAW.tar

You will get the following files.

GSM3852752_E12_5_counts.tar.gz
GSM3852753_E13_5_counts.tar.gz
GSM3852754_E14_5_counts.tar.gz
GSM3852755_E15_5_counts.tar.gz

Uncompress:

mkdir -p E12_5_counts/
tar zxvf GSM3852752_E12_5_counts.tar.gz --directory E12_5_counts/
mkdir -p E13_5_counts/
tar zxvf GSM3852753_E13_5_counts.tar.gz --directory E13_5_counts/
mkdir -p E14_5_counts/
tar zxvf GSM3852754_E14_5_counts.tar.gz --directory E14_5_counts/
mkdir -p E15_5_counts/
tar zxvf GSM3852755_E15_5_counts.tar.gz --directory E15_5_counts/

Code:

import numpy as np
import matplotlib.pyplot as pl
import numpy as np
import scanpy as sc
import scanpy.external as sce
import pandas as pd
from anndata import AnnData
import seaborn as sns
from scipy.sparse import csr_matrix
import networkx as nx
import xlsxwriter
from matplotlib import rcParams
import seaborn as sns
import scipy as sci
#GSEApy: Gene Set Enrichment Analysis in Python.
#import gseapy as gp
sc.settings.verbosity = 3
sc.logging.print_versions()

# Read cellranger files for all four samples
filename = './E12_5_counts/mm10/matrix.mtx'
filename_genes = './E12_5_counts/mm10/genes.tsv'
filename_barcodes = './E12_5_counts/mm10/barcodes.tsv'

e125 = sc.read(filename).transpose()
e125.var_names = np.genfromtxt(filename_genes, dtype=str)[:, 1]
e125.obs_names = np.genfromtxt(filename_barcodes, dtype=str)

filename = './E13_5_counts/mm10/matrix.mtx'
filename_genes = './E13_5_counts/mm10/genes.tsv'
filename_barcodes = './E13_5_counts/mm10/barcodes.tsv'

e135 = sc.read(filename).transpose()
e135.var_names = np.genfromtxt(filename_genes, dtype=str)[:, 1]
e135.obs_names = np.genfromtxt(filename_barcodes, dtype=str)

filename = './E14_5_counts/mm10/matrix.mtx'
filename_genes = './E14_5_counts/mm10/genes.tsv'
filename_barcodes = './E14_5_counts/mm10/barcodes.tsv'

e145 = sc.read(filename).transpose()
e145.var_names = np.genfromtxt(filename_genes, dtype=str)[:, 1]
e145.obs_names = np.genfromtxt(filename_barcodes, dtype=str)

filename = './E15_5_counts/mm10/matrix.mtx'
filename_genes = './E15_5_counts/mm10/genes.tsv'
filename_barcodes = './E15_5_counts/mm10/barcodes.tsv'

e155 = sc.read(filename).transpose()
e155.var_names = np.genfromtxt(filename_genes, dtype=str)[:, 1]
e155.obs_names = np.genfromtxt(filename_barcodes, dtype=str)

# Add dev. timepoint label for each sample
e125.obs['day'] = '12.5'
e135.obs['day'] = '13.5'
e145.obs['day'] = '14.5'
e155.obs['day'] = '15.5'

alldays = e125.concatenate(e135, e145, e155)

ivirshup commented 4 years ago

@xie186, are the variable names within each of your objects are unique?

I would guess that would be the problem. Here's a simple case that would throw this error:

import anndata as ad, numpy as np, pandas as pd

a = ad.AnnData(np.ones((3, 2)), var=pd.DataFrame(index=["a", "a"]))
b = ad.AnnData(np.ones((3, 3)), var=pd.DataFrame(index=["a", "b", "c"]))

a.concatenate(b)

I think our merge operation for the variables is only well defined if variable names are unique with each of the objects. I'm not sure there's a good default result here other than throwing an error.

xie186 commented 4 years ago

I'll make sure they are unique and try again. I'm following the note here: https://github.com/theislab/pancreatic-endocrinogenesis/blob/master/scRNA_seq_qc_preprocessing_clustering.ipynb

xie186 commented 4 years ago

@ivirshup Indeed, there are 48 indexes that are not unique which is weird to me. In the link I gave above, the authors didn't have this problem. Maybe it's because of the version differences. Thanks.

LuckyMD commented 4 years ago

adata.obs_names_make_unique() should help you here before concatenating.

xie186 commented 4 years ago

@LuckyMD Thanks for your comments. Do you have an idea why we have duplicate IDs here? Shouldn't Gene ID be unique?

LuckyMD commented 4 years ago

Ah, I missed that it's gene IDs and not obs ids. In that case it's var_names_make_unique() you need. I'm not sure why you would have duplicate genes. Sometimes it can be due to signs that aren't read in properly, i.e. maybe you have gene names ABC-1 and ABC-1 with a space or some other character?

scverse / scanpy