perslab / CELLEX

CELLEX (CELL-type EXpression-specificity)
GNU General Public License v3.0
37 stars 9 forks source link

parse_input does not detect invalid metadata value types #24

Open rrydbirk opened 4 years ago

rrydbirk commented 4 years ago

I'm trying to run CELLEX on ~120k cells from a single-nucleus experiment. In the preprocessing step, the ANOVA gene filtering removes all my genes. Of course, as you describe in the workflow, I could just omit this step (ANOVA=False), however, it seems relevant to include.

What would you recommend I do? Would some initial gene filtering for low-expressed genes or similar help?

Preprocessing - checking input ... input parsed in 0 min 0 sec Preprocessing - running remove_non_expressed ... excluded 0 / 28621 genes in 0 min 56 sec Preprocessing - normalizing data ... data normalized in 2 min 18 sec Preprocessing - running ANOVA ... excluded 28621 / 28621 genes in 2 min 14 sec

tstannius commented 4 years ago

Hi Rasmus!

I will need a little more to go on. Would you kindly share the code (a minimal example) needed to reproduce this output? Please also include the output of calling

data.head()
metadata.head()

And the versions of the modules you are using.

rrydbirk commented 4 years ago

Hi Tobias

Thanks for getting back to me. This is my code. Please note, since I'm coming from R and my count matrix is too big to write to .csv, I used HDF5 instead. It's a bit dirty since I couldn't get it to transfer row and column names directly (and I'm not that familiar with Python in general):

import numpy as np
import pandas as pd
import cellex
import h5py

data = pd.DataFrame(np.array(h5py.File("data.h5")['data']))
data.columns = pd.read_csv("./cells.csv").values
data.index = pd.read_csv("./genes.csv").values
data = data.astype(dtype="int64")

metadata = pd.read_csv("./metadata.csv", index_col=0)

eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)

The requested output. I omitted sensitive data:

data.head()
               (Cell1_AAACCCAAGATCCGAG-1,)  ...  (Celln_TTTGTTGTCTAACGGT-1,)
(AL627309.1,)                                  0  ...                                  0
(AC114498.1,)                                  0  ...                                  0
(AL669831.2,)                                  0  ...                                  0
(AL669831.5,)                                  1  ...                                  0
(FAM87B,)                                      0  ...                                  0

[5 rows x 118345 columns]
>>> metadata.head()
                               cell_type
cell_id
Cell1_ACTACGAGTATGCGGA-1    Type1
Cell2_AAACCCAAGTGCGCTC-1    Type1
Cell3_ACTGATGGTAACACGG-1    Type1
Cell4_CACTGAATCACAATGC-1    Type1
Cell5_CAGGGCTCACCAACAT-1    Type1

And the versions: ➜ Cellex python Python 3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0] :: Anaconda, Inc. on linux

➜ Cellex pip list

Click to show

Package Version ---------------------------------- --------- adjustText 0.7.3 alabaster 0.7.12 anaconda-client 1.7.2 anaconda-navigator 1.9.7 anaconda-project 0.8.3 aniso8601 8.0.0 annoy 1.16.3 asn1crypto 1.0.1 astroid 2.3.1 astropy 3.2.2 atomicwrites 1.3.0 attrs 19.2.0 Babel 2.7.0 backcall 0.1.0 backports.functools-lru-cache 1.5 backports.os 0.1.1 backports.shutil-get-terminal-size 1.0.0 backports.tempfile 1.0 backports.weakref 1.0.post1 beautifulsoup4 4.8.0 bitarray 1.0.1 bkcharts 0.2 bleach 3.1.0 bokeh 1.3.4 boto 2.49.0 boto3 1.7.84 botocore 1.10.84 Bottleneck 1.2.1 cellex 1.1.1 CellPhoneDB 2.1.2 certifi 2019.9.11 cffi 1.12.3 chardet 3.0.4 click 6.7 cloudpickle 1.2.2 clyent 1.2.2 colorama 0.4.1 conda 4.8.2 conda-build 3.18.9 conda-package-handling 1.6.0 conda-verify 3.4.2 contextlib2 0.6.0 cryptography 2.7 cycler 0.10.0 Cython 0.29.13 cytoolz 0.10.0 dask 2.5.2 decorator 4.4.0 defusedxml 0.6.0 descartes 1.1.0 distributed 2.5.2 docutils 0.15.2 entrypoints 0.3 et-xmlfile 1.0.1 fastcache 1.1.0 fbpca 1.0 filelock 3.0.12 Flask 1.0.4 Flask-RESTful 0.3.8 Flask-Testing 0.7.1 fsspec 0.5.2 future 0.17.1 geosketch 0.3 gevent 1.4.0 glob2 0.7 gmpy2 2.0.8 greenlet 0.4.15 h5py 2.9.0 HeapDict 1.0.1 html5lib 1.0.1 idna 2.7 imageio 2.6.0 imagesize 1.1.0 importlib-metadata 0.23 ipykernel 5.1.2 ipython 7.8.0 ipython-genutils 0.2.0 ipywidgets 7.5.1 isort 4.3.21 itsdangerous 1.1.0 jdcal 1.4.1 jedi 0.15.1 jeepney 0.4.1 Jinja2 2.10.3 jmespath 0.9.5 joblib 0.13.2 json5 0.8.5 jsonschema 3.0.2 jupyter 1.0.0 jupyter-client 5.3.3 jupyter-console 6.0.0 jupyter-core 4.5.0 jupyterlab 1.1.4 jupyterlab-server 1.0.6 keyring 18.0.0 kiwisolver 1.1.0 lazy-object-proxy 1.4.2 libarchive-c 2.8 lief 0.9.0 llvmlite 0.29.0 locket 0.2.0 loompy 3.0.6 lxml 4.4.1 MarkupSafe 1.1.1 matplotlib 3.1.1 mccabe 0.6.1 mistune 0.8.4 mizani 0.6.0 mkl-fft 1.0.14 mkl-random 1.1.0 mkl-service 2.3.0 mock 3.0.5 more-itertools 7.2.0 mpmath 1.1.0 msgpack 0.6.1 multipledispatch 0.6.0 navigator-updater 0.2.1 nbconvert 5.6.0 nbformat 4.4.0 networkx 2.3 nltk 3.4.5 nose 1.3.7 notebook 6.0.1 numba 0.45.1 numexpr 2.7.0 numpy 1.17.2 numpy-groupies 0+unknown numpydoc 0.9.1 olefile 0.46 openpyxl 3.0.0 packaging 19.2 palettable 3.3.0 pandas 1.0.3 pandocfilters 1.4.2 parso 0.5.1 partd 1.0.0 path.py 12.0.1 pathlib2 2.3.5 patsy 0.5.1 pep8 1.7.1 pexpect 4.7.0 pickleshare 0.7.5 pika 0.12.0 Pillow 6.2.0 pip 19.2.3 pkginfo 1.5.0.1 plotnine 0.6.0 pluggy 0.13.0 ply 3.11 prometheus-client 0.7.1 prompt-toolkit 2.0.10 psutil 5.6.3 ptyprocess 0.6.0 py 1.8.0 pyarrow 0.17.1 pycodestyle 2.5.0 pycosat 0.6.3 pycparser 2.19 pycrypto 2.6.1 pycurl 7.43.0.3 pyflakes 2.1.1 Pygments 2.4.2 pylint 2.4.2 pyodbc 4.0.27 pyOpenSSL 19.0.0 pyparsing 2.4.2 pyreadr 0.2.9 pyrsistent 0.15.4 pysam 0.15.4 PySocks 1.7.1 pytest 5.2.1 pytest-arraydiff 0.3 pytest-astropy 0.5.0 pytest-doctestplus 0.4.0 pytest-openfiles 0.4.0 pytest-remotedata 0.3.2 python-dateutil 2.8.0 pytz 2019.3 PyWavelets 1.0.3 PyYAML 5.1.2 pyzmq 18.1.0 QtAwesome 0.6.0 qtconsole 4.5.5 QtPy 1.9.0 requests 2.19.1 rope 0.14.0 rpy2 3.0.5 ruamel-yaml 0.15.46 s3transfer 0.1.13 scikit-image 0.15.0 scikit-learn 0.21.3 scipy 1.3.1 scrublet 0.2.1 seaborn 0.9.0 SecretStorage 3.1.1 Send2Trash 1.5.0 setuptools 41.4.0 setuptools-scm 4.1.1 simplegeneric 0.8.1 singledispatch 3.4.0.3 six 1.12.0 snowballstemmer 2.0.0 sortedcollections 1.1.2 sortedcontainers 2.1.0 soupsieve 1.9.3 Sphinx 2.2.0 sphinxcontrib-applehelp 1.0.1 sphinxcontrib-devhelp 1.0.1 sphinxcontrib-htmlhelp 1.0.2 sphinxcontrib-jsmath 1.0.1 sphinxcontrib-qthelp 1.0.2 sphinxcontrib-serializinghtml 1.1.3 sphinxcontrib-websupport 1.1.2 spyder 3.3.6 spyder-kernels 0.5.2 SQLAlchemy 1.3.9 statsmodels 0.10.1 sympy 1.4 tables 3.5.2 tblib 1.4.0 terminado 0.8.2 testpath 0.4.2 toolz 0.10.0 tornado 6.0.3 tqdm 4.32.2 traitlets 4.3.3 tzlocal 2.1 umap-learn 0.3.10 unicodecsv 0.14.1 urllib3 1.23 velocyto 0.17.17 wcwidth 0.1.7 webencodings 0.5.1 Werkzeug 0.16.0 wheel 0.33.6 widgetsnbextension 3.5.1 wrapt 1.11.2 wurlitzer 1.0.3 xlrd 1.2.0 XlsxWriter 1.2.1 xlwt 1.3.0 zict 1.0.0 zipp 0.6.0

If you need the actual data, I can anonymize it and share it with you. Let me know if you need anything else.

tstannius commented 4 years ago

Thanks!

It makes good sense to store your data as hdf5 and we will also transition to this format in the next release :-)

It seems that the type (tuple) of the row and column names are causing the error. They should be strings instead.

Do try something along the lines of the following and let me know how it goes.

data = pd.DataFrame(np.array(h5py.File("data.h5")['data']), dtype="float32")
# I think float32 should be sufficient
cols = pd.read_csv("./cells.csv").values
cols = [c[0] for c in cols if type(c) is tuple]
# may give an error, if not all colnames were tuples and you will have to do some more work,
# but rather that than a silent error!
data.columns = cols

# same approach for idx
idx = pd.read_csv("./genes.csv").values
# do stuff ...
data.index = idx

metadata = pd.read_csv("./metadata.csv", index_col=0)

eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)
rrydbirk commented 4 years ago

Unfortunately, your code erased everything in cols:

>>> cols = pd.read_csv("./cells.csv").values
>>> cols.size
118345
>>> cols = [c[0] for c in cols if type(c) is tuple]
>>> cols.size
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'size'
>>> cols
[]

Instead, this did the trick:

import numpy as np
import pandas as pd
import cellex
import h5py

data = pd.DataFrame(np.array(h5py.File("data.h5")['data']), dtype="float32")
data.columns = pd.read_csv("./cells.csv").values
data.columns = [v[0] for v in data.columns]
data.index = pd.read_csv("./genes.csv").values
data.index = [v[0] for v in data.index]

metadata = pd.read_csv("./metadata.csv", index_col=0)

eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)
Preprocessing - checking input ... input parsed in 0 min 0 sec
Preprocessing - running remove_non_expressed ... excluded 0 / 28621 genes in 1 min 8 sec
Preprocessing - normalizing data ... data normalized in 2 min 24 sec
Preprocessing - running ANOVA ... excluded 4172 / 28621 genes in 2 min 44 sec

Also, I now see that I'm running on an old version of HDF5Array (1.10.1) since I'm still on R 3.5.1. In newer versions (at least 1.16.0), writeHDF5Array includes the option "with.dimnames" which should circumvent my problems. So, for future reference, going from R to Python with large sparsematrices, this should suffice: writeHDF5Array(matrix, "./data.h5", "data", verbose = T, with.dimnames=T)

tstannius commented 4 years ago

Glad to hear that!

I will keep the issue open until the input check function has been updated to handle this kind of issue.