scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.91k stars 601 forks source link

Subsetted adata object not behaving as expected #2007

Open oligomyeggo opened 3 years ago

oligomyeggo commented 3 years ago

Hello, I am working with an adata object (adata.shape produces (8648, 18074)) that I have subset to only include 990 genes of interest (and only include cells that express my genes of interest), with the hopes of clustering cells based on expression of my genes of interest (I got this idea from issue #510). After I subset my adata object, I confirmed that the shape of adata_sub is as expected (adata_sub.shape produces (6603, 990)). However, after running a new embedding and clustering on adata_sub, I have noticed that I can plot genes that shouldn't be in adata_sub (but were in adata), and that when I run sc.tl.rank_genes_groups my results aren't restricted to my 990 genes of interest. I am guessing that I subsetted my data incorrectly (though, why would I have the correct shape?).

Minimal code sample (that we can copy&paste without having any data)

# subset adata to genes of interest
adata_sub = adata[:, [g in genes_list for g in adata.var_names]].copy()

# filter out cells that don't express any genes of interest
sc.pp.filter_cells(adata_sub, min_genes=1)

# run new embedding and clustering
sc.pp.pca(adata_sub, n_comps=50, use_highly_variable=False, svd_solver='arpack')
sc.pp.neighbors(adata_sub)
sc.tl.umap(adata_sub)
sc.tl.leiden(adata_sub, key_added='leiden_sub')

When I use sc.pl.umap(adata_sub) to plot expression of a gene that is not one of my genes of interest, it is still plotted (I would expect an error telling me that the gene is not found in my adata_sub object). Similarly, the results of sc.tl.rank_genes_groups(adata_sub, groupby='leiden_sub', key_added='rank_genes_sub', method='wilcoxon') returns top ranked genes that are not (or should not be) in my adata_sub object.

Thank you for any help/clarification as to what's going on!

Versions

----- anndata 0.7.6 scanpy 1.8.1 sinfo 0.3.4 ----- PIL 8.3.2 anyio NA attr 21.2.0 babel 2.9.1 backcall 0.2.0 beta_ufunc NA binom_ufunc NA certifi 2019.11.28 cffi 1.14.6 chardet 3.0.4 cycler 0.10.0 cython_runtime NA dateutil 2.8.2 debugpy 1.4.3 decorator 5.1.0 defusedxml 0.7.1 entrypoints 0.3 gi 3.36.0 gio NA glib NA gobject NA gtk NA h5py 3.4.0 idna 2.8 igraph 0.9.6 ipykernel 6.4.1 ipython_genutils 0.2.0 ipywidgets 7.6.4 jedi 0.18.0 jinja2 3.0.1 joblib 1.0.1 json5 NA jsonschema 3.2.0 jupyter_server 1.11.0 jupyterlab_server 2.8.1 kiwisolver 1.3.2 leidenalg 0.8.7 llvmlite 0.37.0 markupsafe 2.0.1 matplotlib 3.4.3 matplotlib_inline NA mpl_toolkits NA natsort 7.1.1 nbclassic NA nbformat 5.1.3 nbinom_ufunc NA networkx 2.6.3 numba 0.54.0 numexpr 2.7.3 numpy 1.20.0 packaging 21.0 pandas 1.3.2 parso 0.8.2 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA prometheus_client NA prompt_toolkit 3.0.20 ptyprocess 0.7.0 pvectorc NA pycparser 2.20 pydev_ipython NA pydevconsole NA pydevd 2.4.1 pydevd_concurrency_analyser NA pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.10.0 pynndescent 0.5.4 pyparsing 2.4.7 pyrsistent NA pytz 2021.1 requests 2.22.0 scipy 1.7.1 seaborn 0.11.2 send2trash NA sitecustomize NA six 1.14.0 sklearn 0.24.2 sniffio 1.2.0 statsmodels 0.13.0rc0 storemagic NA tables 3.6.1 terminado 0.12.1 texttable 1.6.4 tornado 6.1 traitlets 5.1.0 typing_extensions NA umap 0.5.1 urllib3 1.25.8 wcwidth 0.2.5 websocket 1.2.1 zmq 22.3.0 ----- IPython 7.27.0 jupyter_client 7.0.2 jupyter_core 4.7.1 jupyterlab 3.1.11 notebook 6.4.3 ----- Python 3.8.10 (default, Jun 2 2021, 10:49:15) [GCC 9.4.0] Linux-5.10.25-linuxkit-x86_64-with-glibc2.29 6 logical CPU cores, x86_64 ----- Session information updated at 2021-09-30 18:02
oligomyeggo commented 3 years ago

I figured it out. I forgot that I should (or at least, I think I should?) set adata_sub.raw = adata.sub. Adding this step before running a new embedding and clustering seemed to have fixed my issue.

I guess a follow-up question would be is this an acceptable approach? I stored my full data set in raw after log-normalizing my data (so, adata.raw = adata) during initial preprocessing. Since adata_sub is just a subset of adata, I am guessing it is ok to set adata_sub.raw = adata.sub?

Apologies for opening this issue based on what was an oversight on my part.