theislab / scib

Benchmarking analysis of data integration tools
MIT License
283 stars 62 forks source link

scanorama does not subset to HVG list #382

Open mbuttner opened 1 year ago

mbuttner commented 1 year ago

Hi there,

if I incorporate an HVG list in scanorama, the list is ignored and scanorama is run on all genes.

I defined an HVG list as follows (pancreatic islet dataset as example, downloaded from figshare):

hvg_list = list(adata.var_names[adata.var['highly_variable']])
print(hvg_list[:10])

Output:

['A1CF',
 'A2M',
 'AADAC',
 'ABAT',
 'ABCA1',
 'ABCB1',
 'ABCC3',
 'ABCC8',
 'ABCG1',
 'ABCG2']
adata_scanorama = scib.integration.scanorama(adata = adata, batch = batch_key, hvg = hvg_list)

e.g.

/home/marenbuettner/miniconda3/envs/dataint_r/lib/python3.8/site-packages/scanorama/scanorama.py:378: DeprecationWarning: Please use `csr_matrix` from the `scipy.sparse` namespace, the `scipy.sparse.csr` namespace is deprecated.
  elif issubclass(type(ds), scipy.sparse.csr.csr_matrix):
Found 19093 genes among all datasets

Here are my versions:

-----
anndata     0.8.0
scanpy      1.9.1
-----
PIL                         9.4.0
anndata2ri                  1.2.dev11
anyio                       NA
asttokens                   NA
attr                        22.2.0
babel                       2.11.0
backcall                    0.2.0
backports                   NA
brotli                      NA
certifi                     2022.12.07
cffi                        1.15.1
charset_normalizer          3.1.0
colorama                    0.4.6
comm                        0.1.2
cycler                      0.10.0
cython_runtime              NA
dateutil                    2.8.2
debugpy                     1.6.6
decorator                   5.1.1
defusedxml                  0.7.1
deprecated                  1.2.13
executing                   1.2.0
fastjsonschema              NA
google                      NA
h5py                        3.8.0
idna                        3.4
igraph                      0.10.4
importlib_resources         NA
ipykernel                   6.21.1
ipython_genutils            0.2.0
ipywidgets                  8.0.6
jedi                        0.18.2
jinja2                      3.1.2
joblib                      1.2.0
json5                       NA
jsonschema                  4.17.3
jupyter_events              0.6.3
jupyter_server              2.2.1
jupyterlab_server           2.19.0
kiwisolver                  1.4.4
leidenalg                   0.9.1
llvmlite                    0.39.1
louvain                     0.8.0
markupsafe                  2.1.2
matplotlib                  3.6.3
matplotlib_inline           0.1.6
mpl_toolkits                NA
natsort                     8.2.0
nbformat                    5.7.3
numba                       0.56.4
numexpr                     2.8.4
numpy                       1.23.5
packaging                   23.0
pandas                      1.5.3
parso                       0.8.3
pexpect                     4.8.0
pickleshare                 0.7.5
pkg_resources               NA
platformdirs                3.0.0
prometheus_client           NA
prompt_toolkit              3.0.36
psutil                      5.9.4
ptyprocess                  0.7.0
pure_eval                   0.2.2
pvectorc                    NA
pycparser                   2.21
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.9.5
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pygments                    2.14.0
pynndescent                 0.5.8
pyparsing                   3.0.9
pyrsistent                  NA
pythonjsonlogger            NA
pytz                        2022.7.1
pytz_deprecation_shim       NA
requests                    2.29.0
rfc3339_validator           0.1.4
rfc3986_validator           0.1.1
rpy2                        3.5.11
scib                        1.1.3
scipy                       1.10.1
seaborn                     0.12.2
send2trash                  NA
session_info                1.0.0
six                         1.16.0
sklearn                     1.2.1
sniffio                     1.3.0
socks                       1.7.1
stack_data                  0.6.2
statsmodels                 0.13.5
texttable                   1.6.7
threadpoolctl               3.1.0
tornado                     6.2
tqdm                        4.64.1
traitlets                   5.9.0
typing_extensions           NA
tzlocal                     NA
umap                        0.5.3
upsetplot                   0.8.0
urllib3                     1.26.15
wcwidth                     0.2.6
websocket                   1.5.1
wrapt                       1.14.1
yaml                        6.0
zipp                        NA
zmq                         25.0.0
-----
IPython             8.9.0
jupyter_client      8.0.2
jupyter_core        5.2.0
jupyterlab          3.6.1
notebook            6.5.2
-----
Python 3.8.16 | packaged by conda-forge | (default, Feb  1 2023, 16:01:55) [GCC 11.3.0]
Linux-5.4.0-135-generic-x86_64-with-glibc2.10
-----
Session information updated at 2023-04-26 16:52
scanpy==1.9.1 anndata==0.8.0 umap==0.5.3 numpy==1.23.5 scipy==1.10.1 pandas==1.5.3 scikit-learn==1.2.1 statsmodels==0.13.5 python-igraph==0.10.4 louvain==0.8.0 pynndescent==0.5.8
luxeredias commented 3 months ago

I was using the harmony integration method in my pipeline and realized passing HVGs was not doing anything. Looking into the code, it's clear that the hvg argument is only used within the check_sanity function here and further in the check_hvg function here, which ultimately does not subset the input adata to keep only the hvgs:

def check_hvg(hvg, adata_var):
    if type(hvg) is not list:
        raise TypeError("HVG list is not a list")
    else:
        if not all(i in adata_var.index for i in hvg):
            raise ValueError("Not all HVGs are in the adata object")