After sc.pp.normalize_total() and log1p() there is no gene expression in UMAP plot

barbareyex commented 1 year ago

Please make sure these conditions are met

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of scanpy.
[ ] (optional) I have confirmed this bug exists on the master branch of scanpy.

What happened?

Hi,

I have two different datasets, both with raw counts. After using the function sc.pp.normalize_total and sc.pp.log1p and plotting the data with UMAP coordinates, there is no gene expression in cells coming from one of the datasets. I did the analysis separately (without concatenating) and the same happens.

I thought that maybe is a problem with the data type, but when I checked this, both anndatas.X were np.float32 and sparse.csr_matrixes ( #1612 ). Also, I made sure the anndata matrix related to the problematic dataset has acceptable values and they are not zeros.

Any idea about this problem?

Minimal code sample

sc.pp.normalize_total(adata)
sc.pp.log1p(adata)

Error output

No response

Versions

``` anndata 0.9.1 scanpy 1.8.1 sinfo 0.3.4 ----- PIL 9.1.0 anyio NA astunparse 1.6.3 attr 21.2.0 babel 2.9.1 backcall 0.2.0 brotli NA certifi 2022.12.07 cffi 1.15.0 charset_normalizer 2.0.12 cloudpickle 2.0.0 cycler 0.10.0 cython_runtime NA dask 2022.8.1 dateutil 2.8.2 debugpy 1.6.0 decorator 5.0.9 defusedxml 0.7.1 entrypoints 0.4 fastjsonschema NA fsspec 2022.7.1 google NA h5py 3.4.0 idna 3.3 igraph 0.9.6 ipykernel 6.4.0 ipython_genutils 0.2.0 jedi 0.18.1 jinja2 3.1.2 joblib 1.2.0 json5 NA jsonschema 3.2.0 jupyter_server 1.11.0 jupyterlab_server 2.8.1 kiwisolver 1.3.2 leidenalg 0.8.7 llvmlite 0.38.0 louvain 0.7.0 markupsafe 2.1.1 matplotlib 3.6.0 mpl_toolkits NA mpmath 1.3.0 natsort 7.1.1 nbclassic NA nbformat 5.1.3 numba 0.55.1 numexpr 2.7.3 numpy 1.21.6 nvfuser NA opt_einsum v3.3.0 packaging 23.1 pandas 1.5.3 parso 0.8.3 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA prometheus_client NA prompt_toolkit 3.0.31 psutil 5.9.0 ptyprocess 0.7.0 pvectorc NA pydev_ipython NA pydevconsole NA pydevd 2.8.0 pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.15.1 pyparsing 2.4.7 pyrsistent NA pytz 2022.2.1 requests 2.28.1 scipy 1.10.1 send2trash NA setuptools 67.8.0 simplejson 3.17.6 six 1.16.0 sklearn 1.2.2 sniffio 1.2.0 socks 1.7.1 sparse 0.14.0 storemagic NA sympy 1.12 tables 3.6.1 tblib 1.7.0 terminado 0.12.1 texttable 1.6.4 threadpoolctl 2.2.0 tlz 0.12.0 toolz 0.12.0 torch 2.0.1+cu117 tornado 6.1 tqdm 4.62.2 traitlets 5.1.0 typing_extensions NA unicodedata2 NA urllib3 1.26.9 wcwidth 0.2.5 websocket 1.3.2 yaml 6.0 zipp NA zmq 22.3.0 zstandard 0.18.0 ----- IPython 7.34.0 jupyter_client 7.3.0 jupyter_core 4.10.0 jupyterlab 3.1.11 notebook 6.4.11 ----- Python 3.8.13 | packaged by conda-forge ```

Zethson commented 1 year ago

Could you please provide a minimal working example of this bug? Can you subsample your AnnData objects and maybe create an artificial recreation of it here, please?

barbareyex commented 1 year ago

This is the anndata working properly:

AnnData object with n_obs × n_vars = 24759 × 29612
    obs: 'sample', 'batch', 'n_counts'
    var: 'ensembl_id', 'n_cells'
    layers: 'counts'

After normalization and logarithmize it with:

adata.X = sc.pp.normalize_total(adata, inplace=False)['X']
adata.X = sc.pp.log1p(adata.X)

And computing PCAs, neighbors and UMAP coordinates this is a plot showing the expression of GRIK1 f.e:

Then, this is the anndata that after normalization does not show gene expression in UMAP:

AnnData object with n_obs × n_vars = 17217 × 33704
    obs: 'Age', 'Condition', 'Origin', 'Region', 'Sex', 'Subject', 'louvain', 'louvain6', 'obs_names', 'sample', 'batch', 'dataset'
    var: 'dispersions', 'dispersions_norm', 'gene_ids', 'highly_variable', 'means', 'n_cells', 'var_names'
    obsm: 'X_umap'
    layers: 'counts'

Because this anndata has pre-computed UMAP coordinates and the raw data was normalized with sizefactors in R, when reading the file, adata.X is already normalized, and if I plot the UMAP for SLC5A11 f.e this is the result:

However, if I select the raw counts of this anndata (stored in layers['counts']) and normalize it with sc.pp.normalizefunction and logarithmize it, this is the output of sc.pl.umap (it doesn't matter re-computing PCAs, neighbors and UMAP):

UMAP after recomputing PCAs, etc:

scverse / scanpy