scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Wrongly ordered DotPlot totals in `scanpy` 1.10.1 with Pandas 1.x #3062

Closed rgoya closed 1 month ago

rgoya commented 2 months ago


What happened?

In scanpy-1.9.8 DotPlots, the default ordering of categories is alphabetical, following whatever was requested via groupby. This also worked when multiple columns were requested, so there was no need to manually compose the alphabetical ordering of every existing combination of observations in the plot.

The default ordering in scanpy>=1.10.0 DotPlots has changed, and the resulting plots display wrong data.

The misbehaviour can be reproduced with the example from https://scanpy.readthedocs.io/en/stable/generated/scanpy.pl.dotplot.html

With the code sample below, here is the expected plot with scanpy-1.9.8 (same result as in the URL above):

image

and here is the erroneous result with scanpy-1.10.1 and 1.10.0 (wrong ordering, mismatching totals):

image

Minimal code sample

```python
import scanpy as sc

pbmc = sc.datasets.pbmc68k_reduced()

markers = {'T-cell': 'CD3D', 'B-cell': 'CD79A', 'myeloid': 'CST3'}

dp = sc.pl.dotplot(pbmc, markers, 'bulk_labels', return_fig=True)
dp.add_totals().style(dot_edge_color='black', dot_edge_lw=0.5).show()
```

Error output

(Error output is a bad plot, included in the description above.)

Versions

```
-----
anndata             0.10.7
scanpy              1.10.1
-----
IPython             8.13.2
PIL                 10.0.0
asciitree           NA
asttokens           NA
astunparse          1.6.3
backcall            0.2.0
cffi                1.15.1
cloudpickle         2.2.1
colorama            0.4.4
cycler              0.10.0
cython_runtime      NA
cytoolz             0.12.0
dask                2023.10.1
dateutil            2.8.2
decorator           5.1.1
defusedxml          0.7.1
dill                0.3.6
dot_parser          NA
entrypoints         0.4
exceptiongroup      1.1.1
executing           1.2.0
fasteners           0.17.3
flytekitplugins     NA
gmpy2               2.1.2
google              NA
h5py                3.8.0
icu                 2.11
igraph              0.11.2
jedi                0.19.1
jinja2              3.1.2
joblib              1.2.0
kiwisolver          1.4.4
legacy_api_wrap     NA
leidenalg           0.10.2
llvmlite            0.42.0
lz4                 4.3.2
markupsafe          2.1.2
matplotlib          3.8.3
mpl_toolkits        NA
mpmath              1.3.0
msgpack             1.0.5
natsort             8.3.1
numba               0.59.1
numcodecs           0.11.0
numexpr             2.7.3
numpy               1.26.4
packaging           23.1
pandas              1.5.3
parso               0.8.3
pexpect             4.8.0
pickleshare         0.7.5
plotly              5.14.1
prompt_toolkit      3.0.38
psutil              5.9.5
ptyprocess          0.7.0
pure_eval           0.2.2
pyarrow             10.0.1
pydot               1.4.2
pygments            2.15.1
pyparsing           3.0.9
pyteomics           NA
pytz                2023.3.post1
scipy               1.13.0
session_info        1.0.0
setuptools          67.7.2
setuptools_scm      NA
six                 1.16.0
sklearn             1.2.2
stack_data          0.6.2
sympy               1.11.1
tblib               1.7.0
texttable           1.6.7
threadpoolctl       3.1.0
tlz                 0.12.0
toolz               0.11.2
torch               2.1.1
torchgen            NA
tqdm                4.65.0
traitlets           5.9.0
typing_extensions   NA
wcwidth             0.2.6
xxhash              NA
yaml                5.4.1
zarr                2.14.2
zc                  NA
zipp                NA
zoneinfo            NA
-----
Python 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:17:34) [Clang 14.0.6 ]
macOS-14.4.1-x86_64-i386-64bit
-----
Session information updated at 2024-05-15 18:46
```

flying-sheep commented 2 months ago

Hi, thanks for the report!

Note that the plots in the documentation are generated on the fly when building the documentation. The plot you currently see on https://scanpy.readthedocs.io/en/stable/generated/scanpy.pl.dotplot.html has therefore been created with scanpy 1.10.1

Must be a dependency issue, I’ll try to reproduce with the environment you provided.

/edit: I can reproduce it with that environment:

environment.yml

```yaml
name: scanpy-3062
channels:
  - conda-forge
dependencies:
  - ipykernel
  - python==3.10.10
  - anndata==0.10.7
  - scanpy==1.10.1
  - IPython==8.13.2
  - pillow==10.0.0
  - astunparse==1.6.3
  - backcall==0.2.0
  - cffi==1.15.1
  - cloudpickle==2.2.1
  - colorama==0.4.4
  - cycler==0.10.0
  - cytoolz==0.12.0
  - dask==2023.10.1
  #- dateutil==2.8.2
  - decorator==5.1.1
  - defusedxml==0.7.1
  - dill==0.3.6
  - entrypoints==0.4
  - exceptiongroup==1.1.1
  - executing==1.2.0
  - fasteners==0.17.3
  - gmpy2==2.1.2
  - h5py==3.8.0
  #- icu==2.11
  - python-igraph==0.11.2
  - jedi==0.19.1
  - jinja2==3.1.2
  - joblib==1.2.0
  - kiwisolver==1.4.4
  - leidenalg==0.10.2
  - llvmlite==0.42.0
  - lz4==4.3.2
  - markupsafe==2.1.2
  - matplotlib==3.8.3
  - mpmath==1.3.0
  #- msgpack==1.0.5
  - natsort==8.3.1
  - numba==0.59.1
  - numcodecs==0.11.0
  - numexpr==2.7.3
  - numpy==1.26.4
  - packaging==23.1
  - pandas==1.5.3
  - parso==0.8.3
  - pexpect==4.8.0
  - pickleshare==0.7.5
  - plotly==5.14.1
  - prompt_toolkit==3.0.38
  - psutil==5.9.5
  - ptyprocess==0.7.0
  - pure_eval==0.2.2
  - pyarrow==10.0.1
  - pydot==1.4.2
  - pygments==2.15.1
  - pyparsing==3.0.9
  - pytz==2023.3.post1
  - scipy==1.13.0
  #- session_info==1.0.0
  #- setuptools==67.7.2
  - six==1.16.0
  - scikit-learn==1.2.2
  - stack_data==0.6.2
  - sympy==1.11.1
  - tblib==1.7.0
  - texttable==1.6.7
  - threadpoolctl==3.1.0
  #- tlz==0.12.0
  - toolz==0.11.2
  #- pytorch==2.1.1
  - tqdm==4.65.0
  - traitlets==5.9.0
  - wcwidth==0.2.6
  #- yaml==5.4.1
  - zarr==2.14.2
```

flying-sheep commented 2 months ago

OK, pretty sure this is because your environment uses pandas 1.5

You can work around it for now by setting `dp.categories_order = dp.dot_color_df.index`.

rgoya commented 2 months ago

Thanks for the quick response, @flying-sheep!

I can confirm that updating to pandas-2.2.2 does fix this. I totally missed this possibility; it's not clear to me why the dots would change ordering but the totals wouldn't (maybe scanpy relies on default pandas behaviour that changed between 1.x and 2.x?). That said, pandas-2.x unfortunately breaks some dependencies in our environment, so I'll either pin scanpy or use your workaround.

Regarding the ordering and the issue title change: maybe a nit, but it's my understanding that the default ordering is alphabetical (which makes perfect sense as a default!). If this is correct, then I'd suggest that what is wrongly ordered is not the totals, but the categories themselves.

Given this, the workaround that gives me the expected behaviour would be dp.categories_order = dp.dot_color_df.index.sort_values():

image
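
To illustrate what the sorted variant does, here is a toy pandas-only sketch. The labels and counts are hypothetical and this is not scanpy's internal code; it only shows that `Index.sort_values()` yields an alphabetical order, and that selecting per-group totals with that order keeps each total attached to its own label.

```python
import pandas as pd

# Hypothetical per-group totals, stored in an arbitrary order
totals = pd.Series({"gamma": 240, "alpha": 95, "beta": 13})

# Hypothetical row order in which the dots happen to be drawn
plot_rows = pd.Index(["beta", "gamma", "alpha"])

# Alphabetical order, analogous to dp.dot_color_df.index.sort_values()
alphabetical = plot_rows.sort_values()

# Label-based selection keeps every total paired with its own group
aligned = totals.loc[alphabetical]
print(aligned)
```

Because the selection is label-based, the same ordering can be applied to rows and totals without the two drifting apart.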