scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

tl.leiden suddenly produces different results #2956

Closed Celine-Serry closed 6 months ago

Celine-Serry commented 7 months ago

What happened?

When I previously performed leiden clustering on my data, the shape of the UMAP changed, as expected.

However, when I now try to reproduce my results, I can only get a leiden clustering that follows the distribution of the unclustered UMAP.

Unclustered UMAP: UMAP_ADvsCT_3-18-2024

Clustered UMAP: UMAP_ADvsCT

The dataset with which I produced the clustered UMAP:

adata3
Out[505]: 
AnnData object with n_obs × n_vars = 13243 × 10850
    obs: 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'sample', 'group', 'disease_status', 'leiden'
    var: 'gene_ids', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'disease_status_colors', 'hvg', 'leiden', 'leiden_colors', 'log1p', 'neighbors', 'pca', 'sample_colors', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

The dataset with which I produced the unclustered UMAP:

adata
Out[518]: 
AnnData object with n_obs × n_vars = 13243 × 10850
    obs: 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'pct_counts_hb', 'percent_mt2', 'n_counts', 'sample', 'group', 'disease_status', 'leiden'
    var: 'gene_ids', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'log1p', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_colors'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

I've tried to check whether the data is somehow different, but I don't see anything that could be causing these differences. Could you please help me figure out why the leiden clustering suddenly produces different results?

Minimal code sample

sc.pp.neighbors(adata, n_pcs = 30, n_neighbors = 20)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution = 0.2) 
sc.pl.umap(adata, color='leiden')

Error output

No response

Versions

```
sc.logging.print_versions()
-----
anndata 0.10.5.post1
scanpy 1.9.8
-----
PIL 9.4.0
PyQt5 NA
adjustText 1.0.4
asttokens NA
atomicwrites 1.4.1
bottleneck 1.3.5
brotli NA
bs4 4.12.2
certifi 2024.02.02
cffi 1.15.1
chardet 4.0.0
charset_normalizer 2.0.4
cloudpickle 2.2.1
colorama 0.4.6
comm 0.2.1
cycler 0.10.0
cython_runtime NA
cytoolz 0.12.0
dask 2023.6.0
dateutil 2.8.2
debugpy 1.8.1
decorator 5.1.1
defusedxml 0.7.1
dill 0.3.8
executing 2.0.1
gseapy 1.1.2
h5py 3.9.0
html5lib 1.1
idna 3.4
igraph 0.11.3
ipykernel 6.29.2
jedi 0.19.1
jinja2 3.1.2
joblib 1.3.2
kiwisolver 1.4.4
leidenalg 0.10.2
llvmlite 0.42.0
lxml 5.1.0
lz4 4.3.2
markupsafe 2.1.1
matplotlib 3.7.2
matplotlib_inline 0.1.6
mkl 2.4.1
mpl_toolkits NA
natsort 8.4.0
numba 0.59.0
numexpr 2.8.4
numpy 1.24.3
packaging 23.1
pandas 2.0.3
parso 0.8.3
patsy 0.5.3
pickleshare 0.7.5
platformdirs 3.10.0
prompt_toolkit 3.0.42
psutil 5.9.0
pure_eval 0.2.2
pyarrow 11.0.0
pycparser 2.21
pydeseq2 0.4.7
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.17.2
pynndescent 0.5.11
pyparsing 3.0.9
pythoncom NA
pytz 2023.3.post1
pywintypes NA
requests 2.31.0
ruamel NA
scipy 1.12.0
seaborn 0.13.2
session_info 1.0.0
sip NA
six 1.16.0
sklearn 1.4.1.post1
socks 1.7.1
soupsieve 2.4
sparse 0.15.1
sphinxcontrib NA
spyder 5.5.1
spyder_kernels 2.5.0
spydercustomize NA
stack_data 0.6.2
statsmodels 0.14.0
tblib 1.7.0
texttable 1.7.0
threadpoolctl 3.3.0
tlz 0.12.0
toolz 0.12.0
torch 2.2.0+cpu
torchgen NA
tornado 6.3.2
tqdm 4.66.2
traitlets 5.7.1
typing_extensions NA
umap 0.5.5
urllib3 1.26.18
wcwidth 0.2.13
webencodings 0.5.1
win32api NA
win32com NA
yaml 6.0
zipp NA
zmq 25.1.2
zope NA
zstandard 0.19.0
-----
IPython 8.21.0
jupyter_client 8.6.0
jupyter_core 5.3.0
-----
Python 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:40:50) [MSC v.1937 64 bit (AMD64)]
Windows-10-10.0.22631-SP0
-----
Session information updated at 2024-03-25 14:49
```

grst commented 7 months ago

Hi @Celine-075,

Constructing the neighborhood graph is not guaranteed to be reproducible across different machines, package versions, or numbers of CPUs

(see also https://github.com/scverse/scanpy/issues/2014).

If all of these things are constant between your two runs, then it's a bug.

Also: are you sure the clustering has actually changed (by comparing the cell barcodes)? Or is it just the UMAP that looks different, while the clusters are the same?
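
One way to check whether the neighborhood graph itself differs between two runs is to compare the stored sparse matrices directly. The sketch below is only an illustration: `same_graph` is a hypothetical helper (not part of scanpy), and the small matrices stand in for `adata.obsp["connectivities"]` from two objects whose cells are in the same order.

```python
import numpy as np
from scipy import sparse


def same_graph(a, b, tol=0.0):
    """Return True if two sparse adjacency matrices agree within `tol`.

    Hypothetical helper; assumes rows/columns of both matrices refer to
    the same cells in the same order.
    """
    if a.shape != b.shape:
        return False
    # Largest absolute difference over all entries (0 if the graphs match).
    return bool(abs(a - b).max() <= tol)


# Stand-ins for adata.obsp["connectivities"] from two runs (made-up values).
g1 = sparse.csr_matrix(np.array([[0, 1, 0], [1, 0, 0.5], [0, 0.5, 0]]))
g2 = g1.copy()
g3 = sparse.csr_matrix(np.array([[0, 1, 0.2], [1, 0, 0], [0.2, 0, 0]]))

print(same_graph(g1, g2))
print(same_graph(g1, g3))
```

If the graphs differ, differences in both the UMAP layout and the leiden labels are to be expected, since both are computed from that graph.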

Celine-Serry commented 7 months ago

Ah I see. I did produce the results on the same machine with the same package version and number of CPUs.

The clustering seems to have changed, which becomes visible from these plots:

image

This is the unclustered map, where you can see that the bottom group in cluster 1 (orange) was actually pulled toward the group in cluster 2 (green) when I clustered them initially:

clustered_UMAP

So here you see that part of cluster 1 is actually added to cluster 2 (which also makes sense when looking at the expression profiles of those groups).

ivirshup commented 7 months ago

@Celine-075, I'm confused about this statement:

> When i previously performed leiden clustering on my data, the shape of the UMAP changed, as expected.

This is not expected unless you recompute UMAP. What do you mean by clustered UMAP?

> So here you see that part of cluster 1 is actually added to cluster 2 (which also make sense when looking at the expression profiles of those groups).

I'm not sure I can see that, since it's not obvious which point in the first plot corresponds to a point in the other plot. I think a confusion matrix (or using the same UMAP layout) would be a more appropriate way to compare the clusterings here.
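
A confusion matrix between two clusterings can be sketched with `pandas.crosstab`. The barcodes and labels below are made up for illustration; in practice the two series would be the `leiden` columns of the two objects (for example `adata3.obs["leiden"]` and `adata.obs["leiden"]`), aligned on cell barcode.

```python
import pandas as pd

# Hypothetical cluster labels from two runs, indexed by cell barcode.
old = pd.Series(
    {"c1": "0", "c2": "0", "c3": "1", "c4": "1", "c5": "2", "c6": "2"},
    name="leiden_old",
)
# The second run lists the same cells in a different order.
new = pd.Series(
    {"c6": "2", "c5": "2", "c4": "2", "c3": "1", "c2": "0", "c1": "0"},
    name="leiden_new",
)

# Align on barcode so each row compares the same cell.
new_aligned = new.reindex(old.index)

# Rows: old clusters, columns: new clusters, entries: number of cells.
confusion = pd.crosstab(old, new_aligned)
print(confusion)
```

Each row that has counts in more than one column marks an old cluster whose cells were split across new clusters. Cluster numbers themselves may be permuted between runs, so the pattern of counts matters more than matching labels.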

Celine-Serry commented 7 months ago

Hmm okay, I thought leiden clustering pulls cells with similar expression closer to each other in UMAP space? By clustered UMAP, I mean the UMAP produced after I performed leiden clustering on it. By unclustered, I mean that I just plotted the UMAP without calculating the leiden clusters.

Then I don't know what happened, but when I plotted the UMAP without leiden clustering, it had a different shape than after I calculated the leiden clusters. I will check the confusion matrix and come back when I have the results. In the meantime I can only post this image, where I put both UMAPs next to each other and drew what I meant about part of cluster 1 being added to cluster 2 after performing the leiden clustering:

change-umap

ivirshup commented 7 months ago

> Hmm okay, I thought leiden clustering pulls cells with similar expression closer to each other on the UMAP space?

No. leiden clustering is just trying to divide the data into a discrete set of clusters. The only output of clustering is a cluster label on each point and some parameters.

I do think people overload the term "cluster", so the confusion here is understandable.

A new UMAP will be generated if you call sc.tl.umap. E.g.

sc.pp.neighbors(adata, n_pcs = 30, n_neighbors = 20)
sc.tl.umap(adata)  # Computes a UMAP layout
sc.tl.leiden(adata, resolution = 0.2)  # Computes a clustering
sc.pl.umap(adata, color='leiden')

If you just want to cluster with different parameters, you can call sc.tl.leiden again. See for example the clustering at multiple resolutions here: https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering.html#manual-cell-type-annotation

I would note that both leiden and umap rely on the graph generated by sc.pp.neighbors.

Celine-Serry commented 6 months ago

Ah, I see now that recalculating sc.pp.neighbors fixes my problem. Thanks!

flying-sheep commented 6 months ago

Great to hear! I’ll close this then, but if you have more questions or concerns about reproducibility, feel free to comment here or make a new issue!