lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

unexpected UMAP embeddings with 0.5.2 release using single-cell gene expression data via Scanpy #798

Closed: cornhundred closed this issue 3 years ago

cornhundred commented 3 years ago

Hi, we are seeing unexpected UMAP embeddings with umap-learn version 0.5.2, run via Scanpy, on our single-cell gene expression data (publicly available MERFISH data from Vizgen).

Our original embedding using version 0.5.1 looks like

[Image: Screen Shot 2021-10-29 at 3 46 04 PM]

and the embedding with 0.5.2 looks like

[Image: Screen Shot 2021-10-29 at 3 46 24 PM]

Zooming into the 0.5.2 embedding reveals that cells appear to be embedded into a lattice-like structure:

[Image: Screen Shot 2021-10-29 at 3 46 57 PM]

We're wondering if this is caused in part by some sort of rounding error in the embedding.

We have included Colab notebooks demonstrating the normal behavior using version 0.5.1 and the new unexpected behavior using version 0.5.2. Please let us know if you have any issues running the notebooks - they require authentication via Google to load the publicly available data and there are static and interactive versions of the UMAP embeddings.

The only difference between the notebooks is whether we use pip to install a specific version of umap-learn or let Scanpy pull in its own version.

# pinning to previous 0.5.1 version
# otherwise scanpy grabs umap-learn==0.5.2 (see below)
###########################################
!pip install -q umap-learn==0.5.1

We also tested the basic usage examples from the documentation, and they appear to work with the new 0.5.2 version; see the Colab notebook Basic_Usage_Test-UMAP_0.5.2.ipynb.

cornhundred commented 3 years ago

We can also reproduce this issue using the Scanpy pbmc3k tutorial example:

[Image: Screen Shot 2021-10-29 at 4 25 46 PM]

lmcinnes commented 3 years ago

That is definitely disconcerting. I'll try to look into what the issue may be. It looks rather like you are getting just the spectral initialization out, instead of the UMAP embedding.

lmcinnes commented 3 years ago

So, first of all, basic UMAP seems to be working and doesn't produce results like this (as you note, the basic usage tutorial seems to work). That's a good start, as at least the package itself isn't broken. That means the problem is most likely in the interaction between scanpy and UMAP.

My best guess, at first glance, is that some parameter options may have shuffled around. Ideally everything should be keyword-only (I haven't followed scikit-learn in enforcing that and making it standard yet, but this is a good reason why I should), but if it is called positionally in scanpy, that might be the problem.
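To make that guess concrete, here is a minimal sketch of how a positional call can silently break when parameters are reordered, and how keyword-only parameters guard against it. The functions `embed_v1`, `embed_v2`, and `embed_kwonly` are hypothetical and exist only for illustration; they are not part of umap-learn or scanpy.

```python
# Hypothetical functions for illustration only; not umap-learn's real API.

def embed_v1(data, n_components, n_epochs):
    """Original parameter order."""
    return {"n_components": n_components, "n_epochs": n_epochs}


def embed_v2(data, n_epochs, n_components):
    """Same parameters, reordered in a later release."""
    return {"n_components": n_components, "n_epochs": n_epochs}


def embed_kwonly(data, *, n_components, n_epochs):
    """The bare * makes every parameter after it keyword-only."""
    return {"n_components": n_components, "n_epochs": n_epochs}


data = [[0.0, 1.0], [1.0, 0.0]]

# A positional caller silently swaps the two values once the order changes.
before = embed_v1(data, 2, 200)
after = embed_v2(data, 2, 200)

# A keyword caller is unaffected by the reordering.
safe = embed_v2(data, n_components=2, n_epochs=200)

# Passing keyword-only parameters positionally fails loudly instead.
try:
    embed_kwonly(data, 2, 200)
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

With keyword-only parameters, a caller that relied on position gets a `TypeError` at call time rather than a plausible-looking but wrong result.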

lmcinnes commented 3 years ago

Alright, I think I see the problem. Included in 0.5.2 is commit e442bcd9323fd218fc4a3a6287baa1067512dfe1, which allows n_epochs to be zero in order to get the initial embedding out, something several people wanted. Unfortunately, scanpy internally sets n_epochs = 0, which used to be a way to get an automatically chosen value. That now needs to be n_epochs=None. You can work around this right now by setting maxiter in the scanpy call. A value of 200 is probably good.

Edit: I should note that this is in calls to an internal umap function, not the public API, which remained the same.
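The change in meaning of the sentinel value can be sketched roughly as follows. This is a simplified model and not the actual umap-learn or scanpy source: the helper names `resolve_epochs_old`, `resolve_epochs_new`, and `caller_epochs` are hypothetical, and `DEFAULT_EPOCHS = 200` is an illustrative stand-in for whatever automatic value the library really picks.

```python
from typing import Optional

# Illustrative fallback only; the real automatic value in umap-learn may differ.
DEFAULT_EPOCHS = 200


def resolve_epochs_old(n_epochs: int) -> int:
    """Older behaviour as described above: 0 means 'choose automatically'."""
    return DEFAULT_EPOCHS if n_epochs == 0 else n_epochs


def resolve_epochs_new(n_epochs: Optional[int]) -> int:
    """0.5.2 behaviour: None means 'choose automatically', 0 means 'run no epochs'."""
    return DEFAULT_EPOCHS if n_epochs is None else n_epochs


def caller_epochs(maxiter: Optional[int], auto_sentinel: Optional[int]) -> Optional[int]:
    """Hypothetical wrapper: forward maxiter if given, else the 'automatic' sentinel."""
    return auto_sentinel if maxiter is None else maxiter


# Wrapper passes 0 as the sentinel; the old library treats it as automatic.
old_behaviour = resolve_epochs_old(caller_epochs(None, 0))

# Same wrapper against the new library: 0 is taken literally, so no
# optimization runs and only the initialization comes back.
broken = resolve_epochs_new(caller_epochs(None, 0))

# Fix on the wrapper side: pass None as the sentinel.
fixed = resolve_epochs_new(caller_epochs(None, None))

# Workaround on the user side: set maxiter explicitly.
workaround = resolve_epochs_new(caller_epochs(200, 0))
```

In this model, `broken` comes out as 0 epochs, which matches the symptom of getting only the initial embedding, while both the `None` sentinel and an explicit `maxiter` restore a nonzero epoch count.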

cornhundred commented 3 years ago

Thanks @lmcinnes for the quick response. Would you all want to roll back that change, since it effectively changed the API while the version number would be interpreted as only a bug fix? Otherwise Scanpy will have to roll out some sort of update.

lmcinnes commented 3 years ago

It was an update to a function that isn't part of the public-facing API, so I was not anticipating issues. I've submitted a change to scanpy that should resolve the issue. I would be happy to discuss options, but would rather not make a roll-back release if I don't have to.

cornhundred commented 3 years ago

OK, that makes sense. Thanks again for the quick response.

cornhundred commented 3 years ago

Closing since this is something that Scanpy will resolve.