lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Error occurs when running a dataset with fewer than 4096 columns #71

Open Xiaojieqiu opened 6 years ago

Xiaojieqiu commented 6 years ago

Hi Leland,

Thanks for this incredible method and algorithm. We love it! However, we found one annoying issue when we run UMAP with metric="correlation" on an input matrix (in sparseMatrix format) that has fewer than 4096 columns. It throws this error:

Error in py_call_impl(callable, dots$args, dots$keywords) : TypeError: scipy distance metrics do not support sparse matrices.

We find that this issue can be fixed by simply removing the following lines of code from umap_.py:

# Handle small cases efficiently by computing all distances
if X.shape[0] < 4096:
    dmat = pairwise_distances(X, metric=self.metric, **self.metric_kwds)
    self.graph = fuzzy_simplicial_set(
        dmat,
        self.n_neighbors,
        random_state,
        'precomputed',
        self.metric_kwds,
        self.angular_rp_forest,
        self.set_op_mix_ratio,
        self.local_connectivity,
        self.bandwidth,
        self.verbose
    )
else:

This issue may be related to #12.

lmcinnes commented 6 years ago

Hmm, I had assumed sklearn's pairwise_distances would support sparse matrices, but it looks like correlation doesn't. I can patch this directly by adding "and metric != 'correlation'" to that condition, but I'm actually starting to wonder whether the sparse correlation metric I have really works as it should. More accurately, correlation is not really a metric supported for sparse matrices. Ultimately I need to fix that problem. In the meantime you can either convert the matrix to dense, or continue with what you are doing if that works for you.
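For reference, that guard would presumably look something like the following applied to the small-case branch quoted earlier. This is a hypothetical sketch, not the actual patch, and it assumes scipy.sparse is available inside umap_.py:

# Hypothetical sketch of the guard described above (not the actual commit).
# Skip the exact pairwise_distances shortcut when the input is sparse and the
# metric is 'correlation', since sklearn cannot compute that combination.
if X.shape[0] < 4096 and not (scipy.sparse.issparse(X) and self.metric == 'correlation'):
    dmat = pairwise_distances(X, metric=self.metric, **self.metric_kwds)
    self.graph = fuzzy_simplicial_set(
        dmat,
        self.n_neighbors,
        random_state,
        'precomputed',
        self.metric_kwds,
        self.angular_rp_forest,
        self.set_op_mix_ratio,
        self.local_connectivity,
        self.bandwidth,
        self.verbose
    )
else:
    ...  # fall through to the approximate nearest-neighbor code path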

Xiaojieqiu commented 6 years ago

Hi Leland,

Thanks for your prompt response! If you can patch the repo by adding "and metric != 'correlation'" or something similar, that would be great.

However, I am a little concerned about your comment that correlation is not really a metric supported for sparse matrices... Could you please clarify? Do you mean correlation cannot be calculated from sparse matrices mathematically (which I don't believe is the case)? Or do you mean your implementation of correlation for large datasets (with more than 4096 samples) with sparse matrices is potentially problematic? Right now, using the correlation metric on a large dataset with a sparse matrix gives us great results, although we have this issue for small datasets. So I don't think your implementation is problematic...

lmcinnes commented 6 years ago

I think the sparse version of correlation may not be a true correlation distance -- I would have to check to be sure. If it is giving you good results then it probably does work correctly, and I'm worrying over nothing. I'll fix the distance measure for now so it works correctly for your case.

Xiaojieqiu commented 6 years ago

Thank you Leland! Please let me know what the potential issue is with your current sparse-matrix-based correlation distance calculation.

lmcinnes commented 6 years ago

My concern is as follows: for correlation one should shift the data by the mean value, and then essentially compute cosine distance of the shifted data -- I believe the shifting in the sparse case ignores the zeros (as most sparse computations do -- otherwise the data would become dense, which is potentially messy for very high-dimensional sparse datasets). That makes the result not an accurate correlation distance. The error may be small in many cases, but reviewing the code I cannot guarantee accuracy.
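To make the concern concrete, here is a small dense reference implementation of correlation distance as the cosine distance of mean-centred vectors (an illustrative sketch, not UMAP's internal code):

import numpy as np

def correlation_distance(x, y):
    # Correlation distance is the cosine distance of the mean-centred vectors.
    # The mean is taken over ALL entries, zeros included; a sparse
    # implementation that centres only the stored (non-zero) entries uses a
    # different mean, so its distances can drift from this reference value.
    x = x - x.mean()
    y = y - y.mean()
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))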

lmcinnes commented 6 years ago

Okay, I believe the current master should now fix your issue and correctly compute correlation distance for sparse matrix input data. It may be a little slower, as the correct computation is a little more expensive. You may want to consider cosine distance as a faster alternative that will give similar results.
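For anyone following along, switching between the two metrics is just a constructor argument. A minimal usage sketch, where sparse_X stands in for the actual input matrix:

import umap

# Cosine is cheaper than correlation and often gives similar neighborhood
# structure; 'sparse_X' is a placeholder for the user's scipy.sparse matrix.
embedding_cosine = umap.UMAP(metric='cosine').fit_transform(sparse_X)
embedding_correlation = umap.UMAP(metric='correlation').fit_transform(sparse_X)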

Xiaojieqiu commented 6 years ago

That is great. Thank you for the quick fix and detailed explanation; that has been very helpful. I will check the updated version now. (And shall we close this issue for now?)

lmcinnes commented 6 years ago

You can close it for now, but feel free to re-open it if this doesn't resolve the issue for you.

Xiaojieqiu commented 6 years ago

By the way, one quick suggestion: since you mentioned that the fix is slower, would it be possible to first check the data for zero values before applying it, and otherwise use the original computation?

lmcinnes commented 6 years ago

If you don't have any zeros then you can convert the input to dense (use .toarray()) with no loss and use the correlation distance there. Alternatively, the quick check is whether the value of the .nnz property on the input data equals the number of samples times the number of features.
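A minimal sketch of that check, assuming X is a scipy.sparse matrix:

from scipy import sparse

def densify_if_no_implicit_zeros(X):
    # If every entry of the sparse matrix is explicitly stored (nnz equals
    # samples * features), converting to dense loses nothing and lets
    # sklearn's exact correlation distance be used instead of the sparse path.
    if sparse.issparse(X) and X.nnz == X.shape[0] * X.shape[1]:
        return X.toarray()
    return X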

Xiaojieqiu commented 6 years ago

Thank you for pointing this out! I am closing the issue for now.

Xiaojieqiu commented 6 years ago

@lmcinnes Hi Leland, thanks again for fixing this issue! I wonder whether you could help us release umap with this fix to PyPI. We are planning to release another R package (I can provide more details later) that will depend on UMAP, and we would like to make sure users can access this newest version of umap for their analysis directly from R. Unfortunately, R only provides ways to install Python packages from PyPI, not from GitHub.

lmcinnes commented 6 years ago

umap-learn 0.2.4 should now be on PyPI. Let me know if that works for you.

Xiaojieqiu commented 6 years ago

That is cool. Thanks a lot for your prompt response, I really appreciate it! I am able to install 0.2.4 from PyPI in R now.

I am reopening this issue for now in case users have any other questions, so we can discuss them here.

Xiaojieqiu commented 6 years ago

@lmcinnes Hi Leland, I did a closer comparison between the correlation, cosine, and euclidean metrics on a dataset with about 2k samples using umap-learn 0.2.4. Interestingly, I found that the UMAP embeddings from the euclidean and cosine metrics are very similar and form a continuous manifold, but the correlation metric gives a very different result in which the manifold is clustered into separate groups (see the figures below). I wonder whether there is a typo or something in your recent changes to the correlation metric for datasets with fewer than 4096 columns? (Figures: plot_cosine, plot_euclidian, monocle_plot_3d_correlation.)

lmcinnes commented 6 years ago

The easiest way to check at this point would be to cast your matrix to dense, as that will avoid running the new code and fall back to sklearn's correlation distance. If that gives notably different results then yes, there is still a bug somewhere in the sparse correlation distance computation.
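That comparison could look something like this; a sketch where sparse_X is a placeholder for the actual data and random_state is fixed only to make the two runs easier to compare:

import umap

# Dense input falls back to sklearn's exact correlation distance, so a large
# structural difference between the two embeddings points at the sparse
# correlation code path.
emb_sparse = umap.UMAP(metric='correlation', random_state=42).fit_transform(sparse_X)
emb_dense = umap.UMAP(metric='correlation', random_state=42).fit_transform(sparse_X.toarray())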

Xiaojieqiu commented 6 years ago

Thanks for the suggestions! I double-checked and indeed found that converting the sparse matrix into a dense matrix gives us a much more continuous and less separated low-dimensional embedding. Correlation metric after converting the sparse matrix to a dense matrix:
(Screenshot: 2018-06-18, 15:08:59)

lmcinnes commented 6 years ago

Okay, there are still some bugs to be worked out then. Sparse correlation is hard, unfortunately.

Xiaojieqiu commented 6 years ago

Yeah, take your time to fix it... we will use the cosine metric for now.

lmcinnes commented 6 years ago

Found it! The new package is on PyPI (0.2.5) if that is helpful. Sorry this came a little late.

sleighsoft commented 5 years ago

@lmcinnes and @Xiaojieqiu did all problems get resolved? Can this be closed?