lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

NA returned with Warning: Embedding 8 connected components using meta-embedding (experimental) n_components #90

Open YubinXie opened 6 years ago

YubinXie commented 6 years ago

Working environment: Mac OS 13.5, Python 3.6. My data is 660K * 6 dimensions. First I tried n_neighbors=100 and it worked fine. Then I tried n_neighbors=15, which gave this warning:

lib/python3.6/site-packages/umap/spectral.py:229: UserWarning: Embedding 8 connected components using meta-embedding (experimental) n_components

And the returned embedding is all NaN. I then also tried n_neighbors=200 and 500; the embedding was all NaN in every case. I am not sure what happened.

Thank you!

lmcinnes commented 6 years ago

I suspect the spectral initialisation is failing for one reason or another. This can often happen for particularly oddly distributed data. As a workaround you can use init='random' as a parameter to UMAP. It should stop the NaNs happening at least. This isn't ideal, but it should get you past the immediate problem. I'll try to look into the deeper issue soon.


YubinXie commented 6 years ago

Thank you for your quick reply and insightful suggestion. Yes, my data distribution is odd (very sparse and low dimensional). But any idea why it worked before and then suddenly stopped working? Do different runs of UMAP affect each other? Thanks! (UMAP is a great visualization tool!)

lmcinnes commented 6 years ago

The nearest neighbor computation is approximate and somewhat stochastic, so you can get differences between runs. More likely 100 was a large enough value that it didn't shatter things too badly and break the spectral initialisation -- larger values may have been large enough to create a very small eigengap and break things in other ways. It is odd that it was that sensitive. If it is at all possible to share your data I would be interested to experiment and try to reproduce the problem -- but I certainly understand if you can't.


YubinXie commented 6 years ago

I tried init='random'; it still gives NaN when n_neighbors=15. To speed things up I used a small subset of the data, and with n_neighbors=15 it is still NaN. But with n_neighbors=200 or 500 the results are good.

The data I am using is currently confidential medical data. But will let you know once it goes public. Thanks.
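A quick way to confirm how much of an embedding came back as NaN (a plain numpy sketch; the toy array stands in for a real UMAP output):

```python
import numpy as np

# Toy stand-in for an embedding where one row failed.
embedding = np.array([[0.1, 0.2],
                      [np.nan, np.nan],
                      [0.3, 0.4]])

bad_rows = np.isnan(embedding).any(axis=1)
print(bad_rows.sum(), "of", len(embedding), "rows contain NaN")
# → 1 of 3 rows contain NaN
```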

lmcinnes commented 6 years ago

There's definitely a bug in there then -- unfortunately one I can't reproduce right now. Thanks for the report though: I'll be on the lookout for anything similar and hopefully I can track it down.


YubinXie commented 6 years ago

An update: n_neighbors=100 used to work, but when I restarted my Jupyter notebook, n_neighbors=100 also returned NaN. I would think the default random seed is the same each time, so I am not sure why this is happening.

lmcinnes commented 6 years ago

This is certainly odd. I'll have to see if I can manage to reproduce the problem to try and track down what is going astray.


JoshuaC3 commented 6 years ago

I found this issue (Embedding 6 connected components instead), and another warning,

"Random Projection forest initialisation failed due to recursion limit being reached. Something is a little strange with your data, and this may take longer than normal to compute."

when trying to embed weekday data, so I had only 7 unique rows in my np.array. After much spluttering and wasted time I got the following plot,

weekday_embedding (Interesting how it embeds each instance of the same day of the week in a slightly different (x, y) co-ordinate... @lmcinnes any intuition on this?)

My suggestion would be a check on the number of unique rows in the np.array a user provides. Still, I am not 100% certain that this was the cause of my two warnings, but hopefully it might help.

Note: the plot only shows 6 days. I took the first 1000 rows to speed up plotting. Data was ordered by weekday so we missed off the Sundays.
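The unique-row check suggested above could be sketched like this with plain numpy (the threshold of 15 is an arbitrary illustration, e.g. the n_neighbors value in play; whether the check is worth its cost on large arrays is the open question):

```python
import numpy as np

# 7 unique weekday rows, each repeated many times -- a toy
# stand-in for the weekday data described above.
days = np.arange(7, dtype=float).reshape(7, 1)
X = np.repeat(days, 1000, axis=0)   # shape (7000, 1)

n_unique = len(np.unique(X, axis=0))
if n_unique < 15:  # e.g. fewer unique rows than n_neighbors
    print(f"only {n_unique} unique rows; the neighbor graph will be degenerate")
# → only 7 unique rows; the neighbor graph will be degenerate
```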

lmcinnes commented 6 years ago

Each instance is viewed as an independent object, so even if they are identical in the data they are treated as technically separate, and thus embed into different locations.

Checking for unique rows is certainly an option, but a very expensive one computationally. There are other checks that should catch such situations, so I'm not sure whether this was technically the problem or not.


ahartikainen commented 6 years ago

I have had this problem too. I transformed some image data to (n,m), and it happened when n was large.

Maybe there is some underflow/overflow/negative-log-sqrt somewhere?

lmcinnes commented 6 years ago

It could be -- as far as I can tell it is happening in sklearn, not umap -- so it remains to track down which sklearn calls are at issue (we don't have a traceback). I'll continue looking for a consistent reproducer.


khamkarajinkya commented 6 years ago

Amazing work and repo. Love it. I'm running into similar issues.

parameters: n_neighbors = 50, min_dist = 0.05, metric = 'euclidean'

Dataset: 100000 vectors with 300 dimensions

proudquartz commented 6 years ago

Awesome work. I am having the same issue here as well.

Dataset: 102300 vectors with 95 features.

Default parameters produce very nice clusters. But when I increase n_neighbors, the embedding becomes NaN.

parameters: n_neighbors = 50, everything else is default

lmcinnes commented 6 years ago

The most recent versions on github should have fixed this -- are you running against that, or something from pip or conda?


proudquartz commented 6 years ago

I was running against the conda version. I will try the most recent version from github. Thanks!

bioguy2018 commented 5 years ago

@lmcinnes Hi, I have also faced this problem, but I managed to solve it with init='random'. Is this something that I should worry about in general?

lmcinnes commented 5 years ago

I think it is fixed, and I should push out a patch release -- thanks for the reminder.

bioguy2018 commented 5 years ago

@lmcinnes Thanks a lot. Just to mention, in my case I didn't have any NaN problem, but rather this:

UserWarning: Embedding a total of 2 separate connected components using meta-embedding (experimental) n_components

but as far as I understand it's not an issue, right? Looking forward to the update :) Thanks a lot again

lmcinnes commented 5 years ago

Yes -- that was an experimental feature in 0.3, but it seems to be working well, so you can ignore it.

sleighsoft commented 5 years ago

This might be related: https://github.com/scikit-learn/scikit-learn/issues/13393