lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

RecursionError #99

Closed sametdumankaya closed 5 years ago

sametdumankaya commented 6 years ago

Hi, thanks for the great package.

I have a dataset with 200,000 rows and 15 columns. I tried to apply UMAP as follows:

embedding = umap.UMAP(n_neighbors=5, min_dist=0.3, metric='correlation').fit_transform(data)

After about 10 seconds, I got the following exceptions:

  • RecursionError: maximum recursion depth exceeded while calling a Python object
  • return make_angular_tree(data, indices, rng_state, leaf_size) SystemError: CPUDispatcher(<function angular_random_projection_split at 0x000001C8260D6378>) returned a result with an error set

I raised the system recursion limit to 10000 as below and tried again, but Python then exited with an error code like -143537645.

sys.setrecursionlimit(10000)

Is there any solution, workaround or anything I can do for this problem?

Thanks.

lmcinnes commented 6 years ago

This is an odd issue that I never fully tracked down -- it seems to depend on an odd data distribution (often involving duplicate points). What is happening is that the random projection tree recursively splits the data into smaller and smaller pieces. Apparently we hit the recursion limit. In practice we should expect the data to be split approximately in half each time, so the tree depth should be expected to be around log_2(200000) ~ 18. Somehow, instead we have a tree depth that has exceeded 10000, so the splitting is working very strangely.
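For a sense of scale, the expected depth under roughly even splits can be checked directly (a quick arithmetic sketch, nothing UMAP-specific):

```python
import math

# With roughly even splits, random projection tree depth scales as log2(n):
n_samples = 200_000
expected_depth = math.log2(n_samples)
print(round(expected_depth, 1))  # ≈ 17.6, nowhere near a 10000-deep recursion
```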

One potential solution is to add a small amount of noise to the data (smaller than the smallest distances between non-identical samples). This may work around the problem for you.
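A minimal sketch of that workaround, assuming numpy; the `jitter` helper is hypothetical and the noise scale is an assumption you should tune to stay below the smallest distance between non-identical samples:

```python
import numpy as np

rng = np.random.default_rng(42)

def jitter(data, scale=1e-10):
    # Add noise far smaller than the smallest distance between
    # non-identical samples, so duplicates become distinct while the
    # embedding is otherwise essentially unaffected.
    return data + rng.normal(0.0, scale, size=data.shape)

# Usage (data stands in for your (n_samples, n_features) array):
data = np.vstack([np.ones((3, 5)), np.zeros((2, 5))])  # contains duplicates
noisy = jitter(data)
print(len(np.unique(noisy, axis=0)))  # 5: duplicate rows are now distinct
```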


diego-vicente commented 6 years ago

I just ran into the same issue, and indeed removing duplicate points did solve the problem. I used a simple workaround, but I am not sure whether it is justified or whether it actually affects the method: can we safely remove the duplicates, embed the unique points, and then reconstruct the original array by re-duplicating the embeddings?

My intuition about the technique is that if N points are exactly the same, we could safely work with only one of them and map the result back to all of them, but I would like confirmation from someone who actually understands all the math involved. If that assumption is correct, I can start working on a pull request adding the fix.
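A minimal sketch of that dedupe-and-reinflate idea, assuming numpy; `embed_unique` is a hypothetical helper, and the toy `embed_fn` stands in for UMAP's `fit_transform`:

```python
import numpy as np

def embed_unique(data, embed_fn):
    # Embed only the unique rows, then re-duplicate the results so the
    # output lines up row-for-row with the original array.
    unique_rows, inverse = np.unique(data, axis=0, return_inverse=True)
    embedded = embed_fn(unique_rows)
    return embedded[inverse]

# Usage with a toy "embedding" (keep the first column) in place of
# umap.UMAP(...).fit_transform:
data = np.array([[1.0, 2.0], [1.0, 2.0], [3.0, 4.0]])
out = embed_unique(data, lambda x: x[:, :1])
print(out)  # rows 0 and 1 get identical embeddings
```

Whether duplicates should carry extra weight in the embedding is exactly the open question in this thread, so treat this as a workaround rather than an exact equivalence.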

lmcinnes commented 6 years ago

It is not necessarily ideal. I really need to understand better how duplicate points are resulting in a recursion error to be honest. My understanding is that they should not, but perhaps I have missed a corner case in the implementation somewhere.


madkoppa commented 6 years ago

I'm also having this problem. Adding an uncomfortably large amount of noise as a hack also works for me

lmcinnes commented 6 years ago

I think I really need a way to avoid the trees for "bad" data, although I am not sure that NN-descent will do much better with it. Perhaps I'll try to add a max depth to the trees and just accept whatever they give at that depth. That might be uncomfortable computationally, but such is life.


madkoppa commented 6 years ago

Just an update: it seems that when it complains about recursion depth but does not crash out, the result gets funky. It is apparent for visualisation purposes. I'm trying to track down what the problem vectors are, as the amount of noise I need to add for it to work ruins the end result too.

lmcinnes commented 6 years ago

I would be very interested to know a set of vectors that specifically cause this problem. It's been a difficult issue to debug, or even to understand to be honest. If you can use the messed up embedding to extract out the actual causal vectors that would be a significant step and I would greatly appreciate such information! Thanks for your patience and persistence in working through this.


yizhouyan commented 5 years ago

I encountered the same problem with 17.5M data points.

I used euclidean distance and scaled the data during preprocessing. Here is the code:

X_scaled = preprocessing.scale(X)
fit = umap.UMAP(n_neighbors=20, min_dist=0.01, metric='euclidean', init='random', n_epochs=500)

The same parameter settings work on a 100K-point sample of the data.

Does anyone have any ideas about this?

Thanks!

lmcinnes commented 5 years ago

That's a lot of data points! I would check for duplicate data, as that may cause some issues for the random projection trees. Beyond that I'm not entirely sure -- I've never been able to reliably reproduce the error to properly track down the issue.


dsaxton commented 5 years ago

I was receiving the same error and adding a small amount of noise did help. Thanks!

scharron commented 5 years ago

@lmcinnes I attached a 5000x100 numpy array that always makes UMAP crash. I'm reproducing the error with:

umap = UMAP(metric="cosine", n_components=2, min_dist=0.3).fit_transform(data)

data.npy.zip

lmcinnes commented 5 years ago

Thanks @scharron, reproducing examples are very useful. I'll try to look at this when I get a little more time.

thommiano commented 5 years ago

Experiencing the same issue. Removing duplicates solved my problem, and was fine for my visualization purposes.

lucidyan commented 5 years ago

Same issue here

lmcinnes commented 5 years ago

I think some fixes to other issues may actually resolve this, so I'll try to roll out a patch release in the next few days that will hopefully solve this.

mkarbo commented 5 years ago

Did you roll out your patch yet?

lmcinnes commented 5 years ago

It should have come in the last patch release -- if that didn't fix things, then there may be deeper issues that remain elusive.


jlmelville commented 5 years ago

I have confirmed that @scharron's attached data causes a failure on the latest UMAP with metric="cosine". I don't know if it will reach the recursion limit and give an error as I always terminate the calculation after about 30 minutes of nothing happening.

The culprit seems to be the @numba.njit(fastmath=True) decorator on angular_random_projection_split at: https://github.com/lmcinnes/umap/blob/6a32f62e94f9a14b708781819d694730b3716956/umap/rp_tree.py#L28-L29

Set fastmath=False and everything proceeds as normal (except slower). I have confirmed that this also affects the current master of https://github.com/lmcinnes/pynndescent.

I suppose this could also be affecting the Euclidean and sparse versions of the projection split and just waiting for a dataset to trigger them.

The downside of fastmath=False is that there is a noticeable slowdown of between 65% and 120% for the nearest neighbor search with metric="euclidean" on the datasets I looked at (MNIST-sized and bigger in both size and dimensionality), although most of them were at the lower end of that range. For metric="cosine" the slowdown is less severe, more like 20%, although I haven't tested that very much.

According to the numba docs, it's possible to opt in to a subset of the fast math settings, so that might be worth looking into.

lmcinnes commented 5 years ago

Thanks @jlmelville! I will have to look into the details of what subset of fastmath I can opt in to, and what the actual offending aspect is. Presumably the issue is in the generation of the hyperplane (resulting in a plane that does not split the data at all, which then gets generated repeatedly). Perhaps an alternative would be to have some sort of bail-out failsafe if no points are separated by a given node of the tree, since this does seem to very much be an odd corner case.

jlmelville commented 5 years ago

I looked at trying out some subsets of fastmath, but they always caused the infinite loop.

Looking in more detail at this, the problem seems to arise because of checking float values for 0. There's a lot of it in rp_tree.py, but this line in particular seemed to cause the issue: https://github.com/lmcinnes/umap/blob/6a32f62e94f9a14b708781819d694730b3716956/umap/rp_tree.py#L297

If that changes to if abs(margin) < EPS for some suitable EPS (1e-8?), the tree building succeeds, but I think we would want to change all such comparisons.
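The failure mode of exact-zero float comparisons is easy to demonstrate with toy numbers (not the actual rp_tree computation):

```python
EPS = 1e-8

# A margin whose terms should cancel exactly in real arithmetic,
# but accumulate rounding error in floating point:
margin = 0.1 + 0.2 - 0.3

print(margin == 0)        # False: margin is ~5.6e-17, not 0.0
print(abs(margin) < EPS)  # True: the tolerant comparison catches it
```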

Alternatively, is it the case that make_angular_tree should never create nodes with zero membership, i.e. in:

https://github.com/lmcinnes/umap/blob/6a32f62e94f9a14b708781819d694730b3716956/umap/rp_tree.py#L465-L466

left_node.shape[0] and right_node.shape[0] should always be > 0? The infinite loop seen with @scharron's data is due to one of those nodes having size 0.

You can prevent the loop by bailing out and raising a RecursionError, but that's obviously a drastic step, because I think it causes the entire RP forest routine to abort and you start the nearest neighbor descent with a random initialization. Would it be OK to just arbitrarily split the remaining nodes into two equally sized nodes at this point? This would allow the sizes to eventually reach leaf_size and the algorithm to proceed even when there are more than leaf_size exact duplicates in the data set, although I am not yet familiar enough with the implementation details to know whether that will cause problems down the line. I think it would just affect the neighbor list for those points as it gets passed into the nearest neighbor descent routine. If so, that would definitely be a better initialization strategy than the RecursionError path.
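A sketch of that forced-split fallback with a hypothetical helper (not the actual rp_tree code): when a hyperplane fails to separate anything, randomly partition the indices in half so the recursion still makes progress:

```python
import numpy as np

def fallback_split(indices, rng):
    # Called when a hyperplane put every point on one side: shuffle and
    # cut in half so both children are non-empty and node sizes shrink.
    shuffled = rng.permutation(indices)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rng = np.random.default_rng(0)
left, right = fallback_split(np.arange(10), rng)
print(len(left), len(right))  # 5 5
```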

I was hoping to try out some of these solutions and submit a PR through pynndescent first, but I am getting failing unit tests on that project at the moment. @lmcinnes, I am happy to work on a PR here if you have a sense of which (or both) of these strategies to pursue.

lmcinnes commented 5 years ago

Yes, the tree should never create nodes of size zero, so that is certainly part of the problem. I guess the question is where the best place to catch this is. The margin == 0 line you identified as a source of the issue is potentially the right place. Here is my suspicion as to what is happening:

The data contains a number of points that, while distinct and non-zero, are all scalar multiples of each other. Thus they all lie on a line through the origin, and an angular hyperplane cannot possibly separate them. Ideally the margin == 0 case was supposed to catch this: if points were identically aligned with the hyperplane (as would be the case for such points, since the generated hyperplane would be identically zero if the node contained only such points), they would produce a zero margin and we could randomly assign them one way or the other. I suspect fastmath is producing enough rounding error that, despite dot products with the all-zero vector, all the margins fall slightly to one side or the other of zero, and the end result is infinite recursion.
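That degenerate case can be reproduced in a few lines (a sketch, assuming the angular split normalizes points and builds the hyperplane from the difference of two of them):

```python
import numpy as np

# Points that are scalar multiples of one another: after normalization
# they are identical, so a hyperplane built from the difference of two
# normalized points is the zero vector and every margin is exactly zero.
base = np.array([1.0, 2.0, 3.0])
points = np.vstack([k * base for k in (1.0, 2.0, 4.0)])
unit = points / np.linalg.norm(points, axis=1, keepdims=True)

hyperplane = unit[0] - unit[1]  # all zeros
margins = unit @ hyperplane     # all exactly zero without fastmath
print(hyperplane, margins)
```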

Now, there are a few ways to fix this: we can fix the margin comparison, as you have done to handle issues with floats better (and make the similar required changes to the other splitting routines); we can go after the root of the problem (a bad hyperplane; all zeros for the hyperplane vector in all the cases I believe), and if that occurs then just automate the random splitting; finally we could take the other option you suggest and catch splits that result in zero points in one side and insert a forced random partitioning into two nodes there.

The first way should work well enough, and seems mostly clean aside from the catch that we'll very occasionally mis-assign points that are very close to the hyperplane margin. The second approach gets to the heart of the actual issue, but still requires actually detecting an all zero hyperplane (up to float tolerances). The third option is probably the most robust in that if there are other reasons why the split can go astray (and there may be) it will still catch them. It is a more awkward fix to implement however.

To be honest I'm happiest with option 1 (fix the margin comparison). It will get the job done with the least amount of fuss. I'll see what the issues with pynndescent are -- the tests are all a bit new, so it could well be errors in the testing more than anything.

radames commented 5 years ago

The latest fixes merged on master seem to fix the problem in my tests on ~100k points (n_neighbors=20, metric='cosine', min_dist=0.4, init='random', verbose=2)!

lmcinnes commented 5 years ago

That's great news. I'll try to put out a release soon.

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/lmcinnes/umap/issues/99#issuecomment-487348407, or mute the thread https://github.com/notifications/unsubscribe-auth/AC3IUBLZUESA4KABDSMVPKTPSU6OFANCNFSM4FNPHUAQ .