lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Random Projection forest initialisation failed #583

Open huidongchen opened 3 years ago

huidongchen commented 3 years ago

Hi,

Since UMAP v5, I have been getting the following warning quite often when dealing with large datasets (>10k points, ~100 features), and the run ends up getting stuck. (I am running it on a server, so I have almost unlimited computing resources.)

./myenv/lib/python3.7/site-packages/pynndescent/rp_trees.py:1005: UserWarning: Random Projection forest initialisation failed due to recursionlimit being reached. Something is a little strange with your graph_data, and this may take longer than normal to compute. "Random Projection forest initialisation failed due to recursion"

Any idea how to solve this issue? I would really appreciate your help. Thanks!

lmcinnes commented 3 years ago

The most likely candidate is a great deal of duplicate records, or records that look essentially identical up to floating-point precision.
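
For readers who want to test that hypothesis, here is a minimal sketch of a duplicate check using only NumPy. The function name `duplicate_report` and the `decimals` rounding threshold are illustrative assumptions, not part of UMAP or pynndescent, and rounding is only a crude proxy for "identical up to floating-point precision" (two nearly equal values that straddle a rounding boundary will not be merged).

```python
import numpy as np


def duplicate_report(X, decimals=6):
    """Count exact and rounded-duplicate rows in a 2-D data array.

    `decimals` is a hypothetical tolerance: rows that agree after
    rounding to this many decimal places are counted as duplicates.
    """
    X = np.asarray(X, dtype=float)
    n_rows = X.shape[0]
    # Number of distinct rows when compared exactly.
    n_exact_unique = np.unique(X, axis=0).shape[0]
    # Number of distinct rows after rounding away tiny differences.
    n_rounded_unique = np.unique(np.round(X, decimals), axis=0).shape[0]
    return {
        "rows": n_rows,
        "exact_duplicates": n_rows - n_exact_unique,
        "duplicates_after_rounding": n_rows - n_rounded_unique,
    }
```

If either duplicate count is a large fraction of `rows`, duplicates are a plausible cause of the warning.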

huidongchen commented 3 years ago

Thanks so much for your quick reply! That’s good to know. If that’s the case, is there a way to get around this problem?

lmcinnes commented 3 years ago

The only thing I can think of is to deduplicate the dataset by whatever means you have. At the very least, checking for duplicates would be a good start. If you have unlimited compute and enough memory, you could also compute the full distance matrix of the data and pass that in with metric="precomputed"; that should at least run.
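
Both suggestions can be sketched roughly as follows. This is an illustration under stated assumptions, not the maintainers' code: the helper names `deduplicate`, `pairwise_euclidean`, and `embed_precomputed` are hypothetical, Euclidean distance is assumed, and the dense distance matrix needs about 8·n² bytes (roughly 800 MB for 10k points). The `umap` import is deferred so that the NumPy helpers work without umap-learn installed.

```python
import numpy as np


def deduplicate(X):
    """Return the unique rows of X and an index mapping each original row to them."""
    unique_rows, inverse = np.unique(X, axis=0, return_inverse=True)
    # ravel() keeps the index 1-D regardless of NumPy version.
    return unique_rows, inverse.ravel()


def pairwise_euclidean(X):
    """Dense n x n Euclidean distance matrix (memory grows as n squared)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)  # clip small negatives caused by round-off
    np.fill_diagonal(d2, 0.0)    # a point is at distance 0 from itself
    return np.sqrt(d2)


def embed_precomputed(X, **umap_kwargs):
    """Embed X via a precomputed distance matrix (requires umap-learn)."""
    import umap  # imported lazily; only needed for this step

    D = pairwise_euclidean(X)
    return umap.UMAP(metric="precomputed", **umap_kwargs).fit_transform(D)
```

A possible workflow is to embed only the unique rows and then expand the result back to the original row order: `U, inverse = deduplicate(X)`, then `embedding = embed_precomputed(U)[inverse]`, so that duplicated records share the same coordinates.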