lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Crash when running with larger dataset #416

Open DicksonK opened 4 years ago

DicksonK commented 4 years ago

Hello,

I am having trouble running UMAP on a large dataset (0.75M rows, 75 columns) with version 0.4.1. After running for around 10-15 minutes, the Python session just crashes.

I was able to run exactly the same thing on a bigger dataset (2.5M rows, 75 columns) with 0.3.10.

Cheers,

lmcinnes commented 4 years ago

That sounds troubling, but I can't say too much without a little more information. Presumably the whole thing is segfaulting somewhere inside numba's workload. Are you using any different metrics, or is this with the euclidean metric? There have been some reports of possible issues in mahalanobis and/or correlation.

DicksonK commented 4 years ago

I was using all the default parameters, so the metric is euclidean.

Here is the verbose output before it crashes:

UMAP(a=None, angular_rp_forest=False, b=None,
     force_approximation_algorithm=False, init='spectral', learning_rate=1.0,
     local_connectivity=1.0, low_memory=False, metric='euclidean',
     metric_kwds=None, min_dist=0.1, n_components=2, n_epochs=None,
     n_neighbors=15, negative_sample_rate=5, output_metric='euclidean',
     output_metric_kwds=None, random_state=None, repulsion_strength=1.0,
     set_op_mix_ratio=1.0, spread=1.0, target_metric='categorical',
     target_metric_kwds=None, target_n_neighbors=-1, target_weight=0.5,
     transform_queue_size=4.0, transform_seed=42, unique=False, verbose=True)
Construct fuzzy simplicial set
Wed Apr 29 02:30:21 2020 Finding Nearest Neighbors
Wed Apr 29 02:30:21 2020 Building RP forest with 48 trees
Wed Apr 29 02:32:40 2020 NN descent for 20 iterations

lmcinnes commented 4 years ago

Any chance you could try installing pynndescent and see if that makes any difference?
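
One way to confirm which of these packages the active environment actually has is to query the installed distribution metadata. This is a minimal sketch using only the standard library (Python 3.8+); the helper name `installed_versions` is made up for illustration and is not part of umap or pynndescent.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


# Distribution names as published on PyPI.
report = installed_versions(["umap-learn", "pynndescent", "numba"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

If `pynndescent` is reported as not installed, `pip install pynndescent` adds it to the environment.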

DicksonK commented 4 years ago

It works after installing pynndescent (tested on 2.5M x 75). Performance is also much better.

0.3.10 (w/o pynndescent)

CPU times: user 2h 24min 10s, sys: 49.7 s, total: 2h 25min
Wall time: 2h 22min 37s

0.4.1 (w/ pynndescent)

CPU times: user 3h 29min 4s, sys: 1min 17s, total: 3h 30min 21s
Wall time: 21min 30s

UMAP(a=None, angular_rp_forest=False, b=None,
     force_approximation_algorithm=False, init='spectral', learning_rate=1.0,
     local_connectivity=1.0, low_memory=False, metric='euclidean',
     metric_kwds=None, min_dist=0.1, n_components=2, n_epochs=None,
     n_neighbors=15, negative_sample_rate=5, output_metric='euclidean',
     output_metric_kwds=None, random_state=None, repulsion_strength=1.0,
     set_op_mix_ratio=1.0, spread=1.0, target_metric='categorical',
     target_metric_kwds=None, target_n_neighbors=-1, target_weight=0.5,
     transform_queue_size=4.0, transform_seed=42, unique=False, verbose=True)
Construct fuzzy simplicial set
Wed Apr 29 14:01:29 2020 Finding Nearest Neighbors
Wed Apr 29 14:01:29 2020 Building RP forest with 82 trees
Wed Apr 29 14:02:53 2020 NN descent for 21 iterations
     0  /  21
     1  /  21
     2  /  21
     3  /  21
     4  /  21
     5  /  21
     6  /  21
/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pynndescent/pynndescent_.py:1155: RuntimeWarning: invalid value encountered in sqrt
  self._distance_correction(self._neighbor_graph[1]),
Wed Apr 29 14:09:49 2020 Finished Nearest Neighbor Search
Wed Apr 29 14:10:15 2020 Construct embedding
    completed  0  /  200 epochs
    completed  20  /  200 epochs
    completed  40  /  200 epochs
    completed  60  /  200 epochs
    completed  80  /  200 epochs
    completed  100  /  200 epochs
    completed  120  /  200 epochs
    completed  140  /  200 epochs
    completed  160  /  200 epochs
    completed  180  /  200 epochs
Wed Apr 29 14:22:55 2020 Finished embedding

lmcinnes commented 4 years ago

I'm afraid this may have to suffice as a workaround for now -- I'll try to figure out what the issue might be, but it will likely be hard to track down, so it will take some time.

AlexMRuch commented 4 years ago

I wonder if this is the same issue I'm having in #430. I'll try the pynndescent resolution too.

AlexMRuch commented 4 years ago

Installed pynndescent=0.3.3 and my pipeline still failed at exactly the same place as before, ughhhhh :-( I'll return to posting my updates on #430.

AlexMRuch commented 4 years ago

So pynndescent=0.4.7 worked for me when installed in an env with just:

- ipykernel
- seaborn
- pandas
- numba
- hdbscan
- umap-learn
- pynndescent

So not sure exactly what's going on. The env I have now has numba 0.46.0. Either way, it's going now and it's going fast 😄🎉

lmcinnes commented 4 years ago

I'm glad it is working. The crash is very puzzling. I am seeing some crash issues with a new metric I am implementing in pynndescent (it won't be the cause of your issues) that are very hard to track down, but are, possibly, stemming from a similar root cause. I'll let you know if I manage to find something reproducible on my end that might solve the problem more permanently for you.

AlexMRuch commented 4 years ago

Glad to hear that I was able to confirm your suggestion worked at least as a temp fix for others! Thank you for this amazing library and all your hard work!

VolkerBergen commented 4 years ago

Installing pynndescent solved it for us as well. Would it be worth adding it to the requirements, or even making it a hard dependency?

eafpres commented 4 years ago

I had this problem and found it was solved by scaling the data using sklearn StandardScaler.
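
For anyone wanting to try the same thing, the scaling step can be sketched as below. This assumes scikit-learn and NumPy are installed; the helper name `standardize` and the synthetic data are made up for illustration, and whether scaling avoids the crash on a given dataset is not something this sketch can promise.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def standardize(X):
    """Scale each column to zero mean and unit variance."""
    X = np.asarray(X, dtype=np.float64)
    return StandardScaler().fit_transform(X)


# Synthetic stand-in data: three columns on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=[1.0, 100.0, 1e4], size=(1000, 3))
X_scaled = standardize(X)

# With umap-learn installed, the scaled array is what gets embedded:
#   import umap
#   embedding = umap.UMAP(verbose=True).fit_transform(X_scaled)
```

After scaling, every column contributes on a comparable scale to the euclidean distances that the default metric uses.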

seniordatascientist commented 1 year ago

I have the same issue: a silent exit. In my case, UMAP with the cosine metric fails if I use a robust scaler, but works if I use a minmax or standard scaler.

python 3.8.12

umap_model = umap.UMAP(n_components=6, verbose=True, metric='cosine', low_memory=True)
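
One thing that may be worth checking when the cosine metric is involved: cosine distance is undefined for an all-zero vector, and a scaler that centers the data can turn some rows into exact zeros. The sketch below only counts such rows; it is an assumption about what to look for, not a diagnosis of the crash reported here, and the helper name `count_zero_rows` is made up for illustration.

```python
import numpy as np


def count_zero_rows(X, tol=0.0):
    """Return how many rows of X have a Euclidean norm <= tol."""
    X = np.asarray(X, dtype=np.float64)
    norms = np.linalg.norm(X, axis=1)
    return int(np.sum(norms <= tol))


# Tiny example: the middle row is all zeros after (hypothetical) scaling.
X = np.array([[1.0, 2.0], [0.0, 0.0], [3.0, -1.0]])
print(count_zero_rows(X))
```

Running this on the scaled array before calling `fit_transform` shows whether any rows would have an undefined cosine distance.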