lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

High memory usage when pynndescent is not installed #379

Open gclen opened 4 years ago

gclen commented 4 years ago

Running UMAP on a small dataset (20 newsgroups) ran my machine out of memory (56 GB of RAM). However, once I installed pynndescent, the issue went away. I had installed UMAP via

pip install umap-learn --pre

and the code to reproduce it is

import umap

# Used to get the data and build the sparse word-document count matrix
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

dataset = fetch_20newsgroups(subset='all', shuffle=True, random_state=42)

vectorizer = CountVectorizer(min_df=5, stop_words='english')
word_doc_matrix = vectorizer.fit_transform(dataset.data)

embedding = umap.UMAP(n_components=2, metric='hellinger').fit(word_doc_matrix)

lmcinnes commented 4 years ago

Sadly, you need the low_memory=True option. All this will be fixed properly in 0.5, when pynndescent becomes an explicit dependency. In the meantime, the FAQ should have some notes on this: https://umap-learn.readthedocs.io/en/latest/faq.html#i-ran-out-of-memory-help but I would welcome suggestions for better, more obvious places to document it. It is largely an issue with sparse data and angular metrics.
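
For reference, a minimal sketch of the workaround described above, applied to the reproduction code from this thread. The helper names (build_umap_kwargs, pynndescent_available, fit_low_memory) are hypothetical and exist only for illustration. The one library-specific assumption is that the installed umap-learn release accepts low_memory as a constructor argument.

```python
import importlib.util


def build_umap_kwargs(metric="hellinger", n_components=2, low_memory=True):
    """Collect the UMAP arguments from the reproduction, plus low_memory.

    Hypothetical helper: it only builds a dict and does not import umap.
    """
    return {
        "n_components": n_components,
        "metric": metric,
        "low_memory": low_memory,
    }


def pynndescent_available():
    """Return True if pynndescent can be imported in this environment."""
    return importlib.util.find_spec("pynndescent") is not None


def fit_low_memory(word_doc_matrix, **overrides):
    """Fit UMAP on a (sparse) matrix with low_memory=True unless overridden.

    Assumes umap-learn is installed; the import is deferred so the helpers
    above can be used without it.
    """
    import umap

    return umap.UMAP(**build_umap_kwargs(**overrides)).fit(word_doc_matrix)
```

With word_doc_matrix built as in the original report, fit_low_memory(word_doc_matrix) would stand in for the final line of the reproduction. Calling pynndescent_available() beforehand is one way to check whether the environment matches the configuration in which the reporter saw the problem go away.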