lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Running UMAP (?) on 10m-100m sized datasets? #125

Open snakers4 opened 6 years ago

snakers4 commented 6 years ago

Hi again, everyone! I'd like to ask for some advice, probably UMAP-related, possibly more abstract.

I have had some success using this combination of tools in my work:

All of this is fine and dandy, but from my experiments it seems that with datasets around 10m points I start having trouble with the above scheme.

Obviously, I can try various tricks (feeding HDBSCAN from PCA directly, reducing dimensions more aggressively, e.g. 300 => 20 => 10, sub-sampling the data and then just applying .transform, buying more :ram:, etc.).

But maybe someone knows of approaches for producing high-quality embeddings in a parallelized / mini-batch fashion? If so, could you point me in the right direction? I know that people train matrix factorization models on GPUs; maybe there is something similar here?

Or maybe I am just missing something with my setup?

I understand that, in the long run, the correct approach is

Many thanks!

snakers4 commented 6 years ago

So far I have had the following results:

  • I could successfully run UMAP on a 3m * 30 dataset to reduce it to 2 dimensions;
  • After loading the UMAP object, I tried to apply .transform to 10m items => it failed with a non-tractable error;
  • I tried applying .transform in 10 smaller chunks (1m each), but that also failed: the script ran for ~7-8 hours, and in the end the process looked unresponsive.

lmcinnes commented 6 years ago

It should be noted that .transform is not going to be more efficient than just running on the full dataset -- it is meant to allow you to introduce new data, not for scaling. I would also note that you probably don't need the PCA step; I suspect UMAP will not be that much slower on the full-dimensional space.

The ultimate limiting factors here are time and RAM. There is some work on a distributed version of UMAP, but it is very preliminary, so don't expect anything soon; that might eventually fix the RAM issue. There is also some work on a GPU implementation of UMAP, but again, that is very preliminary; that might fix the time issue. I'm not sure you can easily fix both at once. Scaling to 10m data points is just flat-out difficult without some beefy hardware and a lot of time.

snakers4 commented 6 years ago

Many thanks for your reply.

Clarifications

It should be noted that .transform is not going to be more efficient than just running on the full dataset -- it is meant to allow you to introduce new data, not for scaling.

Many thanks for clarifying this.

The ultimate limiting factors here are time and RAM

Well, we have 64 GB of RAM and 16 3 GHz cores; maybe going to 128 GB is an option. If it's not a secret, do you use similarly sized machines to test UMAP yourself?

I would also note that you probably don't need the PCA step; I suspect UMAP will not be that much slower on the full-dimensional space

Strangely enough, when I just tried doing it without PCA, it did not work. Of course, I started by embedding much smaller datasets (50k, 300k points). I do not really remember the reason: it was either an out-of-memory error, or I just could not wait until the graph construction finished, or some cryptic error like `killed` or `segmentation fault`.

Update: on 3m points, using the whole embedding (300 dims) is bottlenecked by memory.
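For scale, a rough back-of-the-envelope estimate of the raw matrix alone (this is just an illustrative calculation, assuming float64 storage; the KNN structures UMAP builds come on top of this):

```python
# Back-of-the-envelope: memory for the raw data matrix alone (float64 = 8 bytes).
n_points, n_dims = 3_000_000, 300
raw_bytes = n_points * n_dims * 8
print(f"{raw_bytes / 1024**3:.1f} GiB")  # ~6.7 GiB, before any KNN structures
```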

Running UMAP

Some additional results from running UMAP:

Discussion

I googled a bit, and I think I may have found a hacky workaround. @lmcinnes, I would be really happy if you could share your opinion on this. What follows is not particularly beautiful, but it may work in practice.

It seems that if you have an "unlimited" supply of GPUs, something like this can be arranged:

All in all, I am considering to try applying something like this:

  1. Create some naive clusters (e.g. run naive K-Means with a large number of clusters) and / or use the techniques described in the papers (i.e. random projection trees). Essentially, this step is required to ease the burden on the next step;
  2. Build a KNN graph by brute-forcing the distance calculations, batching them on GPUs. PyTorch makes this really easy - only proper batching of the data is required;
  3. Apply LargeVis to this graph;

Do you think such an approach might work?
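The brute-force KNN step (2) could be sketched like this on CPU with numpy. This is only an illustrative sketch (the `batched_knn` helper, batch size, and the squared-distance expansion are my choices, not from the thread); on GPU the same row-batching pattern would apply with torch:

```python
import numpy as np

def batched_knn(X, k, batch_size=1024):
    """Brute-force k-NN, batching query rows so the full n x n
    distance matrix never has to fit in memory at once."""
    n = X.shape[0]
    sq_norms = (X ** 2).sum(axis=1)
    all_idx = np.empty((n, k), dtype=np.int64)
    all_dist = np.empty((n, k), dtype=X.dtype)
    for start in range(0, n, batch_size):
        batch = X[start:start + batch_size]
        # squared euclidean distances for this batch of rows only
        d2 = sq_norms[start:start + batch_size, None] - 2 * batch @ X.T + sq_norms[None, :]
        np.maximum(d2, 0, out=d2)  # clip tiny negatives from rounding
        # pick k candidates cheaply, then sort just those k by distance
        idx = np.argpartition(d2, k - 1, axis=1)[:, :k]
        rows = np.arange(idx.shape[0])[:, None]
        order = np.argsort(d2[rows, idx], axis=1)
        idx = idx[rows, order]
        all_idx[start:start + batch_size] = idx
        all_dist[start:start + batch_size] = np.sqrt(d2[rows, idx])
    return all_idx, all_dist
```

Each point's nearest neighbor is itself (distance 0), which matches the shape of the knn_indices / knn_dists arrays used later in the thread. Memory stays at O(batch_size * n) instead of O(n^2).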

snakers4 commented 6 years ago

I also did some back-of-the-envelope calculations, which seem promising so far:

lmcinnes commented 6 years ago

For KNN I would skip trying to brute force it and just use FAISS IVF which has a GPU implementation.

I would also note that the optimize_layout implementation in UMAP is probably more efficient than that of LargeVis, so working from that for a GPU implementation may be more sensible. Alternatively you can just call it directly with the resulting KNN graph. Keep in mind, by the way, that you'll want to make use of UMAP functions to appropriately weight the KNN graph.
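For reference, the weighting in question turns each KNN distance into a fuzzy membership strength; in umap-learn this is handled inside fuzzy_simplicial_set (via its smooth-kNN distance and membership-strength steps). A simplified numpy sketch of the idea, not the library's exact code, with the per-point sigma normalizers passed in rather than solved for:

```python
import numpy as np

def membership_strengths(knn_dists, sigmas, rhos):
    """Toy version of UMAP's edge weighting: neighbors closer than rho
    (the distance to the nearest neighbor) get weight 1.0, and farther
    neighbors decay exponentially, scaled by the per-point sigma."""
    d = knn_dists - rhos[:, None]   # shift by distance to nearest neighbor
    np.maximum(d, 0.0, out=d)       # anything within rho gets weight exp(0) = 1
    return np.exp(-d / sigmas[:, None])
```

The resulting edge weights are floats in (0, 1], not binary; a weighted graph whose data is literally only 0s and 1s is a sign that this step went wrong upstream.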

Note that part of what makes this SGD difficult is that the ground is moving under your feet so to speak -- the target changes as you change the parameters.

snakers4 commented 6 years ago

For KNN I would skip trying to brute force it and just use FAISS IVF which has a GPU implementation.

Indeed, they have KNN search + multi-GPU as an option. Many thanks for pointing this out.

snakers4 commented 6 years ago

@lmcinnes

So far, I have managed to do the following:

```python
# knn_indices is a knn graph from faiss (n_samples, n_neighbors)
# knn_dists is knn distances from faiss (n_samples, n_neighbors)

graph = fuzzy_simplicial_set(
    X=bc_vectors_sample,
    n_neighbors=100,
    random_state=np.random.RandomState(seed=42),
    metric='euclidean',
    metric_kwds={},
    knn_indices=knn_indices,
    knn_dists=knn_dists,
    angular=False,
    set_op_mix_ratio=1.0,
    local_connectivity=1.0,
    verbose=True,
)

graph = graph.tocoo()
graph.sum_duplicates()
n_vertices = graph.shape[1]

n_epochs = 200

graph.data[graph.data < (graph.data.max() / float(n_epochs))] = 0.0
graph.eliminate_zeros()
epochs_per_sample = make_epochs_per_sample(graph.data, n_epochs)
```

It turns out that `graph.data` contains only ones and zeroes, so `epochs_per_sample` explodes and its max value is `inf`. Does this mean that I have to pre-process the KNN graph somehow before feeding it in like this, or am I making some error?

**As for tests:**
- Creating a spectral embedding seems feasible only for datasets below 1-2m points;
- For larger datasets, memory starts to become an issue (I doubt that library users will realistically have more than 128 GB of RAM);
- As for faiss, my results were the following:
  - A plain index does not fit on one conventional GPU;
  - An approximate index that is supported on the GPU takes ~500 MB of GPU RAM (17% of GPU RAM is reserved) => I guess it can be scaled much further;
  - The KNN graph for ~7-10m points is calculated in ~2 hours tops;
  - I guess it can easily be scaled up to 100m points this way;
lmcinnes commented 6 years ago

The graph should not be only binary ones and zeros, so something has gone astray here somewhere. I don't see anything obviously wrong in what you are doing, so it is hard to know exactly where things went astray. I would check the content of the graph immediately coming out of fuzzy_simplicial_set.

snakers4 commented 6 years ago

coming out of fuzzy_simplicial_set

It is also ones and zeros. Interestingly, I compared your KNN graph and the faiss KNN graph - they seem almost identical. I guess the devil lies somewhere in the data types (I believe I saved my KNN graph indices as floats). I will double-check and report back.