lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Running UMAP (?) on 10m-100m sized datasets? #125

Open snakers4 opened 6 years ago

snakers4 commented 6 years ago

Hi again, everyone! I'd like to ask for some advice, probably UMAP-related, possibly more abstract.

I have had some success using this combination of tools in my work:

All of this is fine and dandy, but from my experiments it seems that with datasets around 10m points I start having trouble with the above scheme.

Obviously, I can try various tricks (feeding HDBSCAN from PCA directly, reducing dimensions more aggressively, e.g. 300 => 20 => 10, sub-sampling the data and then just applying .transform, buying more :ram:, etc.).

But maybe someone knows of approaches for producing high-quality embeddings in a parallelized / mini-batch fashion? If so, could you point me in the right direction? I know that people train matrix factorization models on GPUs; maybe there is something similar here?

Or maybe I am just missing something with my setup?

I understand that, in the long run, the correct approach is

Many thanks!

snakers4 commented 6 years ago

So far I have had the following results:

  • I could successfully run UMAP on a 3m * 30 dataset to reduce it to 2 dimensions;
  • After loading the UMAP object, I tried to apply .transform to 10m items => it failed with a non-tractable error;
  • I tried applying .transform in 10 smaller chunks (1m each), but that also failed: the script ran for ~7-8 hours, and in the end the process looked unresponsive.

lmcinnes commented 6 years ago

It should be noted that .transform is not going to be more efficient than just running on the full dataset -- it is meant to allow you to introduce new data, not for scaling. I would also note that you probably don't need the PCA step; I suspect UMAP will not be that much slower on the full-dimensional space.

The ultimate limiting factors here are time and RAM. There is some work on a distributed version of UMAP, but it is very preliminary, so don't expect anything soon; that might eventually fix the RAM issue. There is also some work on a GPU implementation of UMAP, but again, that is very preliminary; that might fix the time issue. I'm not sure you can easily fix both at once. Scaling to 10m data points is just flat-out difficult without some beefy hardware and a lot of time.

snakers4 commented 6 years ago

Many thanks for your reply.

Clarifications

It should be noted that .transform is not going to be more efficient than just running on the full dataset -- it is meant to allow you to introduce new data, not for scaling.

Many thanks for clarifying this.

The ultimate limiting factors here are time and RAM

Well, we have 64 GB of RAM and 16 3 GHz cores; maybe going to 128 GB is an option. If it's not a secret, do you use similarly sized machines to test UMAP yourself?

I would also note that you probably don't need the PCA step; I suspect UMAP will not be that much slower on the full-dimensional space

Strangely enough, when I just tried doing it without PCA, it did not work. Of course, I started by embedding much smaller datasets (50k, 300k points). I do not really remember the reason: it was either an out-of-memory error, or I just could not wait until the graph construction finished, or some cryptic error like `killed` or `segmentation fault`.

Update: on 3m points, using the whole embedding (300 dims) is bottlenecked by memory.
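For scale, a rough back-of-the-envelope estimate of the raw matrix alone (this is just an illustrative calculation, assuming float64 storage; the KNN structures UMAP builds come on top of this):

```python
# Back-of-the-envelope: memory for the raw data matrix alone (float64 = 8 bytes).
n_points, n_dims = 3_000_000, 300
raw_bytes = n_points * n_dims * 8
print(f"{raw_bytes / 1024**3:.1f} GiB")  # ~6.7 GiB, before any KNN structures
```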

Running UMAP

Some additional results from running UMAP:

Discussion

I googled a bit, and I think I may have found a hacky workaround. @lmcinnes, I would be really happy if you could share your opinion on this. What follows is not particularly beautiful, but it may work in practice.

It seems that if you have an "unlimited" supply of GPUs, something like this can be arranged:

All in all, I am considering to try applying something like this:

  1. Create some naive clusters (e.g. run naive K-Means with a large number of clusters) and / or use the techniques described in the papers (i.e. random projection trees). Essentially, this step is required to ease the burden on the next step;
  2. Build a KNN graph by brute-forcing the distance calculations, batching them on GPUs. PyTorch makes this really easy - only proper batching of the data is required;
  3. Apply LargeVis to this graph;

Do you think such an approach might work?
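The brute-force KNN step (2) could be sketched like this on CPU with numpy. This is only an illustrative sketch (the `batched_knn` helper, batch size, and the squared-distance expansion are my choices, not from the thread); on GPU the same row-batching pattern would apply with torch:

```python
import numpy as np

def batched_knn(X, k, batch_size=1024):
    """Brute-force k-NN, batching query rows so the full n x n
    distance matrix never has to fit in memory at once."""
    n = X.shape[0]
    sq_norms = (X ** 2).sum(axis=1)
    all_idx = np.empty((n, k), dtype=np.int64)
    all_dist = np.empty((n, k), dtype=X.dtype)
    for start in range(0, n, batch_size):
        batch = X[start:start + batch_size]
        # squared euclidean distances for this batch of rows only
        d2 = sq_norms[start:start + batch_size, None] - 2 * batch @ X.T + sq_norms[None, :]
        np.maximum(d2, 0, out=d2)  # clip tiny negatives from rounding
        # pick k candidates cheaply, then sort just those k by distance
        idx = np.argpartition(d2, k - 1, axis=1)[:, :k]
        rows = np.arange(idx.shape[0])[:, None]
        order = np.argsort(d2[rows, idx], axis=1)
        idx = idx[rows, order]
        all_idx[start:start + batch_size] = idx
        all_dist[start:start + batch_size] = np.sqrt(d2[rows, idx])
    return all_idx, all_dist
```

Each point's nearest neighbor is itself (distance 0), which matches the shape of the knn_indices / knn_dists arrays used later in the thread. Memory stays at O(batch_size * n) instead of O(n^2).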

snakers4 commented 6 years ago

I also did some back-of-the-envelope calculations, which seem promising so far:

lmcinnes commented 6 years ago

For KNN I would skip trying to brute force it and just use FAISS IVF which has a GPU implementation.

I would also note that the optimize_layout implementation in UMAP is probably more efficient than that of LargeVis, so working from that for a GPU implementation may be more sensible. Alternatively you can just call it directly with the resulting KNN graph. Keep in mind, by the way, that you'll want to make use of UMAP functions to appropriately weight the KNN graph.
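For reference, the weighting in question turns each KNN distance into a fuzzy membership strength; in umap-learn this is handled inside fuzzy_simplicial_set (via its smooth-kNN distance and membership-strength steps). A simplified numpy sketch of the idea, not the library's exact code, with the per-point sigma normalizers passed in rather than solved for:

```python
import numpy as np

def membership_strengths(knn_dists, sigmas, rhos):
    """Toy version of UMAP's edge weighting: neighbors closer than rho
    (the distance to the nearest neighbor) get weight 1.0, and farther
    neighbors decay exponentially, scaled by the per-point sigma."""
    d = knn_dists - rhos[:, None]   # shift by distance to nearest neighbor
    np.maximum(d, 0.0, out=d)       # anything within rho gets weight exp(0) = 1
    return np.exp(-d / sigmas[:, None])
```

The resulting edge weights are floats in (0, 1], not binary; a weighted graph whose data is literally only 0s and 1s is a sign that this step went wrong upstream.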

Note that part of what makes this SGD difficult is that the ground is moving under your feet so to speak -- the target changes as you change the parameters.

snakers4 commented 6 years ago

For KNN I would skip trying to brute force it and just use FAISS IVF which has a GPU implementation.

Indeed, they have KNN search + multi-GPU as an option. Many thanks for pointing this out.

snakers4 commented 6 years ago

@lmcinnes

So far, I have managed to do the following:

```python
# knn_indices is a knn graph from faiss (n_samples, n_neighbors)
# knn_dists is knn distances from faiss (n_samples, n_neighbors)

graph = fuzzy_simplicial_set(
    X=bc_vectors_sample,
    n_neighbors=100,
    random_state=np.random.RandomState(seed=42),
    metric='euclidean',
    metric_kwds={},
    knn_indices=knn_indices,
    knn_dists=knn_dists,
    angular=False,
    set_op_mix_ratio=1.0,
    local_connectivity=1.0,
    verbose=True,
)

graph = graph.tocoo()
graph.sum_duplicates()
n_vertices = graph.shape[1]

n_epochs = 200

graph.data[graph.data < (graph.data.max() / float(n_epochs))] = 0.0
graph.eliminate_zeros()
epochs_per_sample = make_epochs_per_sample(graph.data, n_epochs)
```

It turns out that `graph.data` contains only ones and zeroes, so `epochs_per_sample` explodes and its max value is `inf`. Does this mean that I have to pre-process the KNN graph somehow before feeding it in like this, or am I making some error?

**As for tests:**
- Creating a spectral embedding seems feasible only for datasets below 1-2m points;
- For larger datasets, memory starts to become an issue (I doubt that library users will realistically have more than 128 GB of RAM);
- As for faiss, my results were the following:
  - A plain index does not fit on one conventional GPU;
  - An approximate index that is supported on the GPU takes ~500 MB of GPU RAM (17% of GPU RAM is reserved) => I guess it can be scaled much further;
  - The KNN graph for ~7-10m points is calculated in ~2 hours tops;
  - I guess it can easily be scaled up to 100m points this way;
lmcinnes commented 6 years ago

The graph should not be only binary ones and zeros, so something has gone astray here somewhere. I don't see anything obviously wrong in what you are doing, so it is hard to know exactly where things went astray. I would check the content of the graph immediately coming out of fuzzy_simplicial_set.

snakers4 commented 6 years ago

coming out of fuzzy_simplicial_set

It is also ones and zeros. Interestingly, I compared your KNN graph and the faiss KNN graph - they seem almost identical. I guess the devil lies somewhere in the data types (I believe I saved my KNN graph indices as floats). I will double-check and report back.