snakers4 opened this issue 6 years ago
So far I had the following results:

- I could successfully run UMAP on a 3m * 30 data set to reduce it to 2 dimensions;
- After loading the UMAP object I tried to apply `.transform` to 10m items => failed with some non-tractable error;
- Tried applying `.transform` in 10 smaller chunks (1m) - but also failed - ran the script for ~7-8 hours, and in the end it looked like the process was not responding.

It should be noted that `.transform`
is not going to be more efficient
than just running on the full dataset -- it is meant to allow you to
introduce new data, not for scaling. I would also note that you probably
don't need the PCA step. I suspect UMAP will not be that much slower on the
full dimensional space. The ultimate limiting factors here are time and
RAM. There is some work on a distributed version of UMAP, but it is very
preliminary, so don't expect anything soon; that might fix the RAM issue
eventually. There is also some work on a GPU implementation of UMAP, but
again, that is very preliminary; that might fix the time issue. I'm not
sure you can easily fix both at once. Scaling to 10m data points is just
flat out difficult without some beefy hardware and a lot of time.
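To make the memory trade-off concrete, here is a minimal sketch of chunked transforming (the helper `transform_in_chunks` and the stand-in reducer are hypothetical; with umap-learn installed a fitted `umap.UMAP` instance would be passed instead). Note that chunking only bounds peak memory of each `.transform` call, not the total cost:

```python
import numpy as np

def transform_in_chunks(reducer, X, chunk_size=1_000_000):
    """Embed X chunk by chunk; this bounds peak memory, not total runtime."""
    return np.vstack([
        reducer.transform(X[i:i + chunk_size])
        for i in range(0, len(X), chunk_size)
    ])

# Tiny stand-in so the sketch runs on its own; a fitted scikit-learn-style
# reducer (e.g. umap.UMAP) would be passed in practice.
class IdentityReducer:
    def transform(self, X):
        return np.asarray(X, dtype=np.float32)

out = transform_in_chunks(IdentityReducer(), np.zeros((2_500, 8)), chunk_size=1_000)
print(out.shape)  # (2500, 8)
```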
Many thanks for your reply.
> It should be noted that `.transform` is not going to be more efficient than just running on the full dataset -- it is meant to allow you to introduce new data, not for scaling.
Many thanks for clarifying this.
> The ultimate limiting factors here are time and RAM.
Well, we have 64 GB of RAM and 16 3GHz cores; maybe going to 128 GB is an option. If this is not a secret, do you use similarly sized machines to test UMAP yourself?
> I would also note that you probably don't need the PCA step. I suspect UMAP will not be that much slower on the full dimensional space.
Strangely enough, when I just tried doing it without PCA, it did not work. Of course, I started by embedding much smaller datasets (50k, 300k points). I do not really remember the reason - it was either an out-of-memory error, or I just could not wait until the graph construction finished, or some cryptic error like `killed` or `segmentation fault`.
**Update:** on 3m points, using the whole embedding (300 dims) is bottlenecked by memory.
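A back-of-the-envelope check of the raw array size alone (my own arithmetic, assuming float32 storage) shows why the full 300-dim space is heavier than it first looks:

```python
# 3m points x 300 dims in float32 -- just the input array, before any of
# UMAP's intermediate structures (KNN graph, sparse membership matrix, ...)
n, d = 3_000_000, 300
gib = n * d * 4 / 2**30
print(f"{gib:.1f} GiB")  # ~3.4 GiB for the raw data alone
```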
Some additional results from running UMAP:

- `spectral` embedding produces an `out of memory` error; `random` seems to work with 3m points on my hardware.

I googled a bit, and I think I may have found a hacky way around. @lmcinnes I would be really happy if you could provide your opinion on this topic. The below is not particularly beautiful, but it may work in practice.
It seems that if you have an "unlimited" supply of GPUs, something like this can be arranged:
All in all, I am considering trying something like this:
Do you think such an approach may work?
Also did some back-of-the-envelope calculations, which seem promising so far:

- N * 10k * 10k calculations to approximate the KNN graph, which will be doable if ~10k operations can be done in one go on one GPU (usually 1 batch takes < 1s). Then with multiple GPUs even brute-forcing will take a small amount of time (i.e. hours).

For KNN I would skip trying to brute force it and just use FAISS IVF, which has a GPU implementation.
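One way to sanity-check that estimate (my own reading of the numbers above, with an assumed ~1 s per 10k x 10k distance batch):

```python
# Brute-force KNN over 10m points in 10k-point blocks
n, block = 10_000_000, 10_000
n_blocks = n // block              # 1,000 query/database blocks
batches = n_blocks * n_blocks      # 1,000,000 block-vs-block batches
gpu_days = batches * 1.0 / 86_400  # at ~1 s per batch on a single GPU
print(n_blocks, batches, round(gpu_days, 1))
```

At an assumed 1 s per batch, single-GPU brute force comes out on the order of GPU-days rather than hours, which makes the approximate FAISS IVF route attractive.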
I would also note that the `optimize_layout` implementation in UMAP is probably more efficient than that of LargeVis, so working from that for a GPU implementation may be more sensible. Alternatively you can just call it directly with the resulting KNN graph. Keep in mind, by the way, that you'll want to make use of UMAP functions to appropriately weight the KNN graph.
Note that part of what makes this SGD difficult is that the ground is moving under your feet so to speak -- the target changes as you change the parameters.
> For KNN I would skip trying to brute force it and just use FAISS IVF which has a GPU implementation.
Indeed, they have KNN search + multi-GPU as an option. Many thanks for pointing this out.
@lmcinnes
So far, I managed to do the following:

- Build a KNN graph using faiss - here is a gist https://gist.github.com/snakers4/9e834c97e3f689aeb68d31b16f93be32 that I assembled - maybe someone will find it useful;
- Feed this KNN graph to get these objects (the most interesting part in terms of porting it into PyTorch is obviously how you do the weighting):

```python
# knn_indices is the knn graph from faiss (n_samples, n_neighbors)
# knn_dists is the knn distances from faiss (n_samples, n_neighbors)
graph = fuzzy_simplicial_set(
    X=bc_vectors_sample,
    n_neighbors=100,
    random_state=np.random.RandomState(seed=42),
    metric='euclidean',
    metric_kwds={},
    knn_indices=knn_indices,
    knn_dists=knn_dists,
    angular=False,
    set_op_mix_ratio=1.0,
    local_connectivity=1.0,
    verbose=True,
)

graph = graph.tocoo()
graph.sum_duplicates()
n_vertices = graph.shape[1]

n_epochs = 200
graph.data[graph.data < (graph.data.max() / float(n_epochs))] = 0.0
graph.eliminate_zeros()
epochs_per_sample = make_epochs_per_sample(graph.data, n_epochs)
```

It turns out that `graph.data` contains only ones and zeroes, and `epochs_per_sample` therefore explodes and the max value is `inf`. Does this mean that I have to pre-process the KNN graph somehow before feeding it like this, or am I making some error?
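For reference, here is the kind of sanity check I mean, on a synthetic stand-in for the sparse graph (with the real pipeline one would inspect the COO `data` coming out of `fuzzy_simplicial_set` instead): healthy membership strengths are continuous values in (0, 1], not just {0, 1}.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Synthetic stand-in for the fuzzy simplicial set's sparse matrix.
rng = np.random.default_rng(0)
n, nnz = 100, 500
vals = rng.uniform(0.01, 1.0, size=nnz)
graph = coo_matrix(
    (vals, (rng.integers(0, n, nnz), rng.integers(0, n, nnz))),
    shape=(n, n),
)

data = graph.tocoo().data
# A healthy graph has many distinct strengths strictly inside (0, 1].
print(data.min() > 0.0, data.max() <= 1.0, len(np.unique(data)) > 2)
```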
**As for tests:**
- Creating a spectral embedding seems feasible only for datasets below 1-2m points;
- For larger datasets, memory starts to become an issue (I doubt that library users will realistically have more than 128 GB of RAM);
- As for faiss, my results were the following:
- A plain index does not fit on one conventional GPU;
- An approximate index that is supported on GPU takes ~500MB of GPU RAM (17% of RAM is reserved) => I guess it can be scaled much further;
- A KNN graph for ~7-10m points is calculated in ~2 hours tops;
- I guess it can easily be scaled up to 100m points this way;
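For what it's worth, a rough storage estimate for such a KNN graph (my own arithmetic, assuming int64 indices and float32 distances per edge) suggests even the 10m-point case fits in RAM:

```python
# 100-NN graph for 10m points: one (index, distance) pair per edge
n, k = 10_000_000, 100
gib = n * k * (8 + 4) / 2**30   # int64 index + float32 distance
print(f"{gib:.1f} GiB")         # well within 64-128 GB of RAM
```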
The graph should not be only binary ones and zeros, so something has gone astray here somewhere. I don't see anything obviously wrong in what you are doing, so it is hard to know exactly where things went astray. I would check the content of the graph immediately coming out of fuzzy_simplicial_set.
> I would check the content of the graph immediately coming out of fuzzy_simplicial_set.

It is also ones and zeros. What is interesting - I compared your KNN graph and the faiss KNN graph, and they seem almost identical. I guess the devil lies somewhere in the data types (I believe I save my graph's knn indices as float). I will check several times and report back.
Hi again, everyone! I would like to ask for some advice, probably UMAP related, probably more abstract. I have had some success using this combination of tools in my work:

- PCA to reduce dimensions to 30-50;
- UMAP to reduce to 10;
- HDBSCAN for clustering.

All of this is fine and dandy, but from my experiments it seems that with datasets around 10m points I start having trouble with the above scheme.
Obviously I can try different tricks (feeding HDBSCAN from PCA directly, reducing dimensions more aggressively e.g. 300 => 20 => 10, sub-sampling data and then just applying `.transform`, buying more :ram:, etc.). But maybe someone knows if there are any approaches for producing high-quality embeddings in a parallelized / mini-batch fashion? If so, can you point me in the right direction? I know that people train matrix factorization machines on the GPU - maybe there is something similar?
Or maybe I am just missing something with my setup?
I understand that, in the long run, the correct approach is
Many thanks!