lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Reducing Model Size for UMAP on Large Datasets #1097

Closed C-Harlin closed 7 months ago

C-Harlin commented 7 months ago

I am working with a large dataset of approximately 200 million entries. Due to the sheer volume and dimensionality of the data, I am constrained to fitting the model on a very small sample (e.g., 100,000 entries, about 0.05% of the data), and then using Spark to load the previously fitted model (roughly 10 GB in size) with pickle.load and transform all the data.

It is evident that utilizing more data for fitting would yield more accurate results. However, this also produces a larger fitted model, which increases the risk of Out of Memory (OOM) errors during the transformation step in Spark. From my limited understanding, the fitted model stores the embeddings of the training set, which is why the model grows as more training data is used.
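To see where the size actually goes, one can measure the pickled footprint of each attribute on the fitted estimator. This is a generic inspection helper (the function name is mine, not part of umap-learn); on a fitted UMAP model it typically shows the stored training data and graph structures dominating:

```python
import pickle

def attribute_sizes(model, top=10):
    """Approximate the pickled size in bytes of each attribute on a fitted
    estimator, sorted descending, to see what dominates the on-disk model."""
    sizes = {}
    for name, value in vars(model).items():
        try:
            sizes[name] = len(pickle.dumps(value))
        except Exception:
            sizes[name] = -1  # attribute could not be pickled on its own
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:top]
```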

My question pertains to non-parametric UMAP: must the fit results contain the training set embeddings, or is it possible to retain only the parts needed for transformation? I apologize if this question seems naive; despite having read How UMAP Works, I may still lack clarity on the finer details of UMAP.

Or, are there any strategies to maintain a smaller model size post-fitting while not compromising the quality of dimensionality reduction? I would greatly appreciate any insights or suggestions.

lmcinnes commented 7 months ago

I think your best bet might be to use ParametricUMAP, where you train a neural network to produce a UMAP embedding. The final model is then just a TensorFlow model that you can run inference with, and it can be as large or small as the network you define. That may meet your needs.
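A minimal sketch of that approach, assuming the `umap.parametric_umap` module with its default encoder (the data shapes and file name here are placeholders, and the Keras save format may differ by version):

```python
import numpy as np
from umap.parametric_umap import ParametricUMAP

# Placeholder stand-in for the 100k-row training sample.
X_train = np.random.rand(100_000, 64).astype(np.float32)

# Fit a parametric UMAP: the learned mapping lives in a Keras encoder
# network rather than in stored training-set embeddings.
embedder = ParametricUMAP(n_components=2)
embedding = embedder.fit_transform(X_train)

# Persist only the encoder -- its size depends on the architecture you
# chose, not on how much data was used to fit it.
embedder.encoder.save("umap_encoder.keras")

# On the Spark executors, new data can then be mapped without the full
# UMAP object:
#   import tensorflow as tf
#   encoder = tf.keras.models.load_model("umap_encoder.keras")
#   low_dim = encoder.predict(X_new_batch)
```

The trade-off is that the parametric embedding is an approximation learned by the network, but the artifact shipped to executors shrinks from gigabytes to the size of the encoder weights.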

C-Harlin commented 7 months ago

> I think your best bet might be to use ParametricUMAP, where you train a neural network to produce a UMAP embedding. The final model is then just a TensorFlow model that you can run inference with, and it can be as large or small as the network you define. That may meet your needs.

I will explore this approach further. Thanks for your assistance!