lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Parametric UMAP saves multiple copies of full input data #1118

Open bnelsj opened 6 months ago

bnelsj commented 6 months ago

Parametric UMAP stores multiple copies of the full input data, but these are unnecessary for transforming new data points. By deleting self._raw_data and self._knn_search_index._raw_data from my Parametric UMAP model object, I was able to reduce the size of the saved model from 90 GB to 300 MB (the input data is a distance matrix with 80K locations). This might not work for models that require additional training, but perhaps it should be an option when model size is an issue?
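A minimal sketch of the workaround, in case it helps others. The two attribute names are the ones mentioned above; the helper name drop_raw_data is hypothetical and not part of the umap API, and the demo runs on a plain stand-in object rather than a real Parametric UMAP model, so the sizes it prints only illustrate the effect. As noted, a model stripped this way may no longer support additional training.

```python
import pickle
from types import SimpleNamespace


def drop_raw_data(model):
    """Delete the stored copies of the input data from a fitted model, in place.

    Hypothetical helper (not part of umap). It only touches the two
    attributes named in this issue and returns the names it removed.
    """
    removed = []
    if hasattr(model, "_raw_data"):
        del model._raw_data
        removed.append("_raw_data")
    index = getattr(model, "_knn_search_index", None)
    if index is not None and hasattr(index, "_raw_data"):
        del index._raw_data
        removed.append("_knn_search_index._raw_data")
    return removed


# Stand-in object (assumption: a real model exposes the same attribute layout).
stand_in = SimpleNamespace(
    _raw_data=bytes(1_000_000),
    _knn_search_index=SimpleNamespace(_raw_data=bytes(1_000_000)),
    weights=[0.1, 0.2, 0.3],
)

size_before = len(pickle.dumps(stand_in))
removed_names = drop_raw_data(stand_in)
size_after = len(pickle.dumps(stand_in))

print("removed:", removed_names)
print("pickled size before:", size_before, "bytes; after:", size_after, "bytes")
```

The helper mutates the object in place instead of copying it, since copying a model that already holds tens of gigabytes of input data would defeat the purpose. Call it only on a model you no longer intend to train further.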

timsainb commented 6 months ago

Sounds like a good idea.


bartbroere commented 6 months ago

I'm having the same issue: I want the trained model to be as small as possible, because the inference machine does not have as much memory as the training machine. I'll link a PR where I added a parameter to the save method that removes the raw data.