lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License
7.38k stars 803 forks

Save and load a model with UMAP #178

Closed ghost closed 5 years ago

ghost commented 5 years ago

I am very curious to know how to save and load a model with UMAP. The package can be used as a machine learning technique, and I was thinking of training and testing on a dataset with UMAP.

How is it possible to save and load a model with UMAP?

lmcinnes commented 5 years ago

It should work just like other scikit-learn models: ideally you can simply pickle and unpickle the estimator. Alternatively, you may want to use joblib's dump and load instead, as they are a little more robust when dealing with large numpy arrays etc.

On Sat, Dec 8, 2018 at 11:04 PM Jérémie Gauthier notifications@github.com wrote:

How is it possible to save and load a model with UMAP?

ghost commented 5 years ago

Would you be up for adding a snippet to show how it works?

dylanking42 commented 5 years ago

How about this?

import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

X = digits.data
y = digits.target

import umap

reducer = umap.UMAP()

# Supervised UMAP: passing y lets the labels guide the embedding
embedding = reducer.fit_transform(X, y)
print(embedding.shape)
# (1797, 2)

# with joblib
import joblib

filename = 'this_is_a_test.sav'
joblib.dump(reducer, filename)

# time passes
loaded_reducer = joblib.load(filename)

print(type(loaded_reducer))
# <class 'umap.umap_.UMAP'>

# with pickle
import pickle

f_name = 'saving_example.sav'
with open(f_name, 'wb') as f:
    pickle.dump(reducer, f)

# time passes
with open(f_name, 'rb') as f:
    loaded_model = pickle.load(f)

print(type(loaded_model))
# <class 'umap.umap_.UMAP'>

Probably best added to one of the example notebooks.

As always, incredible stuff @lmcinnes

ghost commented 5 years ago

I had already found both options, but thanks for your time @dylanking42 :D

dylanking42 commented 5 years ago

No worries buddy. We can probably close this issue then, unless anyone wants to copy the snippet into one of the existing example notebooks.

herrmann commented 5 years ago

I got a >1 GB file that even depends on PySpark when I pickled the UMAP instance directly... Is there already a list somewhere of the object attributes needed for transformation?

lmcinnes commented 5 years ago

@herrmann Sorry, I haven't gotten that well documented yet. Right now the best option is to look through the transform method code itself to see which attributes get used; not ideal, but it is what is available right now.
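In the meantime, a generic way to see which attributes dominate the pickle size is to measure each attribute of the fitted estimator separately. This is a sketch, not part of the UMAP API; `DummyModel` is a stand-in so it runs without UMAP installed, but the same helper works on a fitted `umap.UMAP` instance.

```python
# Sketch: report which attributes of a fitted estimator dominate its pickle
# size. DummyModel stands in for a fitted umap.UMAP so this runs anywhere.
import pickle

def attribute_sizes(obj, top=5):
    """Return the `top` largest attributes of obj by pickled size, in bytes."""
    sizes = {name: len(pickle.dumps(value))
             for name, value in vars(obj).items()}
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:top]

class DummyModel:
    def __init__(self):
        self.big_array = list(range(100_000))  # stands in for stored training data
        self.n_neighbors = 15

model = DummyModel()
for name, size in attribute_sizes(model):
    print(f"{name}: {size} bytes")
```

Running this on a real reducer should make it obvious which stored arrays account for most of the file.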

tracek commented 5 years ago

@lmcinnes The models I've made are at the very least 3 GB, with 5 GB being the average. They're built from 150 features and a 50 GB data set. I am making open source software to help conservation scientists classify birds from audio recordings. It's a web tool that should allow a user to select a model, and with models that big the user experience will be poor.

Is there any way to slim down the model size? I am using joblib with reasonable compression, and I don't understand why the models are that enormous. Why should the model depend on PySpark, @herrmann?

@herrmann Have you gotten this to work, i.e. extracting just the slice of the model needed for transform?

@lmcinnes I will be happy to contribute by documenting what's needed or building export functionality. If you have any tips, please shoot - thanks.

lmcinnes commented 5 years ago

The model is much like a knn classifier model: it needs the original data, and that's a large part of what makes it so big. I don't think there are any immediate solutions for that. Longer term I may eventually have some answers, but that is not something that will be implemented any time soon (the theory still has to be worked out, let alone making it practical). Sadly, if you need a slim model I would suggest you will need something simpler like PCA. Sorry I can't be more help.
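As a rough illustration of that difference (a sketch, assuming scikit-learn and NumPy are installed, with made-up toy data): a fitted PCA keeps only its component matrix and a few small arrays, while a knn-style model must carry the training data itself.

```python
# Sketch: compare the pickled size of a fitted PCA with the size of the raw
# training data that a neighbour-based model would have to retain.
import pickle
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 150))  # toy data: 5000 samples, 150 features

pca = PCA(n_components=2).fit(X)

pca_bytes = len(pickle.dumps(pca))  # stores components_, mean_, etc. only
data_bytes = X.nbytes               # what a data-carrying model must keep

print(f"PCA pickle: {pca_bytes:,} bytes")
print(f"raw data:   {data_bytes:,} bytes")
```

The PCA pickle should come out orders of magnitude smaller, because nothing proportional to the number of training samples is stored.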

dewshr commented 5 years ago

I am able to save the model, but I am getting an error while loading it. I am using scikit-learn (0.20.3).

reducer = joblib.load('reducer_umap_jl_sc.sav')
  File "/home/dshresth/.conda/envs/py27/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 598, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/home/dshresth/.conda/envs/py27/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 526, in _unpickle
    obj = unpickler.load()
  File "/home/dshresth/.conda/envs/py27/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/home/dshresth/.conda/envs/py27/lib/python2.7/pickle.py", line 1089, in load_newobj
    obj = cls.__new__(cls, *args)
  File "/home/dshresth/.conda/envs/py27/lib/python2.7/site-packages/funcsigs/__init__.py", line 201, in __new__
    obj._name = kwargs['name']
KeyError: 'name'

dewshr commented 5 years ago


Changing the Python version to 3.6 solved it.
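One way to catch this kind of mismatch earlier (a sketch, not a UMAP feature) is to store the interpreter version alongside the model, so a load in the wrong environment fails with a clear message instead of deep inside unpickling:

```python
# Sketch: bundle version metadata with the pickled model so a Python-version
# mismatch can be reported clearly at load time.
import pickle
import sys

payload = {
    "python_version": sys.version_info[:3],
    "model": {"note": "stand-in for a fitted estimator"},
}

blob = pickle.dumps(payload)

# ... later, possibly in a different environment ...
loaded = pickle.loads(blob)
if loaded["python_version"][0] != sys.version_info[0]:
    raise RuntimeError(
        f"Model was saved under Python {loaded['python_version']}, "
        f"but this is {sys.version_info[:3]}"
    )
model = loaded["model"]
```

The same idea extends to recording the umap-learn and scikit-learn versions used at save time.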

Miladiouss commented 10 months ago


Note that pickling the model like this makes the saved file OS and environment dependent.