It should work just like other scikit-learn models, and you can ideally simply pickle and unpickle the estimator. Potentially you may want to use joblib dump and load instead, as it will be a little more robust in dealing with large numpy arrays etc.
On Sat, Dec 8, 2018 at 11:04 PM Jérémie Gauthier notifications@github.com wrote:
I am very curious to know how to save and load a model with UMAP. The package can be used as a new machine learning technique, and I was thinking of training and testing a dataset with UMAP.
How is it possible to save and load a model with UMAP?
Are you up for adding a snippet to show how it works?
How about this?
import numpy as np
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target
import umap
reducer = umap.UMAP()
embedding = reducer.fit_transform(X, y)
print(embedding.shape)
# (1797, 2)
import joblib
filename = 'this_is_a_test.sav'
joblib.dump(reducer, filename)
#time passes
loaded_reducer = joblib.load(filename)
print(type(loaded_reducer))
# <class 'umap.umap_.UMAP'>
# with pickle
import pickle
f_name = 'saving_example.sav'
with open(f_name, 'wb') as f:
    pickle.dump(reducer, f)
# time passes
with open(f_name, 'rb') as f:
    loaded_model = pickle.load(f)
print(type(loaded_model))
# <class 'umap.umap_.UMAP'>
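The loaded object behaves like the original fitted estimator, so (sticking with the digits data from above) you can keep transforming new samples:

# transform a few rows with the reloaded model
new_points = loaded_reducer.transform(X[:5])
print(new_points.shape)
# (5, 2)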
Probably best added to one of the example notebooks.
As always, incredible stuff @lmcinnes
I have already found both options, but thanks for your time @dylanking42 :D
No worries buddy, probably can close this issue then, unless anyone wants to copy-paste it into one of the notebooks that already exist.
I got a >1GB file that even depends on PySpark when I pickled the UMAP instance directly... is there already a list of object properties needed for transformation somewhere?
@herrmann Sorry, I haven't gotten that well documented yet. Right now the best option is to look through the transform method code itself to see what gets used; not ideal, but it is what is available right now.
@lmcinnes The models I made are at the very least 3 GB, with 5 GB being the average. They're trained on 150 features and a 50 GB data set. I am making open source software to help conservation scientists classify birds from audio recordings. It's a web tool that should allow a user to select a model, and with models that big the user experience will be poor.
Is there any way to slim down the model size? I am using joblib with reasonable compression (see the sketch below), and I don't get why the models are that enormous. Why should the model depend on PySpark, @herrmann?
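For reference, this is roughly how I am saving it; joblib's compress argument is what I mean by reasonable compression (the level and file name here are just placeholders):

import joblib

# compress trades save/load time for a smaller file, but the file is still
# dominated by the arrays the fitted UMAP object keeps internally
joblib.dump(reducer, 'umap_model.joblib', compress=3)
loaded = joblib.load('umap_model.joblib')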
@herrmann Have you gotten this to work, i.e. extracting only the slice of the model needed for transform?
@lmcinnes I will be happy to contribute by documenting what's needed or building export functionality. If you have any tips, please shoot - thanks.
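As a rough starting point for that, here is a generic way to see which numpy attributes of the fitted object dominate the pickle (not tied to any particular UMAP version, and it ignores non-array attributes):

import numpy as np

# list the in-memory size of every numpy array stored on the fitted reducer,
# largest first, to see what actually makes the saved file so big
array_sizes = {
    name: value.nbytes
    for name, value in vars(reducer).items()
    if isinstance(value, np.ndarray)
}
for name, nbytes in sorted(array_sizes.items(), key=lambda kv: kv[1], reverse=True):
    print('%s: %.1f MB' % (name, nbytes / 1e6))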
The model is much like a knn classifier model; it needs the original data, and that's a large part of what makes it so big. I don't think there are any immediate solutions for that. Longer term I may eventually have some answers, but that is not something that will be implemented any time soon (the theory still has to be worked out, let alone making it practical). Sadly I would suggest that if you need a slim model you will need something simpler like PCA. Sorry I can't be more help.
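For comparison, a fitted PCA keeps only the learned projection (a components matrix of shape (n_components, n_features) plus a few small vectors), not the training data, so it pickles to a tiny file. A minimal sketch, reusing X from the snippet earlier in the thread:

from sklearn.decomposition import PCA
import joblib

# PCA does not store the training data, only the learned projection,
# so the saved model stays small no matter how large the training set was
pca = PCA(n_components=2).fit(X)
joblib.dump(pca, 'pca_model.joblib')
new_points = joblib.load('pca_model.joblib').transform(X[:5])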
I am able to save the model, but I am getting an error while loading it. I am using scikit-learn (0.20.3).
reducer = joblib.load('reducer_umap_jl_sc.sav')
  File "/home/dshresth/.conda/envs/py27/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 598, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/home/dshresth/.conda/envs/py27/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 526, in _unpickle
    obj = unpickler.load()
  File "/home/dshresth/.conda/envs/py27/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/home/dshresth/.conda/envs/py27/lib/python2.7/pickle.py", line 1089, in load_newobj
    obj = cls.__new__(cls, *args)
  File "/home/dshresth/.conda/envs/py27/lib/python2.7/site-packages/funcsigs/__init__.py", line 201, in __new__
    obj._name = kwargs['name']
KeyError: 'name'
Changing the Python version to 3.6 solved it.
This makes it OS and environment dependent.
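One way to at least make that dependence visible is to save the relevant versions next to the model, so a mismatch (like the Python 2.7 vs 3.6 issue above) is easy to spot at load time. A rough sketch, assuming the installed packages expose __version__:

import pickle
import platform
import numpy
import umap

# bundle the fitted model together with the versions it was created under
payload = {
    'model': reducer,
    'python': platform.python_version(),
    'numpy': numpy.__version__,
    'umap': umap.__version__,
}
with open('umap_with_versions.pkl', 'wb') as f:
    pickle.dump(payload, f)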