h2oai / h2o4gpu

Tests for model (un)pickle #591

mdymczyk opened this issue 6 years ago

mdymczyk commented 6 years ago

We should add tests for saving and loading pickled models (something like what we already have for XGBoost) for all algorithms (POGS-based, KMeans, TSVD, PCA) to verify that we can actually save and load all of our models.
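
A minimal sketch of what such a round-trip test could look like, assuming the sklearn-style `h2o4gpu.KMeans` API shown later in this thread (the test name, tolerances, and use of pytest's `tmp_path` fixture are illustrative):

```python
import pickle

import numpy as np
import h2o4gpu


def test_kmeans_pickle_roundtrip(tmp_path):
    # Fit a tiny model, pickle it to disk, load it back, and check that the
    # restored model reproduces the original centroids and predictions.
    X = np.array([[1., 1.], [1., 4.], [1., 0.]])
    model = h2o4gpu.KMeans(n_clusters=2, random_state=1234).fit(X)

    path = tmp_path / "kmeans.pkl"
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

    np.testing.assert_allclose(model.cluster_centers_, restored.cluster_centers_)
    np.testing.assert_array_equal(model.predict(X), restored.predict(X))
```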

pseudotensor commented 6 years ago

We don't have that for GPU KMeans and GPU SVD. XGBoost has a special hook that copies over (in C) any references held in Python; we can do the same.
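
For reference, the Python side of that hook pattern looks roughly like the sketch below: `__getstate__` swaps the opaque C handle for a raw byte payload, and `__setstate__` rebuilds the handle from it. The `_save_raw`/`_load_raw` helpers are hypothetical stand-ins for calls into the C backend, not real h2o4gpu or XGBoost functions.

```python
import pickle


class NativeModel:
    """Sketch of an XGBoost-style pickling hook for a model backed by a C handle."""

    def __init__(self):
        self._handle = object()          # stands in for an opaque C pointer
        self.params = {"n_clusters": 2}  # ordinary picklable Python state

    def _save_raw(self):
        # Hypothetical: ask the C backend to serialize the model to bytes.
        return b"raw-model-bytes"

    def _load_raw(self, raw):
        # Hypothetical: ask the C backend to rebuild a handle from bytes.
        self._handle = object()

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_raw"] = self._save_raw()   # picklable replacement for the handle
        del state["_handle"]               # the pointer itself cannot be pickled
        return state

    def __setstate__(self, state):
        raw = state.pop("_raw")
        self.__dict__.update(state)
        self._load_raw(raw)


restored = pickle.loads(pickle.dumps(NativeModel()))
print(restored.params)
```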

mdymczyk commented 6 years ago

@pseudotensor do we really need that, though? At least for KMeans the only thing we need is the centroids, which we already have in Python as a numpy array, so we could just pickle and unpickle that. After unpickling we can pass it (as we do now) to the C backend, no?
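
In other words, the fitted state reduces to a plain ndarray, which pickles with no C handles involved. A minimal sketch, assuming the sklearn-style `h2o4gpu.KMeans` API from the session below:

```python
import pickle

import numpy as np
import h2o4gpu

X = np.array([[1., 1.], [1., 4.], [1., 0.]])
model = h2o4gpu.KMeans(n_clusters=2, random_state=1234).fit(X)

# The fitted state we care about is just this numpy array of centroids.
centroids = model.cluster_centers_

# It round-trips through pickle like any other ndarray.
restored_centroids = pickle.loads(pickle.dumps(centroids))
np.testing.assert_allclose(centroids, restored_centroids)
```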

mdymczyk commented 6 years ago

For example, it seems to work out of the box for KMeans (verbose logs included to show it's running on GPUs):

>>> import pickle
>>> import h2o4gpu
>>> import numpy as np
>>>
>>> X = np.array([[1.,1.], [1.,4.], [1.,0.]])
>>>
>>> model = h2o4gpu.KMeans(verbose=100, n_clusters=2,random_state=1234).fit(X)

Using GPU KMeans solver with 2 GPUs.

Using h2o4gpu backend.

Using GPU KMeans solver with 2 GPUs.

Detected np.float64 data
2 gpus.
Copying data to device: 1
Copying data to device: 0
Threshold triggered. Terminating early.
  Time fit: 0.00288296 s
Timetransfer: 0.0531921 Timefit: 0.00288296 Timecleanup: 0.00114107
>>> model.cluster_centers_
array([[1., 1.],
       [1., 4.]])
>>>
>>> pickle.dump( model, open( "save.p", "wb" ) )
>>> unpickled_model = pickle.load( open( "save.p", "rb" ) )
>>> unpickled_model.cluster_centers_
array([[1., 1.],
       [1., 4.]])
>>> model.predict(X)

Using GPU KMeans solver with 2 GPUs.

Detected np.float64 data
Detected np.float64 data
2 gpus.
array([1, 0, 0], dtype=int32)
>>> unpickled_model.predict(X)

Using GPU KMeans solver with 2 GPUs.

Detected np.float64 data
Detected np.float64 data
2 gpus.
array([1, 0, 0], dtype=int32)
pseudotensor commented 6 years ago

Yes, it should be easy (or already true) for KMeans, since the only thing fit does is find the centroids.

mdymczyk commented 6 years ago

@pseudotensor yes, I thought it would work out of the box for all our models since we copy all the necessary data from C to Python, but @wenphan noticed that POGS-based models were having problems pickling (this came up as a request from a potential user). From the log it had something to do with CDLL and/or ctypes, so for POGS we may need to do some more work, but hopefully KMeans and SVD are already fine.
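
If the failure really is the ctypes library handle, the usual fix is to drop it in `__getstate__` and re-open the library in `__setstate__`. A sketch under that assumption (the library lookup below is illustrative, not the actual POGS loader):

```python
import ctypes
import ctypes.util
import pickle


class PogsLikeSolver:
    """Sketch of dropping an unpicklable ctypes.CDLL handle during pickling."""

    def __init__(self):
        # Stand-in for the solver's shared library; the real code would load libh2o4gpu.
        self._lib = ctypes.CDLL(ctypes.util.find_library("c"))
        self.coef_ = [0.5, -1.0]   # ordinary picklable fit results

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("_lib", None)    # ctypes handles cannot be pickled
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Re-open the shared library on load instead of restoring the old handle.
        self._lib = ctypes.CDLL(ctypes.util.find_library("c"))


restored = pickle.loads(pickle.dumps(PogsLikeSolver()))
print(restored.coef_)
```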

pseudotensor commented 6 years ago

We should move forward on dropping POGS anyway. I have a gblinear wrapper we can use as a baseline that does lambda search with warm start. We can keep the rest of the CV fold logic but do it in Python instead of C. That's probably easiest.
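
Roughly, that could look like the sketch below: walk a regularization path per CV fold, warm-starting each lambda from the previous booster via the public xgboost API. The data, lambda path, and round counts are illustrative; this is not the actual wrapper.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold

# Illustrative data; the real wrapper would take X, y from the caller.
rng = np.random.RandomState(1234)
X = rng.rand(200, 10)
y = X @ rng.rand(10) + 0.1 * rng.rand(200)

lambdas = [10.0, 1.0, 0.1, 0.01, 0.001]   # regularization path, strongest first
scores = {lam: [] for lam in lambdas}

for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    dtrain = xgb.DMatrix(X[train_idx], label=y[train_idx])
    dvalid = xgb.DMatrix(X[valid_idx], label=y[valid_idx])

    booster = None
    for lam in lambdas:
        params = {"booster": "gblinear", "objective": "reg:squarederror", "lambda": lam}
        # Warm start: continue from the booster fitted at the previous lambda.
        booster = xgb.train(params, dtrain, num_boost_round=50, xgb_model=booster)
        preds = booster.predict(dvalid)
        scores[lam].append(np.mean((preds - y[valid_idx]) ** 2))

best_lambda = min(lambdas, key=lambda lam: np.mean(scores[lam]))
print("best lambda:", best_lambda)
```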

mdymczyk commented 6 years ago

@pseudotensor yes, once @RAMitchell's implementation is stable enough I'm 100% for removing POGS altogether from the codebase.