h2oai / h2o-tutorials

Tutorials and training material for the H2O Machine Learning Platform
http://h2o.ai

Is there a preferred method of saving and loading h2o word2vec models in python? #144

Closed geoffkip closed 3 years ago

geoffkip commented 3 years ago

I have trained a word2vec model with the h2o Python package. Is there a simple way to save that model and load it back later for use?

I have tried the h2o.save_model() and h2o.load_model() functions with no luck. With that approach I get an error like this:


water.exceptions.H2OIllegalArgumentException
[1] "water.exceptions.H2OIllegalArgumentException: Illegal argument: dir of function: importModel:
I am using the same version of h2o to train the model and to load it back in, so the issue outlined in this question does not apply: "Can't import binay h2o model with h2o.loadModel() function: 412 Precondition Failed".

Does anyone have any insight into how to save and load an h2o word2vec model?
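For reference, this is a minimal sketch of the round trip I would expect to work. It assumes that h2o.save_model() returns the full path of the file it wrote, and that h2o.load_model() needs that full file path (not just the directory), resolved on the machine where the H2O cluster runs. The helper name round_trip_model and its h2o_module parameter are hypothetical; the module is passed in only so the sketch can be read and exercised without a running cluster.

```python
def round_trip_model(h2o_module, model, directory):
    """Save a model, then load it back from the exact path that was returned.

    h2o_module is expected to be the imported `h2o` package (assumption:
    save_model(model=..., path=..., force=...) returns the saved file path).
    """
    # Keep the returned path: it includes the model file name, not only the directory.
    saved_path = h2o_module.save_model(model=model, path=directory, force=True)

    # Load from the full file path rather than from the directory alone.
    reloaded = h2o_module.load_model(saved_path)
    return saved_path, reloaded
```

With a live cluster this would be called as round_trip_model(h2o, w2v_model, '/mnt/results'), where '/mnt/results' must be a directory the H2O server itself can read and write.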

I realize that, more than saving the model itself, what matters is saving the word vector embeddings so they can be used later as a pre-trained model.

Is doing something like this best practice?


import h2o
from h2o.estimators import H2OWord2vecEstimator

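# Initialize h2o before working with any H2OFrame
# (h2o_ip, h2o_port and h2o_min_memory are defined elsewhere)
print('Initializing h2o.')
h2o.init(ip=h2o_ip, port=h2o_port, min_mem_size=h2o_min_memory)

# df is an existing H2OFrame; make sure the 'text' column is a string column
df['text'] = df['text'].ascharacter()

# Break text into a sequence of words (tokenize is a helper defined elsewhere)
words = tokenize(df["text"])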

# Build word2vec model:
w2v_model = H2OWord2vecEstimator(sent_sample_rate = 0.0, epochs = 10)
w2v_model.train(training_frame=words)

# Create word vector embedding H2O frame
w2v_frame = w2v_model.to_frame()

#Export word embeddings to file for later use
h2o.export_file(w2v_frame,'/mnt/results/words_embeddings.csv',force=True)

# Import word embeddings later for pretrained model 
w2v_frame = h2o.import_file('/mnt/results/words_embeddings.csv')

#Define pretrained word2vec model
w2v_model2 = H2OWord2vecEstimator(pre_trained = w2v_frame, vec_size = 100)

# Train on words
w2v_model2.train(training_frame=words)
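Before feeding the exported file back in as a pre-trained model, it may be worth checking it outside of h2o. The sketch below uses only the Python standard library and assumes the CSV has one header row, the word in the first column, and one numeric column per embedding dimension. The column names in the sample (Word, V1, V2, V3) and the helper names read_embeddings and has_uniform_size are assumptions for illustration, not something confirmed by the h2o docs.

```python
import csv
import io


def read_embeddings(handle):
    """Parse an embeddings CSV into (header, {word: [float, ...]})."""
    reader = csv.reader(handle)
    header = next(reader)
    embeddings = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        # First column is the word, the remaining columns are the vector.
        embeddings[row[0]] = [float(value) for value in row[1:]]
    return header, embeddings


def has_uniform_size(embeddings, vec_size):
    """Return True if every vector has exactly vec_size components."""
    return all(len(vector) == vec_size for vector in embeddings.values())


# Tiny in-memory stand-in for the exported file (hypothetical contents).
sample = "Word,V1,V2,V3\nhello,0.1,0.2,0.3\nworld,-1.0,0.0,2.5\n"
header, embeddings = read_embeddings(io.StringIO(sample))
print(header, len(embeddings), has_uniform_size(embeddings, 3))
```

For the real file, open it with open('/mnt/results/words_embeddings.csv', newline='') and pass the handle to read_embeddings, then compare against the vec_size used when defining the pre-trained model (100 in the code above).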