h2oai / deepwater

Deep Learning in H2O using Native GPU Backends
Apache License 2.0

Documentation Word2Vec embedding and CNN on H2O R #57

Closed dwy904 closed 6 years ago

dwy904 commented 6 years ago

Hi, I am wondering if it's possible to provide any documentation or example code about using word2vec and a CNN for text classification with the H2O Deep Water R version?

By the way, are there any detailed tutorials or examples about how to use Deep Water in R?

mstensmo commented 6 years ago

Currently there's no way to use text inputs for Deepwater. H2O-3 has a word2vec implementation though, see http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/word2vec.html
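
For reference, building a word2vec model there looks roughly like this in R; the file name, column name, and parameter values below are made up for illustration (the regex split follows the H2O word2vec demo):

```r
library(h2o)
h2o.init()

texts <- h2o.importFile("reviews.csv")           # hypothetical file with a string column "review"
words <- h2o.tokenize(texts$review, "\\\\W+")    # one token per row; an NA row ends each document
w2v   <- h2o.word2vec(words, vec_size = 100, epochs = 5)
```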

dwy904 commented 6 years ago

Thank you for that.

I know that H2O has the capability to build a word2vec model. I am wondering if it's possible to use the output of the h2o.word2vec model (word vectors for each document, not averaged) directly as the input to the Deep Water CNN (LeNet).

Since there aren't many tutorials on how to apply the Deep Water CNN in R, I am wondering if there is any sample code for this combination (h2o.word2vec + h2o.deepwater).

dwy904 commented 6 years ago

Any update on this?

mdymczyk commented 6 years ago

@dwy904 we don't have a tutorial specifically for that. What you'd have to do is build a Word2Vec model first, then transform your words to a vector space using it. This will give you an H2O Frame which you can then pass as input to h2o.deepwater.
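
Continuing the tokenize/word2vec sketch above (the texts, words, and w2v objects), the combination might look roughly like this; averaging the word vectors per document is the simplest way to get a fixed-size input, and the label column and h2o.deepwater settings here are only illustrative:

```r
# One averaged vector per document: the simplest fixed-size representation.
doc.vecs <- h2o.transform(w2v, words, aggregate_method = "AVERAGE")

# Attach the label column (here a hypothetical "sentiment" column from texts).
train <- h2o.cbind(doc.vecs, texts["sentiment"])

model <- h2o.deepwater(x = colnames(doc.vecs), y = "sentiment",
                       training_frame = train,
                       hidden = c(200, 200), epochs = 10)
```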

dwy904 commented 6 years ago

Can h2o.deepwater read the word2vec feature matrix?

The format for each document is an n (number of words) by m (feature vector size) matrix, where n depends on the document size.

After I use the h2o.transform function, it merges all these matrices into one H2O frame, and each document is separated by a row filled with NA values.

How do I format my response variable to match this layout?

mdymczyk commented 6 years ago

@dwy904 what exactly is your input, and what exactly do you want to feed into LeNet?

NAs should appear in the transform output only if the original data frame passed to word2vec contains NAs (that's how we separate documents).

If you don't want to do averaging, you might not need to separate those documents; it all depends on what you want to feed into your LeNet. Do you want to use single word vectors or several word vectors (which together make up one document)? Remember that with LeNet the input size of your vector has to be constant, so if you want to use a whole "document" as the network input, every document needs to contain the same number of word vectors. For that you'd have to do a bit of preprocessing: create a new H2O frame where each row is the concatenation of the N rows belonging to one document.
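
A rough sketch of that preprocessing, assuming the w2v model and tokenized words frame from the earlier sketch; it pulls the word vectors into local R memory, so it only works for modest corpora, and max.words is an assumed per-document word limit:

```r
# Per-word vectors, with NA rows marking document boundaries.
word.vecs <- as.data.frame(h2o.transform(w2v, words, aggregate_method = "NONE"))

max.words <- 50                          # assumed per-document word limit
vec.size  <- ncol(word.vecs)             # word2vec vector size (m)

# Split the rows into one block per document using the NA separator rows.
doc.id <- cumsum(is.na(word.vecs[[1]])) + 1
keep   <- !is.na(word.vecs[[1]])
docs   <- split(word.vecs[keep, , drop = FALSE], doc.id[keep])

# Truncate long documents and zero-pad short ones to exactly max.words words.
pad.doc <- function(d) {
  d <- head(as.matrix(d), max.words)
  rbind(d, matrix(0, max.words - nrow(d), vec.size))
}

# One row per document, each of length max.words * vec.size.
doc.matrix <- t(vapply(docs, function(d) as.vector(t(pad.doc(d))),
                       numeric(max.words * vec.size)))
fixed <- as.h2o(as.data.frame(doc.matrix))   # back to an H2OFrame for h2o.deepwater
```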

dwy904 commented 6 years ago

Got it, that is my concern. I was wondering if there is a way to feed the word2vec output (several word vectors for each document, not averaged) into LeNet directly, without any preprocessing (padding, etc.).

My original text corpus is extremely large; if I convert every document to a matrix of the same size, I will not have enough memory.

Does h2o.deepwater provide any option to run the preprocessing step on each training batch instead of the entire dataset?

mdymczyk commented 6 years ago

@dwy904 this would be a good feature request, but currently we cannot do that, unfortunately. There is a workaround, though it's a bit annoying imho, and you'd have to know the maximum number of words in a document up front.

You'd have to take our LeNet code (https://github.com/h2oai/deepwater/blob/master/tensorflow/src/main/resources/deepwater/models/lenet.py) and pad the input (https://github.com/h2oai/deepwater/blob/master/tensorflow/src/main/resources/deepwater/models/lenet.py#L18) to your desired size using https://www.tensorflow.org/api_docs/python/tf/pad.

Then you'd have to save it and load it when calling Deep Water, as in https://github.com/h2oai/h2o-3/blob/master/examples/deeplearning/notebooks/deeplearning_tensorflow_cat_dog_mouse_lenet.ipynb (see the Custom model section).
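
Once the padded graph has been exported (the linked notebook saves it as a .meta file), loading it from R might look roughly like this; this is only a sketch, the parameter names mirror the Python H2ODeepWaterEstimator, and the frame, file name, and shape are assumptions:

```r
library(h2o)

# Assumes "train" is an H2OFrame of flattened, fixed-width document vectors plus a
# "label" column, and that the padded LeNet graph was exported to "my_lenet.meta"
# as in the linked notebook (both names are hypothetical).
predictors <- setdiff(colnames(train), "label")

model <- h2o.deepwater(x = predictors, y = "label",
                       training_frame = train,
                       backend = "tensorflow",
                       network = "user",
                       network_definition_file = "my_lenet.meta",
                       image_shape = c(100, 50),   # e.g. vec_size x max.words after padding
                       channels = 1)
```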

dwy904 commented 6 years ago

Thank you so much, that helps a lot. Could you forward this to the feature request section?

By the way, for the links you provided above, is there an R version?

mdymczyk commented 6 years ago

@dwy904 we don't really take feature requests at the moment; the project is currently on hiatus while we figure out next steps.

There's no R version. All those files are simply TensorFlow network definitions, and TensorFlow doesn't have an R API.

wwfwwf commented 5 years ago

Can someone share sample input data for H2O word2vec with me? I have tried all kinds of input formats with Chinese characters, but word2vec doesn't run correctly and returns the following error: java.lang.NegativeArraySizeException.

wwfwwf commented 5 years ago

By the way, does H2O word2vec support Chinese characters?

dwy904 commented 5 years ago

I can share my example code with you, but I have to find it first. It has been a year.

mdymczyk commented 5 years ago

@wwfwwf @dwy904 Word2Vec is implemented in H2O-3, so you should ask at https://github.com/h2oai/h2o-3/ (or preferably on their Gitter). Deepwater is not developed anymore.