exalate-issue-sync[bot] commented 1 year ago

Add a Word2Vec example in Scala that replicates what our current R example does (shown below).

The example should include:

How to create or use H2O's tokenizer. Clarify when a user needs to keep separators (NAs) to delimit records. For example, two records "I like NY" and "I like CA" need to be converted to a tokenized Vec like this: "I", "like", "NY", NA, "I", "like", "CA", NA.
How to use the transform function. Clarify that transform(..) takes an H2O Vec as its first parameter, so the vector needs to be extracted from the H2O frame wordsh2oFrame.
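
To make the separator layout concrete, here is a minimal plain-Scala sketch of the tokenized sequence described above. It does not call H2O's tokenizer or any other H2O API: `TokenizeSketch` and `tokenize` are hypothetical names, and `None` merely stands in for the NA separator that closes each record.

```scala
// Plain-Scala illustration only; it does not call any H2O API.
// None stands in for the NA separator that ends each record.
object TokenizeSketch {
  // Split each record on whitespace and append None as the end-of-record marker.
  def tokenize(records: Seq[String]): Seq[Option[String]] =
    records.flatMap { record =>
      record.trim.split("\\s+").filter(_.nonEmpty).map(w => Option(w)).toSeq :+ None
    }

  def main(args: Array[String]): Unit = {
    val tokens = tokenize(Seq("I like NY", "I like CA"))
    // prints: I, like, NY, NA, I, like, CA, NA
    println(tokens.map(_.getOrElse("NA")).mkString(", "))
  }
}
```

In a real example, the flattened sequence would have to be loaded into a single string column of an H2O frame before it can be passed to the model.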

R code

{code:r}
# Build a dummy word2vec model
library(h2o)
h2o.init(nthreads = -1)
data <- as.character(as.h2o(c("a", "b", "a")))
w2v.model <- h2o.word2vec(data, sent_sample_rate = 0, min_word_freq = 0, epochs = 1, vec_size = 2)

# Transform words to vectors without aggregation
sentences <- as.character(as.h2o(c("b", "c", "a", NA, "b")))
h2o.transform(w2v.model, sentences) # -> 5 rows total, 2 rows NA ("c" is not in the vocabulary)

# Transform words to vectors and return average vector for each sentence
h2o.transform(w2v.model, sentences, aggregate_method = "AVERAGE") # -> 2 rows

# Definition of h2o.transform, for reference
h2o.transform <- function(word2vec, words, aggregate_method = c("NONE", "AVERAGE")) {
  # Validate the model and the single-column words frame
  if (!is(word2vec, "H2OModel")) stop("word2vec must be a word2vec model")
  if (missing(words)) stop("words must be specified")
  if (!is.H2OFrame(words)) stop("words must be an H2OFrame")
  if (ncol(words) != 1) stop("words frame must contain a single string column")

  # Default to the first aggregation method ("NONE") when none is chosen
  if (length(aggregate_method) > 1) aggregate_method <- aggregate_method[1]

  res <- .h2o.__remoteSend(method = "GET", "Word2VecTransform", model = word2vec@model_id,
                           words_frame = h2o.getId(words), aggregate_method = aggregate_method)
  key <- res$vectors_frame$name
  h2o.getFrame(key)
}
{code}
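
The row counts noted in the R comments (5 rows with 2 NA rows without aggregation, 2 rows with AVERAGE) can be reproduced with a small plain-Scala sketch. It models only the row layout and does not call H2O. `TransformSketch`, its methods, and the toy embedding values are hypothetical. The averaging rule also rests on an assumption: out-of-vocabulary words are skipped rather than turning the whole sentence into NA.

```scala
// Plain-Scala illustration of the row layout only; it does not call any H2O API.
object TransformSketch {
  type Vec = Seq[Double]

  // NONE: one output row per input row; None (NA) for separators and unknown words.
  def transformNone(emb: Map[String, Vec], words: Seq[Option[String]]): Seq[Option[Vec]] =
    words.map(_.flatMap(w => emb.get(w)))

  // AVERAGE: one output row per NA-delimited sentence.
  // Assumption: unknown words are skipped; a sentence with no known word yields None.
  def transformAverage(emb: Map[String, Vec], words: Seq[Option[String]]): Seq[Option[Vec]] = {
    // Split on None separators; a trailing sentence without a separator still counts.
    val split = words.foldLeft(Vector(Vector.empty[String])) {
      case (acc, Some(w)) => acc.init :+ (acc.last :+ w)
      case (acc, None)    => acc :+ Vector.empty[String]
    }
    val sentences = if (split.last.isEmpty) split.init else split
    sentences.map { sentence =>
      val known = sentence.flatMap(w => emb.get(w))
      if (known.isEmpty) None
      else Some(known.transpose.map(_.sum / known.size))
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical toy embeddings for a vocabulary of "a" and "b".
    val emb: Map[String, Vec] = Map("a" -> Seq(1.0, 0.0), "b" -> Seq(0.0, 1.0))
    val words: Seq[Option[String]] = Seq(Some("b"), Some("c"), Some("a"), None, Some("b"))
    println(transformNone(emb, words).size)    // prints: 5
    println(transformAverage(emb, words).size) // prints: 2
  }
}
```

The actual vector values depend on the trained model; only the number of rows and the position of the NA rows are meant to match the R example.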

exalate-issue-sync[bot] commented 1 year ago

Lauren DiPerna commented: It may be better to update the R example to look like the attached word2vec_example.scala (but update both with a bigger text array, so the results are more interesting).

exalate-issue-sync[bot] commented 1 year ago

Lauren DiPerna commented: It would also be great if we could update this example https://github.com/h2oai/qcon2015/blob/master/03-ask-craig/craigslistJobTitles.script.scala#L153-L188 to use H2O's word2vec instead of Spark's.

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-4561
Assignee: Michal Kurka
Reporter: Lauren DiPerna
State: Open
Fix Version: N/A
Attachments: Available (Count: 1)
Development PRs: N/A

Attachments From Jira

Attachment Name: word2vec_example.scala
Attached By: Lauren DiPerna
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-4561/word2vec_example.scala

h2oai / h2o-3

Add Word2Vec Scala Examples #11444
