h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Update Word2Vec's Transform to Distinguish between NAs for Missing Words and NAs for Demarcation #12719

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Original SO question posted [here|https://stackoverflow.com/questions/51954502/h2o-aggregate-method-none-mapping-unknown-words-to-nan-and-not-vector]

We would like to add functionality to Word2Vec's transform aggregate method so that users can distinguish between NAs that stand for missing words (words that were not in the model's vocabulary when the sentences were tokenized) and NAs that mark the start and end of a record.

The code below shows how NA is used to demarcate the start and end of sentences.

{code}
library(h2o)
h2o.init()

job.titles.path = "https://raw.githubusercontent.com/h2oai/sparkling-water/rel-1.6/examples/smalldata/craigslistJobTitles.csv"

job.titles <- h2o.importFile(job.titles.path, destination_frame = "jobtitles",
                             col.names = c("category", "jobtitle"),
                             col.types = c("Enum", "String"), header = TRUE)

STOP_WORDS = c("ax","i","you","edu","s","t","m","subject","can","lines","re","what",
               "there","all","we","one","the","a","an","of","or","in","for","by","on",
               "but","is","in","a","not","with","as","was","if","they","are","this","and","it","have",
               "from","at","my","be","by","not","that","to","from","com","org","like","likes","so")

tokenize <- function(sentences, stop.words = STOP_WORDS) {
  tokenized <- h2o.tokenize(sentences, "\\W+")

  # convert to lower case
  tokenized.lower <- h2o.tolower(tokenized)

  # remove short words (less than 2 characters)
  tokenized.lengths <- h2o.nchar(tokenized.lower)
  tokenized.filtered <- tokenized.lower[is.na(tokenized.lengths) || tokenized.lengths >= 2,]

  # remove words that contain numbers
  tokenized.words <- tokenized.filtered[h2o.grep("[0-9]", tokenized.filtered, invert = TRUE, output.logical = TRUE),]

  # remove stop words
  tokenized.words[is.na(tokenized.words) || (! tokenized.words %in% STOP_WORDS),]
}

words <- tokenize(job.titles$jobtitle)

w2v.model <- h2o.word2vec(words, sent_sample_rate = 0, epochs = 10)
print(words)

transformed = h2o.transform(w2v.model, words, aggregate_method = "None")

print(transformed)
{code}
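The ambiguity can also be reproduced without H2O. The following is a minimal Java sketch, not H2O's implementation: the class `NaAmbiguityDemo`, its toy vocabulary, and the `lookup` and `countSentences` helpers are hypothetical names introduced only for illustration. It mimics a transform without aggregation, where a sentence delimiter and an unknown word both come back as an empty (null) row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NaAmbiguityDemo {
    // Toy vocabulary (hypothetical); a real model learns this from the training tokens.
    static final Map<String, float[]> EMBEDDINGS = new HashMap<>();
    static {
        EMBEDDINGS.put("software", new float[] {0.1f, 0.2f});
        EMBEDDINGS.put("engineer", new float[] {0.3f, 0.4f});
    }

    // Per-token lookup: null for a delimiter AND null for an unknown word.
    static List<float[]> lookup(List<String> tokens) {
        List<float[]> rows = new ArrayList<>();
        for (String token : tokens) {
            rows.add(token == null ? null : EMBEDDINGS.get(token));
        }
        return rows;
    }

    // Counts sentences by treating every null row as a sentence boundary.
    static int countSentences(List<float[]> rows) {
        int sentences = 0;
        boolean inSentence = false;
        for (float[] row : rows) {
            if (row == null) {
                inSentence = false;
            } else if (!inSentence) {
                sentences++;
                inSentence = true;
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // One sentence with a misspelled word in the middle, closed by a null delimiter.
        List<String> tokens = Arrays.asList("software", "enginer", "engineer", null);
        System.out.println(countSentences(lookup(tokens)));
    }
}
```

In this sketch the single input sentence is counted as two, because the misspelled token "enginer" produces the same empty row as the delimiter. That is the situation this issue asks to make distinguishable.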

exalate-issue-sync[bot] commented 1 year ago

Ana Rocha commented: Guys, please please please solve this… It takes a big effort to work around this problem, and it makes Word2VecMojoModel.transform0() useless, since it’s not unusual to have misspelled words. 😣

Maybe in [this|https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/algos/word2vec/Word2VecMojoModel.java] class, you could just do something like this:

Instead of:

{code:java}
@Override
public float[] transform0(String word, float[] output) {
  float[] vec = _embeddings.get(word);
  if (vec == null)
    return null;
  System.arraycopy(vec, 0, output, 0, output.length);
  return output;
}
{code}

You could try something like this (I did not download the whole project locally, it’s just a hint):

{code:java}
// requires: import java.util.Arrays;
@Override
public float[] transform0(String word, float[] output) {
  float[] vec = _embeddings.get(word);
  if (vec == null) {
    // out-of-vocabulary word: return a zero vector instead of null
    // (the method returns float[], so the zeros are written into output)
    Arrays.fill(output, 0f);
    return output;
  }
  System.arraycopy(vec, 0, output, 0, output.length);
  return output;
}
{code}

It’s a common concept to have a representation for [Out Of Vocabulary|https://medium.com/@shabeelkandi/handling-out-of-vocabulary-words-in-natural-language-processing-based-on-context-4bbba16214d5] words. Hope you can make this big gain/small effort improvement soon!
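One way to picture an out-of-vocabulary representation that stays distinguishable from a sentence delimiter is to return a tag alongside the vector. The following is only a sketch under stated assumptions: `OovAwareLookup`, `Kind`, and `Result` are hypothetical names and are not part of h2o-genmodel, and a null input token is assumed to stand for a sentence boundary.

```java
import java.util.HashMap;
import java.util.Map;

public class OovAwareLookup {
    // Why a row has (or lacks) a learned embedding.
    enum Kind { KNOWN, OUT_OF_VOCABULARY, SENTENCE_BOUNDARY }

    // Result of a lookup: the kind plus a vector (all zeros unless KNOWN).
    static final class Result {
        final Kind kind;
        final float[] vector;

        Result(Kind kind, float[] vector) {
            this.kind = kind;
            this.vector = vector;
        }
    }

    private final Map<String, float[]> embeddings;
    private final int vecSize;

    OovAwareLookup(Map<String, float[]> embeddings, int vecSize) {
        this.embeddings = embeddings;
        this.vecSize = vecSize;
    }

    Result transform(String word) {
        if (word == null) {
            // Assumption: a null token marks the start or end of a sentence.
            return new Result(Kind.SENTENCE_BOUNDARY, new float[vecSize]);
        }
        float[] vec = embeddings.get(word);
        if (vec == null) {
            // Word was never seen in training: zero vector, tagged as OOV.
            return new Result(Kind.OUT_OF_VOCABULARY, new float[vecSize]);
        }
        // Copy so callers cannot modify the stored embedding.
        return new Result(Kind.KNOWN, vec.clone());
    }

    public static void main(String[] args) {
        Map<String, float[]> vocab = new HashMap<>();
        vocab.put("engineer", new float[] {0.3f, 0.4f});
        OovAwareLookup lookup = new OovAwareLookup(vocab, 2);
        for (String token : new String[] {"engineer", "enginer", null}) {
            System.out.println(token + " -> " + lookup.transform(token).kind);
        }
    }
}
```

A bare zero vector carries no marker of its own, so a caller cannot tell it apart from any other all-zero row, and it would still be included if the vectors were averaged. Carrying an explicit tag keeps the three cases separate regardless of the vector values.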

Cheers,

Ana Rocha

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5867
Assignee: New H2O Bugs
Reporter: Lauren DiPerna
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A