Open exalate-issue-sync[bot] opened 1 year ago
Ana Rocha commented: Guys, please, please, please solve this… Working around this problem takes a lot of effort, and it makes Word2VecMojoModel.transform0() useless, since misspelled words are not unusual. 😣
Maybe on [this|https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/algos/word2vec/Word2VecMojoModel.java] class, you could just do something like this:
Instead of:
{code:java}
@Override
public float[] transform0(String word, float[] output) {
  float[] vec = _embeddings.get(word);
  if (vec == null)
    return null;
  System.arraycopy(vec, 0, output, 0, output.length);
  return output;
}
{code}
You could try something like this (I did not download the whole project locally, so it’s just a hint…):
{code:java}@Override
public float[] transform0(String word, float[] output) {
  float[] vec = _embeddings.get(word);
  if (vec == null) {
    // Out-of-vocabulary word: return an all-zero vector instead of null.
    // The method returns float[], so zero-fill the caller's buffer
    // (needs import java.util.Arrays).
    Arrays.fill(output, 0f);
    return output;
  }
  System.arraycopy(vec, 0, output, 0, output.length);
  return output;
}{code}
It’s a common concept to have a representation for [Out Of Vocabulary|https://medium.com/@shabeelkandi/handling-out-of-vocabulary-words-in-natural-language-processing-based-on-context-4bbba16214d5] words. Hope you can make this big gain/small effort improvement soon!
Cheers,
Ana Rocha
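Until something like this lands in the model class, the same fallback can be applied on the caller's side. The sketch below is only an illustration: `WordEmbedder`, `fromMap` and `transformOrZeros` are hypothetical names, not part of h2o-genmodel, and the interface merely mirrors the `transform0` signature quoted above. It assumes the model returns null for an unknown word, as the current code does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class OovFallback {
    // Hypothetical stand-in that mirrors the transform0 signature quoted above.
    interface WordEmbedder {
        float[] transform0(String word, float[] output);
    }

    // Returns the word's embedding, or an all-zero vector when the embedder returns null.
    static float[] transformOrZeros(WordEmbedder model, String word, int vecSize) {
        float[] result = model.transform0(word, new float[vecSize]);
        if (result == null) {
            return new float[vecSize]; // Java zero-initializes float arrays
        }
        return result;
    }

    // Toy embedder backed by a map; like the current code, it returns null for unknown words.
    static WordEmbedder fromMap(Map<String, float[]> embeddings) {
        return (word, output) -> {
            float[] vec = embeddings.get(word);
            if (vec == null) {
                return null;
            }
            System.arraycopy(vec, 0, output, 0, output.length);
            return output;
        };
    }

    public static void main(String[] args) {
        Map<String, float[]> embeddings = new HashMap<>();
        embeddings.put("engineer", new float[] {0.1f, 0.2f, 0.3f});
        WordEmbedder model = fromMap(embeddings);

        // Known word: the stored vector is returned.
        System.out.println(Arrays.toString(transformOrZeros(model, "engineer", 3)));
        // Misspelled word: a zero vector is returned instead of null.
        System.out.println(Arrays.toString(transformOrZeros(model, "enginer", 3)));
    }
}
```

A zero vector is only one possible out-of-vocabulary representation; it keeps downstream averaging code from failing on null, but it also makes an unknown word indistinguishable from a word whose embedding really is all zeros.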
JIRA Issue Migration Info
Jira Issue: PUBDEV-5867
Assignee: New H2O Bugs
Reporter: Lauren DiPerna
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A
Original SO question posted [here|https://stackoverflow.com/questions/51954502/h2o-aggregate-method-none-mapping-unknown-words-to-nan-and-not-vector]
We would like to add functionality to Word2Vec's transform aggregate method so that users have a way to distinguish between missing words (words that were not originally in their dictionary before they tokenized their sentences) and NAs that mark the start and end of a record.
Code below shows how NA is used to demarcate the start and end of sentences.
{code}
library(h2o)
h2o.init()

job.titles.path = "https://raw.githubusercontent.com/h2oai/sparkling-water/rel-1.6/examples/smalldata/craigslistJobTitles.csv"

job.titles <- h2o.importFile(job.titles.path, destination_frame = "jobtitles",
                             col.names = c("category", "jobtitle"),
                             col.types = c("Enum", "String"), header = TRUE)

STOP_WORDS = c("ax","i","you","edu","s","t","m","subject","can","lines","re","what",
               "there","all","we","one","the","a","an","of","or","in","for","by","on",
               "but","is","in","a","not","with","as","was","if","they","are","this","and","it","have",
               "from","at","my","be","by","not","that","to","from","com","org","like","likes","so")

tokenize <- function(sentences, stop.words = STOP_WORDS) {
  tokenized <- h2o.tokenize(sentences, "\\W+")

  # convert to lower case
  tokenized.lower <- h2o.tolower(tokenized)
  # remove short words (less than 2 characters)
  tokenized.lengths <- h2o.nchar(tokenized.lower)
  tokenized.filtered <- tokenized.lower[is.na(tokenized.lengths) || tokenized.lengths >= 2,]
  # remove words that contain numbers
  tokenized.words <- tokenized.filtered[h2o.grep("[0-9]", tokenized.filtered, invert = TRUE, output.logical = TRUE),]

  # remove stop words
  tokenized.words[is.na(tokenized.words) || (! tokenized.words %in% STOP_WORDS),]
}

words <- tokenize(job.titles$jobtitle)

w2v.model <- h2o.word2vec(words, sent_sample_rate = 0, epochs = 10)
print(words)

transformed = h2o.transform(w2v.model, words, aggregate_method = "None")
print(transformed)
{code}
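Until the transform itself exposes this distinction, one possible client-side workaround is to compare each input token with its output row. The plain-Java sketch below does not call any h2o API; `TokenRowClassifier`, `Kind` and `classify` are hypothetical names. It assumes the tokens and the per-token vectors have already been pulled into arrays, that a sentence delimiter is a null (NA) token, and that both delimiters and unknown words come back as NaN rows.

```java
import java.util.ArrayList;
import java.util.List;

class TokenRowClassifier {
    enum Kind { WORD, OUT_OF_VOCABULARY, SENTENCE_DELIMITER }

    // Classifies every row by pairing the input token with its output vector.
    // tokens[i] == null is assumed to be an NA that delimits sentences;
    // a non-null token whose vector starts with NaN is assumed to be an unknown word.
    static List<Kind> classify(String[] tokens, float[][] vectors) {
        if (tokens.length != vectors.length) {
            throw new IllegalArgumentException("tokens and vectors must have the same number of rows");
        }
        List<Kind> kinds = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i] == null) {
                kinds.add(Kind.SENTENCE_DELIMITER);
            } else if (vectors[i].length == 0 || Float.isNaN(vectors[i][0])) {
                kinds.add(Kind.OUT_OF_VOCABULARY);
            } else {
                kinds.add(Kind.WORD);
            }
        }
        return kinds;
    }

    public static void main(String[] args) {
        // Made-up rows: a known word, a misspelled word, and a sentence delimiter.
        String[] tokens = {"senior", "enginer", null};
        float[][] vectors = {
            {0.1f, 0.2f},
            {Float.NaN, Float.NaN},
            {Float.NaN, Float.NaN}
        };
        System.out.println(classify(tokens, vectors));
    }
}
```

This only works while the original token column is still available next to the transformed frame, which is why a marker produced by the transform itself would be more convenient.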