Closed c0der1337 closed 9 years ago
Hello!
Hmm, that's pretty odd.
The Word2VecModel.fromThrift method just loads the vocabulary straight from thrift. The SearcherImpl constructor then loads that vocabulary into the map, assuming unique values.
How did you export the data into the thrift json format?
Perhaps in the Word2VecModel constructor, you can also copy the vocab into a HashSet and see if there are duplicates there?
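A minimal sketch of that HashSet check, assuming the vocabulary is available as a plain `List<String>` (as `model.vocab` appears to be). The class and method names here are hypothetical and not part of the library:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of com.medallia.word2vec.
class VocabDuplicateCheck {

    /** Returns each word that occurs more than once in vocab, in order of its first repeat. */
    static List<String> findDuplicates(List<String> vocab) {
        Set<String> seen = new HashSet<>();
        Set<String> reported = new HashSet<>();
        List<String> duplicates = new ArrayList<>();
        for (String word : vocab) {
            // Set.add returns false when the element is already present,
            // so a false result from seen.add means this word is a repeat.
            if (!seen.add(word) && reported.add(word)) {
                duplicates.add(word);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Illustrative data only; a real check would pass the loaded vocabulary.
        List<String> vocab = Arrays.asList("for", "or", "p", "for", "or");
        System.out.println(findDuplicates(vocab));
    }
}
```

If this returns an empty list for the loaded vocabulary, the strings are unique at that point and the collision must be introduced later; if not, the duplicates are already present in the exported file or created while parsing it.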
Cheers, Andrew
I found the problem; it was my mistake. Sorry.
Hey guys,
I exported the word embeddings from polyglot (https://sites.google.com/site/rmyeid/projects/polyglot) in the thrift/JSON format in order to load them into this model.
I got an error: "Multiple entries with same key". I checked the file and no word appears in it more than once, so I printed out the "duplicate words":
model.vocab.size: 1000
++++ for i: 20| h: 195: for
++++ or i: 43| h: 910: or
++++ p i: 83| h: 203: p
++++ de i: 165| h: 304: de
++++ for i: 195| h: 195: for
++++ p i: 203| h: 203: p
++++ de i: 304| h: 304: de
++++ or i: 910| h: 910: or
Exception in thread "main" java.lang.IllegalArgumentException: Multiple entries with same key: for=[D@1813ed0e and for=[D@44303e7b
at com.google.common.collect.ImmutableMap.checkNoConflict(ImmutableMap.java:150)
at com.google.common.collect.RegularImmutableMap.checkNoConflictInBucket(RegularImmutableMap.java:104)
at com.google.common.collect.RegularImmutableMap.<init>(RegularImmutableMap.java:70)
at com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:254)
at com.medallia.word2vec.SearcherImpl.<init>(SearcherImpl.java:39)
at com.medallia.word2vec.Word2VecModel.forSearch(Word2VecModel.java:39)
at com.medallia.word2vec.Word2VecExamples.loadModel(Word2VecExamples.java:93)
at com.medallia.word2vec.Word2VecExamples.main(Word2VecExamples.java:35)
Next I looked in the thrift file and found that the entries at these indices are similar but distinct words. The program turns them into the same word after loading them. Here are two examples:
In [75]: words[20]
Out[75]: u'for'
In [76]: words[195]
Out[76]: u'four'
In [77]: words[43]
Out[77]: u'or'
In [78]: words[910]
Out[78]: u'our'
I don't think this behavior is correct. Can you tell me where I can fix it? I can't find the corresponding lines of code... Thanks in advance!
Here is the corresponding part of the code, where I added my printouts:
SearcherImpl(Word2VecModel model) {
    this.model = model;
    ImmutableMap.Builder<String, double[]> result = ImmutableMap.builder();
    System.out.println("model.vocab.size: " + model.vocab.size());
    for (int i = 0; i < model.vocab.size(); i++) {
        double[] m = Arrays.copyOfRange(model.vectors, i * model.layerSize, (i + 1) * model.layerSize);
        normalize(m);
        result.put(model.vocab.get(i), m);
        int count = 0;
        for (int h = 0; h < model.vocab.size(); h++) {
            if (model.vocab.get(i).equals(model.vocab.get(h))) {
                count = count + 1;
            }
            if (count == 2) {