When loading word embeddings via thrift: close words (for and four) were modified to one word (for and for)?! and resulting in an error

c0der1337 commented 9 years ago

Hey guys,

I exported the word embeddings from polyglot (https://sites.google.com/site/rmyeid/projects/polyglot) in the thrift/JSON format in order to load them into this model.

I got an error: "Multiple entries with same key". I checked the file, and no word appears in it more than once. So I printed out the "duplicate words":

model.vocab.size: 1000
++++ for i: 20| h: 195: for
++++ or i: 43| h: 910: or
++++ p i: 83| h: 203: p
++++ de i: 165| h: 304: de
++++ for i: 195| h: 195: for
++++ p i: 203| h: 203: p
++++ de i: 304| h: 304: de
++++ or i: 910| h: 910: or
Exception in thread "main" java.lang.IllegalArgumentException: Multiple entries with same key: for=[D@1813ed0e and for=[D@44303e7b
    at com.google.common.collect.ImmutableMap.checkNoConflict(ImmutableMap.java:150)
    at com.google.common.collect.RegularImmutableMap.checkNoConflictInBucket(RegularImmutableMap.java:104)
    at com.google.common.collect.RegularImmutableMap.<init>(RegularImmutableMap.java:70)
    at com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:254)
    at com.medallia.word2vec.SearcherImpl.<init>(SearcherImpl.java:39)
    at com.medallia.word2vec.Word2VecModel.forSearch(Word2VecModel.java:39)
    at com.medallia.word2vec.Word2VecExamples.loadModel(Word2VecExamples.java:93)
    at com.medallia.word2vec.Word2VecExamples.main(Word2VecExamples.java:35)

Next I looked in the thrift file and found that these are similar but distinct words. The program turns them into the same word after loading them. Here are two examples:

In [75]: words[20]
Out[75]: u'for'

In [76]: words[195]
Out[76]: u'four'

In [77]: words[43]
Out[77]: u'or'

In [78]: words[910]
Out[78]: u'our'

I don't think this behavior is correct. Can you tell me where I can fix it? I can't find the corresponding lines of code... Thanks in advance!

Here is the corresponding part of the code, where I added my printouts:

SearcherImpl(Word2VecModel model) {
    this.model = model;
    ImmutableMap.Builder<String, double[]> result = ImmutableMap.builder();
    System.out.println("model.vocab.size: " + model.vocab.size());
    for (int i = 0; i < model.vocab.size(); i++) {
        double[] m = Arrays.copyOfRange(model.vectors, i * model.layerSize, (i + 1) * model.layerSize);
        normalize(m);
        result.put(model.vocab.get(i), m);
        // Debug: print whenever a second occurrence of the word at index i is found at index h
        int count = 0;
        for (int h = 0; h < model.vocab.size(); h++) {
            if (model.vocab.get(i).equals(model.vocab.get(h))) {
                count = count + 1;
            }
            if (count == 2) {
                System.out.println("++++ " + model.vocab.get(i) + " i: " + i + "| h: " + h + ": " + model.vocab.get(h));
                count = 1;
            }
        }
    }
    normalized = result.build();
}

wko27 commented 9 years ago

Hello!

Hmm, that's pretty odd.

The Word2VecModel.fromThrift method just loads the vocabulary straight from thrift. The SearcherImpl constructor then loads that vocabulary into the map, assuming unique values.

How did you export the data into the thrift json format?

Perhaps in the Word2VecModel constructor, you can also copy the vocab into a HashSet and see if there are duplicates there?
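A minimal sketch of that check, assuming the vocabulary is available as a List<String>. The class name VocabDuplicateCheck and the helper findDuplicates are made up for illustration and are not part of Word2VecJava:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Word2VecJava: lists the words that
// appear more than once in a vocabulary.
class VocabDuplicateCheck {

    /** Returns each repeated word once, in the order the repeats are encountered. */
    static List<String> findDuplicates(List<String> vocab) {
        Set<String> seen = new HashSet<>();
        List<String> duplicates = new ArrayList<>();
        for (String word : vocab) {
            // Set.add returns false when the word is already in the set
            if (!seen.add(word) && !duplicates.contains(word)) {
                duplicates.add(word);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Made-up sample data in which "for" and "or" each occur twice
        List<String> vocab = Arrays.asList("for", "or", "four", "for", "our", "or");
        System.out.println(findDuplicates(vocab)); // prints [for, or]
    }
}
```

If this reports duplicates for the vocabulary right after loading, the repeated words were most likely already present in the exported data rather than introduced by SearcherImpl.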

Cheers, Andrew

c0der1337 commented 9 years ago

I found the problem; it was my mistake. Sorry!

medallia / Word2VecJava

When loading word embeddings via thrift: close words (for and four) were modified to one word (for and for)?! and resulting in an error #1