Closed GoogleCodeExporter closed 9 years ago
The ESA indexes you have downloaded are VectorIndexes: for each word they store
the full vector over the document space. When you want to compute the
similarity between two words, the similarity measure just retrieves the vectors
stored for those words and compares them. You need a VectorIndexReader
to load them, and you can no longer change the vectors (e.g. normalize them or
apply some weighting scheme).
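To illustrate the lookup-then-compare idea, here is a minimal, self-contained sketch that does not use the DKPro classes. The class name `PrecomputedVectorSketch` and its methods are hypothetical, and cosine is only assumed as the comparison function; which comparison VectorComparator actually applies has not been checked here.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a precomputed vector index: word -> (document id -> weight). */
final class PrecomputedVectorSketch {
    private final Map<String, Map<Integer, Double>> index = new HashMap<>();

    /** Stores a fixed, already weighted vector for a word. */
    void put(String word, Map<Integer, Double> vector) {
        index.put(word, vector);
    }

    /** Cosine of the two stored vectors; 0.0 if a word is unknown or has a zero vector. */
    double similarity(String w1, String w2) {
        Map<Integer, Double> a = index.get(w1);
        Map<Integer, Double> b = index.get(w2);
        if (a == null || b == null) {
            return 0.0;
        }
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double norms = norm(a) * norm(b);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    private static double norm(Map<Integer, Double> v) {
        double sumSquares = 0.0;
        for (double x : v.values()) {
            sumSquares += x * x;
        }
        return Math.sqrt(sumSquares);
    }
}
```

Nothing is recomputed at query time in this sketch: the vectors are used exactly as stored, which is why they cannot be re-weighted afterwards.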
A LuceneVectorReader is used to create the above vectors from a Lucene index
built from a document collection. When a similarity measure wants to get a
vector for a word, it queries Lucene for the term frequencies of that word in
each document. This is more flexible, as we can apply all sorts of weighting and
normalization, but it is also much, much slower.
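As a rough picture of that flexibility, the sketch below builds a vector from raw per-document term frequencies with a pluggable weighting function and then L2-normalizes it. It is plain Java and does not call Lucene; `OnTheFlyVectorSketch` and `buildVector` are hypothetical names, and the term frequencies are assumed to have been fetched already.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical illustration of weighting and normalizing at query time. */
final class OnTheFlyVectorSketch {

    /** Applies the weighting to each raw term frequency, then L2-normalizes the result. */
    static double[] buildVector(int[] termFrequencies, DoubleUnaryOperator weighting) {
        double[] v = new double[termFrequencies.length];
        double sumSquares = 0.0;
        for (int i = 0; i < v.length; i++) {
            v[i] = weighting.applyAsDouble(termFrequencies[i]);
            sumSquares += v[i] * v[i];
        }
        double norm = Math.sqrt(sumSquares);
        if (norm > 0.0) {
            for (int i = 0; i < v.length; i++) {
                v[i] /= norm;
            }
        }
        return v;
    }
}
```

For example, `buildVector(tfs, tf -> tf)` keeps raw counts, while `buildVector(tfs, tf -> tf > 0 ? 1 + Math.log(tf) : 0)` applies a logarithmic weighting. The extra work per lookup is where the slowdown comes from.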
I have also added a similar text to the wiki page in order to avoid future
confusion.
Original comment by torsten....@gmail.com
on 18 Oct 2013 at 2:01
In order to document that here:
If you want to create an ESA measure from the downloaded files, use:
TextSimilarityMeasure measure = new VectorComparator(new CachingVectorReader(
new VectorIndexReader(new File(modelLocation)),
Integer.parseInt(cacheSize)));
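How the cache size might affect loading and memory can be pictured with a generic least-recently-used cache. This is only a sketch under assumptions: the eviction policy of CachingVectorReader has not been checked here, and `LruVectorCacheSketch`, `getVector` and `loadCount` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical LRU cache in front of a slow vector loader. */
final class LruVectorCacheSketch {
    private final Function<String, double[]> loader;
    private final Map<String, double[]> cache;
    private int loads = 0;

    LruVectorCacheSketch(Function<String, double[]> loader, final int cacheSize) {
        this.loader = loader;
        // accessOrder = true makes iteration order follow recency of access,
        // so the eldest entry is the least recently used one.
        this.cache = new LinkedHashMap<String, double[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, double[]> eldest) {
                return size() > cacheSize;
            }
        };
    }

    /** Returns the cached vector, or loads it and caches it on a miss. */
    double[] getVector(String word) {
        double[] v = cache.get(word);
        if (v == null) {
            v = loader.apply(word);
            loads++;
            cache.put(word, v);
        }
        return v;
    }

    /** Number of times the underlying loader was actually called. */
    int loadCount() {
        return loads;
    }
}
```

In a cache like this, at most cacheSize vectors are held in memory at once, so the footprint grows roughly with cacheSize times the vector length, while a larger cache saves repeated loads for frequently compared words.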
Original comment by torsten....@gmail.com
on 18 Oct 2013 at 2:03
Thanks, it worked perfectly with all three dumps: WordNet, Wikipedia and
Wiktionary.
Just one comment, though: some more Javadoc would have helped, for example on
choosing the right cacheSize and its impact on how the vectors get
loaded (if any). I am currently experimenting with different cache sizes and
profiling my application to check its memory footprint.
Thanks again.
Original comment by harsh.11...@gmail.com
on 18 Oct 2013 at 4:05
Good to hear that it works now.
It would be nice if you could report your findings on the cache size here. I am
not sure we ever really tested that consistently.
Original comment by torsten....@gmail.com
on 18 Oct 2013 at 4:18
Original issue reported on code.google.com by
harsh.11...@gmail.com
on 18 Oct 2013 at 1:00