Invalid ESA vector index dumps at https://code.google.com/p/dkpro-similarity-asl/downloads/list

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Download the WordNet or Wiktionary vector index dump from
https://code.google.com/p/dkpro-similarity-asl/downloads/list,
or download the Wikipedia vector index dump.

2. Run the following code (any Lucene version from 2.9.1 to 3.5.0):

LuceneVectorReader vSrc = new LuceneVectorReader(new File("<IndexPath>"));
vSrc.setVectorAggregation(VectorAggregation.CENTROID);
vSrc.setWeightingThreshold(0.0f);
vSrc.setVectorLengthThreshold(0.0f);
vSrc.setWeightingModeTf(WeightingModeTf.normal);
vSrc.setWeightingModeIdf(WeightingModeIdf.normal);
vSrc.setNorm(VectorNorm.L2);

VectorComparator cmp = new VectorComparator(vSrc);
cmp.setInnerProduct(InnerVectorProduct.COSINE);
cmp.setNormalization(VectorNorm.NONE);
cmp.getSimilarity("text1", "text2");

3.

What is the expected output? What do you see instead?
Expected output: similarity (double)
Actual output:
java.io.FileNotFoundException: no segments* file found in org.apache.lucene.store.SimpleFSDirectory@E:\evaluation\dedup\dkpro\home\ESA\VectorIndexes\wiktionary\wiktionary_en: files: 00000000.jdb 00000001.jdb 00000002.jdb 00000003.jdb 00000004.jdb 00000005.jdb index.conf je.info.0 je.lck
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:655)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:476)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:314)
    at dkpro.similarity.algorithms.vsm.store.LuceneVectorReader.getReader(LuceneVectorReader.java:217)

What version of the product are you using? On what operating system?
Windows7
pom dependencies: attached

Please provide any additional information below.
The test VectorComparatorLuceneVectorSourceTest (in
dkpro.similarity.algorithms.vsm-asl) passes. It uses the index folder
"src/test/resources/vsm/test_index_token", which does seem to contain a
segments file.

Could you help me find the right dump, or point me to how to fix this
problem?

Original issue reported on code.google.com by harsh.11...@gmail.com on 18 Oct 2013 at 1:00

Attachments:

pom.xml

GoogleCodeExporter commented 9 years ago

The ESA indexes you have downloaded are VectorIndexes: for each word, they
store the full vector over the document space. When you want to compute the
similarity between two words, the similarity measure just retrieves the
vectors stored for those words and compares them. You need a
VectorIndexReader to load them, and you can no longer change the vectors
(e.g. normalize them or apply a weighting scheme).
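
A quick way to tell the two index types apart before choosing a reader is to look at the directory listing. The error message in the report lists .jdb, je.info.0 and je.lck files, which are the files Berkeley DB Java Edition writes, whereas Lucene expects a segments* file. The sketch below only inspects file names; the class and enum names are made up for illustration and are not part of DKPro Similarity.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: guesses the on-disk format of an index directory from its file names. */
class IndexKindCheck {

    enum IndexKind { LUCENE, BERKELEY_DB_JE, UNKNOWN }

    static IndexKind detect(Path dir) throws IOException {
        boolean hasSegments = false;
        boolean hasJdb = false;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                if (name.startsWith("segments")) {
                    hasSegments = true;   // Lucene writes segments_N / segments.gen
                }
                if (name.endsWith(".jdb")) {
                    hasJdb = true;        // Berkeley DB JE log files
                }
            }
        }
        if (hasSegments) {
            return IndexKind.LUCENE;
        }
        if (hasJdb) {
            return IndexKind.BERKELEY_DB_JE;
        }
        return IndexKind.UNKNOWN;
    }

    public static void main(String[] args) throws IOException {
        // Pass the index folder as the first argument; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(dir + " -> " + detect(dir));
    }
}
```

If the result is BERKELEY_DB_JE, the folder is most likely one of the precomputed vector indexes and should not be opened with a Lucene-based reader.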

A LuceneVectorReader, in contrast, creates such vectors on the fly from a
Lucene index built from a document collection. When a similarity measure
wants a vector for a word, it queries Lucene for the term frequencies of that
word in each document. This is more flexible, as we can do all sorts of
weighting and normalizing, but it is also much, much slower.
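
To make the idea of a "vector over the document space" concrete, here is a toy sketch that does not use Lucene or DKPro at all. It counts term frequencies per document in an in-memory collection, applies one common tf-idf variant (chosen only for illustration), L2-normalizes the result and compares two words with the cosine. All names are hypothetical, and the actual weighting in LuceneVectorReader may differ.

```java
import java.util.Arrays;
import java.util.List;

/** Toy illustration of building per-term vectors over a document space. */
class TermVectorSketch {

    /** Returns one weight per document: tf * idf, then L2-normalized. */
    static double[] vectorFor(String term, List<List<String>> docs) {
        double[] vector = new double[docs.size()];
        int docFreq = 0;
        for (int d = 0; d < docs.size(); d++) {
            for (String token : docs.get(d)) {
                if (token.equals(term)) {
                    vector[d]++;              // raw term frequency in document d
                }
            }
            if (vector[d] > 0) {
                docFreq++;
            }
        }
        if (docFreq == 0) {
            return vector;                    // unseen term: all-zero vector
        }
        // One common idf variant, an assumption made for this sketch.
        double idf = 1.0 + Math.log((double) docs.size() / docFreq);
        double sumOfSquares = 0.0;
        for (int d = 0; d < vector.length; d++) {
            vector[d] *= idf;
            sumOfSquares += vector[d] * vector[d];
        }
        double norm = Math.sqrt(sumOfSquares);
        for (int d = 0; d < vector.length; d++) {
            vector[d] /= norm;                // L2 normalization
        }
        return vector;
    }

    /** Cosine similarity; returns 0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("cat", "dog"),
                Arrays.asList("cat", "cat", "mouse"),
                Arrays.asList("bird"));
        double sim = cosine(vectorFor("cat", docs), vectorFor("dog", docs));
        System.out.println("similarity(cat, dog) = " + sim);
    }
}
```

A precomputed vector index skips the counting and weighting and stores only the finished vectors, which is why it is faster to query but can no longer be re-weighted.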

I have also added a similar text to the wiki page in order to avoid future 
confusion.

Original comment by torsten....@gmail.com on 18 Oct 2013 at 2:01

Changed state: Done

GoogleCodeExporter commented 9 years ago

To document it here: if you want to create an ESA measure from the
downloaded files, use

TextSimilarityMeasure measure = new VectorComparator(new CachingVectorReader(
                new VectorIndexReader(new File(modelLocation)),
                Integer.parseInt(cacheSize)));

Original comment by torsten....@gmail.com on 18 Oct 2013 at 2:03

GoogleCodeExporter commented 9 years ago

Thanks, it worked perfectly with all three dumps: WordNet, Wikipedia and
Wiktionary.

Just one comment, though: some more Javadoc would help, for example on
choosing the right cacheSize and how it affects vector loading (if at all).
I am currently experimenting with different cache sizes and profiling my
application to check the memory footprint.

Thanks again.

Original comment by harsh.11...@gmail.com on 18 Oct 2013 at 4:05

GoogleCodeExporter commented 9 years ago

Good to hear that it works now.

It would be nice if you could report your findings on the cache size here. I am 
not sure we ever really tested that consistently.
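
For anyone running such an experiment, the sketch below shows one generic way a size-bounded cache can work and how to count hits and misses for a given capacity. It is a plain LRU cache built on java.util.LinkedHashMap. How CachingVectorReader is actually implemented is not stated in this thread, so treat the eviction policy and all names here as assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical LRU cache for term vectors, with hit/miss counters for profiling. */
class LruVectorCache {

    private final Function<String, double[]> loader;
    private final LinkedHashMap<String, double[]> cache;
    private long hits;
    private long misses;

    LruVectorCache(final int capacity, Function<String, double[]> loader) {
        this.loader = loader;
        // accessOrder = true keeps the least recently used entry first.
        this.cache = new LinkedHashMap<String, double[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, double[]> eldest) {
                return size() > capacity;     // evict once the bound is exceeded
            }
        };
    }

    /** Returns the cached vector, loading it through the loader on a miss. */
    double[] get(String term) {
        double[] vector = cache.get(term);
        if (vector != null) {
            hits++;
            return vector;
        }
        misses++;
        vector = loader.apply(term);          // assumed to return a non-null vector
        cache.put(term, vector);
        return vector;
    }

    long hits() { return hits; }
    long misses() { return misses; }
    int size() { return cache.size(); }

    public static void main(String[] args) {
        // Stand-in loader; a real one would read the vector from the index.
        LruVectorCache cache = new LruVectorCache(2, term -> new double[] { term.length() });
        for (String term : new String[] { "a", "b", "a", "c", "b" }) {
            cache.get(term);
        }
        System.out.println("hits=" + cache.hits() + " misses=" + cache.misses());
    }
}
```

With a cache like this, memory use grows roughly with capacity times the size of one vector, so the hit rate on a representative workload is the number to compare across cache sizes.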

Original comment by torsten....@gmail.com on 18 Oct 2013 at 4:18

sourish-rygbee / dkpro-similarity-asl

Invalid ESA vector index dumps at https://code.google.com/p/dkpro-similarity-asl/downloads/list #19