seunginah / semanticvectors

Automatically exported from code.google.com/p/semanticvectors

ArrayIndexOutOfBoundsException in TermVectorsFromLucene, caused by non-optimized Lucene index #2

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Create a *non-optimized* Lucene index.
2. Try to run BuildIndex using this Lucene index as input.

Instead of successfully creating the term vectors, BuildIndex can fail with
the following output:

Populating basic doc vector table ...
Creating term vectors ...
There are 30198 terms (and 924 docs)
0 ... Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 945
   at pitt.search.semanticvectors.TermVectorsFromLucene.<init>(TermVectorsFromLucene.java:107)
   at pitt.search.semanticvectors.BuildIndex.main(BuildIndex.java:91)

This occurred in version 1.0 of the software. We believe the cause is that
non-optimized Lucene indexes don't necessarily use contiguous integers for
the DocIDs (deleted documents can leave gaps), so a DocID can be greater
than or equal to numDocs, which is the bound used when the array in question
is declared. In the output above, the index holds 924 docs, yet the failing
array index is 945.
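
The suspected cause can be illustrated without Lucene at all. The sketch below is not the SemanticVectors code: the class DocIdGapDemo and its methods are hypothetical stand-ins, with numDocs and maxDoc loosely mirroring Lucene's IndexReader.numDocs() and IndexReader.maxDoc(). It assumes a small index in which two documents were deleted, and shows that an array sized by the document count overflows, while one sized by the highest DocID plus one does not.

```java
public class DocIdGapDemo {
    /** Count of live documents; stands in for IndexReader.numDocs(). */
    public static int numDocs(int[] liveDocIds) {
        return liveDocIds.length;
    }

    /**
     * One more than the largest live DocID; a simplified stand-in for
     * IndexReader.maxDoc() (an assumption made for this illustration).
     */
    public static int maxDoc(int[] liveDocIds) {
        int max = -1;
        for (int id : liveDocIds) {
            max = Math.max(max, id);
        }
        return max + 1;
    }

    /** Marks every live DocID in a table of the given size. */
    public static boolean[] markDocs(int[] liveDocIds, int tableSize) {
        boolean[] seen = new boolean[tableSize];
        for (int id : liveDocIds) {
            seen[id] = true; // throws if id >= tableSize
        }
        return seen;
    }

    public static void main(String[] args) {
        // Hypothetical index: DocIDs 3 and 4 were deleted, leaving gaps.
        int[] liveDocIds = {0, 1, 2, 5, 6};

        try {
            // Table sized by document count (5), but DocIDs reach 6.
            markDocs(liveDocIds, numDocs(liveDocIds));
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("Sized by numDocs: " + e.getMessage());
        }

        // Table sized by highest DocID + 1 (7) has room for every DocID.
        boolean[] seen = markDocs(liveDocIds, maxDoc(liveDocIds));
        System.out.println("Sized by maxDoc: table length " + seen.length);
    }
}
```

Sizing by the highest DocID is only one possible remedy, and it is not the one adopted in this issue; the fix below instead optimizes the index so that the DocIDs become contiguous.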

Lionel posted a fix as follows:

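import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexModifier;

// Open the existing index (create = false) and optimize it in place.
IndexModifier modifier = new IndexModifier(indexDir,
                             new StandardAnalyzer(), false);
modifier.optimize();
modifier.close();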

Running this before the index is opened optimizes it, which leaves the
DocIDs contiguous and solves the problem. This fix is now checked into
version 1.1.

However, since many SV classes do depend on Lucene indexes in one way or
another, it's possible that issues like this could recur, so we should
watch out for them.

Original issue reported on code.google.com by maryl...@gmail.com on 7 Dec 2007 at 4:14

GoogleCodeExporter commented 9 years ago

Original comment by dwidd...@gmail.com on 7 Dec 2007 at 4:16