tjake / Solandra

Solandra = Solr + Cassandra
Apache License 2.0

Numeric fields are not properly stored or indexed #40

Closed tnine closed 13 years ago

tnine commented 14 years ago

Hi Jake, take a look at my fork: I've added tests based on Uwe's numeric tests from the Lucene core. Only a handful of them appear to pass. I'll be correcting this in my fork and will let you know when I'm done.

tnine commented 14 years ago

Hey Jake, I've investigated this further and determined the issue. The LucandraTermEnum does not match the spec when enumerating numeric trie terms. I added some debug output and ran the TestNumericRangeQuery32 tests against the default RAMDirectory on Lucene 2.9.3. This is the enumeration order I get when the "term()" method is invoked on Lucene's SegmentTermEnum class:

Returning term for field 'field8' hex value is : 60077f7e6814
Returning term for field 'field8' hex value is : 60077f7e6814
Returning term for field 'field8' hex value is : 60077f7e6814
Returning term for field 'field8' hex value is : 60077f7e6814
Returning term for field 'field8' hex value is : 68037f7f00
Returning term for field 'field8' hex value is : 68037f7f00
Returning term for field 'field8' hex value is : 68037f7f00
Returning term for field 'field8' hex value is : 68037f7f00
Returning term for field 'field8' hex value is : 68037f7f34
Returning term for field 'field8' hex value is : 68037f7f34
Returning term for field 'field8' hex value is : 68037f7f34
Returning term for field 'field8' hex value is : 68037f7f34
Returning term for field 'field8' hex value is : 68037f7f34
Returning term for field 'field8' hex value is : 68037f7f34
Returning term for field 'field8' hex value is : 68037f7f4e
Returning term for field 'field8' hex value is : 68037f7f34
Returning term for field 'field8' hex value is : 68037f7f4e
Returning term for field 'field8' hex value is : 68037f7f4e
Returning term for field 'field8' hex value is : 68037f7f68
Returning term for field 'field8' hex value is : 68037f7f4e
Returning term for field 'field8' hex value is : 68037f7f68
Returning term for field 'field8' hex value is : 68037f7f68
Returning term for field 'field8' hex value is : 6804000002
Returning term for field 'field8' hex value is : 68037f7f68
Returning term for field 'field8' hex value is : 70017f7f
Returning term for field 'field8' hex value is : 70017f7f
Returning term for field 'field8' hex value is : 70017f7f
Returning term for field 'field8' hex value is : 70017f7f
Returning term for field 'field8' hex value is : 78007f
Returning term for field 'field8' hex value is : 78007f
Returning term for field 'field8' hex value is : 78007f
Returning term for field 'field8' hex value is : 78007f
Returning term for field 'field8' hex value is : 780100
Returning term for field 'field8' hex value is : 780100
Returning term for field 'field8' hex value is : 780100
Returning term for field 'field8' hex value is : 780100
Returning term for field 'field8' hex value is : 780100
Returning term for field 'field8' hex value is : 780100
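For anyone reading these dumps, here is a minimal sketch of how a hex term can be decoded back into its shift and int value. It assumes the terms use the 32-bit prefix coding from Lucene 2.9's NumericUtils (first char is 0x60 plus the shift, followed by 7-bit chunks of the sign-flipped value). The class and method names (TrieTermDecoder, shiftOf, decodeInt) are made up for illustration and are not part of Lucene or Lucandra.

```java
public class TrieTermDecoder {
    // Same value as Lucene's NumericUtils.SHIFT_START_INT:
    // the first char of an int trie term is this constant plus the shift.
    static final int SHIFT_START_INT = 0x60;

    // Turn a hex dump such as "60077f7e6814" into one int per term char.
    static int[] hexToChars(String hex) {
        int[] chars = new int[hex.length() / 2];
        for (int i = 0; i < chars.length; i++) {
            chars[i] = Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return chars;
    }

    // Number of low-order bits dropped at this precision level.
    static int shiftOf(String hex) {
        int shift = hexToChars(hex)[0] - SHIFT_START_INT;
        if (shift < 0 || shift > 31) {
            throw new IllegalArgumentException("not an int trie term: " + hex);
        }
        return shift;
    }

    // Rebuild the int value; the low 'shift' bits come back as zero.
    static int decodeInt(String hex) {
        int[] chars = hexToChars(hex);
        int shift = shiftOf(hex);
        int sortableBits = 0;
        for (int i = 1; i < chars.length; i++) {
            sortableBits = (sortableBits << 7) | (chars[i] & 0x7f);
        }
        // Flip the sign bit back, mirroring what the encoder did.
        return (sortableBits << shift) ^ 0x80000000;
    }

    public static void main(String[] args) {
        String term = "60077f7e6814";
        // prints: shift=0 value=-19436
        System.out.println("shift=" + shiftOf(term) + " value=" + decodeInt(term));
    }
}
```

Read this way, the leading bytes 60, 68, 70 and 78 in the output are shifts 0, 8, 16 and 24, so the enumeration walks all full-precision terms first and then each coarser level in turn.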

These are the results with LucandraTermEnum:

Returning term for field 'field8' hex value is : 60077f7e6814
Returning term for field 'field8' hex value is : 600809433244
Returning term for field 'field8' hex value is : 68037f7f34
Returning term for field 'field8' hex value is : 68037f7f34
Returning term for field 'field8' hex value is : 68037f7f4e
Returning term for field 'field8' hex value is : 68037f7f4e
Returning term for field 'field8' hex value is : 68037f7f68
Returning term for field 'field8' hex value is : 68037f7f68
Returning term for field 'field8' hex value is : 6804000002
Returning term for field 'field8' hex value is : 6804046008
Returning term for field 'field8' hex value is : 6804046008

As you can see, the results are not properly enumerated. Given that you're using a Tree for the cached terms, they should be ordered properly after insert. It seems that this may be an issue with the way loadTerms is invoked.
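To illustrate the ordering I would expect from a sorted cache, here is a rough sketch that is not taken from the Lucandra code. It assumes the cache behaves like a java.util.TreeSet keyed on the term text, and that terms of one field compare by plain String.compareTo, as Lucene 2.9's Term does. SortedTermsSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class SortedTermsSketch {
    // Convert a hex dump back into the term text (one char per byte).
    static String toTermText(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < hex.length(); i += 2) {
            sb.append((char) Integer.parseInt(hex.substring(i, i + 2), 16));
        }
        return sb.toString();
    }

    // Convert term text back into the hex form used in the debug output.
    static String toHex(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            sb.append(String.format("%02x", (int) c));
        }
        return sb.toString();
    }

    // Insert terms in arbitrary order; iterate them back in sorted order.
    static List<String> enumerate(List<String> hexTerms) {
        TreeSet<String> cache = new TreeSet<>(); // stand-in for the cached-terms tree
        for (String hex : hexTerms) {
            cache.add(toTermText(hex));
        }
        List<String> ordered = new ArrayList<>();
        for (String text : cache) {
            ordered.add(toHex(text));
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> inserted = Arrays.asList(
            "68037f7f34", "78007f", "60077f7e6814", "70017f7f", "6804000002", "68037f7f00");
        for (String hex : enumerate(inserted)) {
            System.out.println(hex);
        }
    }
}
```

Whatever the insertion order, iteration comes back as 60077f7e6814, 68037f7f00, 68037f7f34, 6804000002, 70017f7f, 78007f, which is the relative order of those terms in the RAMDirectory run. A tree that holds every term for the field would therefore enumerate correctly, which is why the way the terms get loaded looks like the more likely culprit.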

tnine commented 14 years ago

Hi Jake, I've been digging into this one all day. After searching a bit more, I found an issue in my local copy of the TermEnum, which I have corrected. This resolves the enumeration issue I described above. However, the documents are not returned in "default" order, i.e. the order in which they were added to the index, as the test expects. I'm assuming this is a bug in LucandraTermDocs, but I'm having a hard time locating it. Thoughts?

tnine commented 14 years ago

I've updated the test case on my fork so that it shows the issue.

http://github.com/tnine/Lucandra/blob/master/test/lucandra/NumericRangeTests.java

It appears to still be term enum related. The calls to IndexReader.addDocument are occurring in a different order than the insertion.

tjake commented 13 years ago

Fixed.