mchaput / whoosh

Pure-Python full-text search library

reader.field_terms raises OverflowError for DATETIME fields #24

Open Quemoy opened 2 years ago

Quemoy commented 2 years ago

Calling the field_terms method of the whoosh.reading.IndexReader class on a DATETIME field raises an exception. The following index and documents serve as a test case:

from whoosh import index
from whoosh.fields import Schema, TEXT, ID, DATETIME
from pathlib import Path
from datetime import datetime

schema = Schema(doc_ID=ID(unique=True, stored=True),
                m_date=DATETIME(stored=True),
                content=TEXT(stored=True)
                )

idx_fp = Path.home() / 'test'
idx_fp.mkdir(exist_ok=True)  # make sure the index directory exists before creating the index
ix = index.create_in(idx_fp, schema)

with ix.writer() as wr:
    wr.add_document(doc_ID='A1', m_date=datetime(2020,1,1), content='doc A1')
    wr.add_document(doc_ID='B2', m_date=datetime(1985,3,15), content='doc B2')
    wr.add_document(doc_ID='C3', m_date=datetime(2019,12,8), content='doc C3')
    wr.add_document(doc_ID='D4', m_date=datetime(1977,1,1), content='doc D4')

read = ix.reader()

When attempting to retrieve all terms for the m_date field via list(read.field_terms('m_date')), an OverflowError is raised with the following message:

---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-2-da69936def15> in <module>
----> 1 list(read.field_terms('m_date'))

~/.conda/envs/whoosh/lib/python3.7/site-packages/whoosh/reading.py in field_terms(self, fieldname)
    260         from_bytes = self.schema[fieldname].from_bytes
    261         for btext in self.lexicon(fieldname):
--> 262             yield from_bytes(btext)
    263 
    264     def __iter__(self):

~/.conda/envs/whoosh/lib/python3.7/site-packages/whoosh/fields.py in from_bytes(self, bs)
    843     def from_bytes(self, bs):
    844         x = NUMERIC.from_bytes(self, bs)
--> 845         return long_to_datetime(x)
    846 
    847     def _parse_datestring(self, qstring):

~/.conda/envs/whoosh/lib/python3.7/site-packages/whoosh/util/times.py in long_to_datetime(x)
     87     x -= seconds * 1000000
     88 
---> 89     return datetime.min + timedelta(days=days, seconds=seconds, microseconds=x)
     90 
     91 

OverflowError: date value out of range

It seems that this is caused by a DATETIME field having extra terms added on by whoosh. For example, printing the result of read.all_terms() gives the following:

('content', b'a1')
('content', b'b2')
('content', b'c3')
('content', b'd4')
('content', b'doc')
('doc_ID', b'A1')
('doc_ID', b'B2')
('doc_ID', b'C3')
('doc_ID', b'D4')
('m_date', b'\x00\x80\xdd\x88\xed\x0fe\xa0\x00')
('m_date', b'\x00\x80\xdetF.\x1d\xc0\x00')
('m_date', b'\x00\x80\xe2Y$\xf4\xf5\x00\x00')
('m_date', b'\x00\x80\xe2[\x07\xc1&\x00\x00')
('m_date', b'\x08\x00\x80\xdd\x88\xed\x0fe\xa0')
('m_date', b'\x08\x00\x80\xdetF.\x1d\xc0')
('m_date', b'\x08\x00\x80\xe2Y$\xf4\xf5\x00')
('m_date', b'\x08\x00\x80\xe2[\x07\xc1&\x00')
('m_date', b'\x10\x00\x00\x80\xdd\x88\xed\x0fe')
('m_date', b'\x10\x00\x00\x80\xdetF.\x1d')
('m_date', b'\x10\x00\x00\x80\xe2Y$\xf4\xf5')
('m_date', b'\x10\x00\x00\x80\xe2[\x07\xc1&')
('m_date', b'\x18\x00\x00\x00\x80\xdd\x88\xed\x0f')
('m_date', b'\x18\x00\x00\x00\x80\xdetF.')
('m_date', b'\x18\x00\x00\x00\x80\xe2Y$\xf4')
('m_date', b'\x18\x00\x00\x00\x80\xe2[\x07\xc1')
('m_date', b' \x00\x00\x00\x00\x80\xdd\x88\xed')
('m_date', b' \x00\x00\x00\x00\x80\xdetF')
('m_date', b' \x00\x00\x00\x00\x80\xe2Y$')
('m_date', b' \x00\x00\x00\x00\x80\xe2[\x07')
('m_date', b'(\x00\x00\x00\x00\x00\x80\xdd\x88')
('m_date', b'(\x00\x00\x00\x00\x00\x80\xdet')
('m_date', b'(\x00\x00\x00\x00\x00\x80\xe2Y')
('m_date', b'(\x00\x00\x00\x00\x00\x80\xe2[')
('m_date', b'0\x00\x00\x00\x00\x00\x00\x80\xdd')
('m_date', b'0\x00\x00\x00\x00\x00\x00\x80\xde')
('m_date', b'0\x00\x00\x00\x00\x00\x00\x80\xe2')
('m_date', b'8\x00\x00\x00\x00\x00\x00\x00\x80')
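One possible reading of the listing above, based only on the bytes shown: the first byte of each m_date term looks like a bit shift (0x00, 0x08, 0x10, ... 0x38), and the remaining eight bytes look like the full-precision value shifted right by that many bits. The sketch below checks this for the terms that share the b'\x80\xdd...' prefix, using only the standard library. split_term is a hypothetical helper, not part of whoosh, and the big-endian unsigned layout is an assumption.

```python
import struct

# Terms copied from the all_terms() listing, one per apparent shift level.
terms = [
    b'\x00\x80\xdd\x88\xed\x0fe\xa0\x00',
    b'\x08\x00\x80\xdd\x88\xed\x0fe\xa0',
    b'\x10\x00\x00\x80\xdd\x88\xed\x0fe',
    b'\x18\x00\x00\x00\x80\xdd\x88\xed\x0f',
    b' \x00\x00\x00\x00\x80\xdd\x88\xed',
    b'(\x00\x00\x00\x00\x00\x80\xdd\x88',
    b'0\x00\x00\x00\x00\x00\x00\x80\xdd',
    b'8\x00\x00\x00\x00\x00\x00\x00\x80',
]

def split_term(term):
    """Hypothetical helper: return (first byte, remaining 8 bytes as an unsigned int)."""
    # Assumption: the payload is a big-endian unsigned 64-bit integer.
    return term[0], struct.unpack(">Q", term[1:])[0]

_, full_value = split_term(terms[0])

shifts = []
matches = []
for term in terms:
    shift, value = split_term(term)
    shifts.append(shift)
    # Does the payload equal the full-precision payload shifted right by `shift` bits?
    matches.append(value == full_value >> shift)

print(shifts)        # [0, 8, 16, 24, 32, 40, 48, 56]
print(all(matches))  # True
```

If this reading is right, only the terms whose first byte is 0x00 carry an actual document value, which would explain why the first four terms decode cleanly and the fifth does not.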

The terms of the content and doc_ID fields all make sense given the documents, but the m_date field has extra terms stored in the index. The contents of the m_date field can be printed with the following code:

dt = ix.schema['m_date']
for x in read.lexicon('m_date'):
    print(repr(dt.from_bytes(x)))

This prints four datetime values (which are the values for the documents)

datetime.datetime(1977, 1, 1, 0, 0)
datetime.datetime(1985, 3, 15, 0, 0)
datetime.datetime(2019, 12, 8, 0, 0)
datetime.datetime(2020, 1, 1, 0, 0)

...before running into an overflow exception on the fifth term, b'\x08\x00\x80\xdd\x88\xed\x0fe\xa0'. Decoding that term by hand, using the code from the NUMERIC class, gives:

In [4]: from whoosh.util.numeric import from_sortable
In [5]: dt._struct.unpack(b'\x08\x00\x80\xdd\x88\xed\x0fe\xa0'[1:])[0]
Out[5]: 36272377181463968
In [6]: from_sortable(dt.numtype, dt.bits, dt.signed, 36272377181463968)
Out[6]: -9187099659673311840

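This raises an exception because the decoded value is negative: long_to_datetime adds it to datetime.min (year 1), and Python datetimes cannot represent anything earlier than that.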
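The arithmetic can be reproduced without whoosh. This is a minimal sketch, assuming (as Out[5] and Out[6] suggest) that the payload is a big-endian unsigned 64-bit integer, that the signed value is recovered by subtracting 2**63, and that the result counts microseconds from datetime.min. to_signed and to_datetime are hypothetical stand-ins, not whoosh functions.

```python
import struct
from datetime import datetime, timedelta

def to_signed(term):
    """Hypothetical helper: unpack the 8 payload bytes and undo the sortable offset."""
    unsigned = struct.unpack(">Q", term[1:])[0]
    # Assumption: signed 64-bit values are stored offset by 2**63.
    return unsigned - 2**63

def to_datetime(microseconds):
    # Same idea as the long_to_datetime line in the traceback:
    # an offset in microseconds added to datetime.min.
    return datetime.min + timedelta(microseconds=microseconds)

# First term in the lexicon (first byte 0x00): decodes to a valid date.
full = to_signed(b'\x00\x80\xdd\x88\xed\x0fe\xa0\x00')
print(to_datetime(full))  # 1977-01-01 00:00:00

# Fifth term (first byte 0x08): the value is negative, as in Out[6].
shifted = to_signed(b'\x08\x00\x80\xdd\x88\xed\x0fe\xa0')
print(shifted)  # -9187099659673311840

try:
    to_datetime(shifted)
    overflowed = False
except OverflowError as exc:
    overflowed = True
    print("OverflowError:", exc)
```

The second term fails for the same reason as in the traceback: the offset lands before datetime.min.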

Does anyone know what the extra terms added to the index for the DATETIME field are?
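In the meantime, a possible workaround is to skip the extra terms before decoding. This is only a sketch, and it assumes that the terms worth decoding are exactly those whose first byte is 0x00, which matches the four terms that decode correctly in the output above. full_precision_terms is a hypothetical helper, not a whoosh API.

```python
def full_precision_terms(lexicon_terms):
    """Hypothetical helper: keep only terms whose first byte is 0x00."""
    # Assumption: terms with a non-zero first byte are the extra terms
    # and do not correspond to a document value.
    return [term for term in lexicon_terms if term[:1] == b'\x00']

# A few terms copied from the all_terms() listing.
sample = [
    b'\x00\x80\xdd\x88\xed\x0fe\xa0\x00',
    b'\x00\x80\xdetF.\x1d\xc0\x00',
    b'\x08\x00\x80\xdd\x88\xed\x0fe\xa0',
    b'8\x00\x00\x00\x00\x00\x00\x00\x80',
]

kept = full_precision_terms(sample)
print(len(kept))  # 2
```

With the index from this report, the equivalent call would be [dt.from_bytes(t) for t in full_precision_terms(read.lexicon('m_date'))], reusing dt and read from the snippets above.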