Calling the field_terms method of the whoosh.reading.IndexReader class raises an exception. I used the following index and documents as a test:
from datetime import datetime
from pathlib import Path

from whoosh import index
from whoosh.fields import Schema, TEXT, ID, DATETIME

schema = Schema(doc_ID=ID(unique=True, stored=True),
                m_date=DATETIME(stored=True),
                content=TEXT(stored=True)
                )

idx_fp = Path.home()/'test'
ix = index.create_in(idx_fp, schema)

with ix.writer() as wr:
    wr.add_document(doc_ID='A1', m_date=datetime(2020,1,1), content='doc A1')
    wr.add_document(doc_ID='B2', m_date=datetime(1985,3,15), content='doc B2')
    wr.add_document(doc_ID='C3', m_date=datetime(2019,12,8), content='doc C3')
    wr.add_document(doc_ID='D4', m_date=datetime(1977,1,1), content='doc D4')

read = ix.reader()
When attempting to retrieve all terms for the m_date field via list(read.field_terms('m_date')), an OverflowError is raised with the following message:
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
<ipython-input-2-da69936def15> in <module>
----> 1 list(read.field_terms('m_date'))
~/.conda/envs/whoosh/lib/python3.7/site-packages/whoosh/reading.py in field_terms(self, fieldname)
260 from_bytes = self.schema[fieldname].from_bytes
261 for btext in self.lexicon(fieldname):
--> 262 yield from_bytes(btext)
263
264 def __iter__(self):
~/.conda/envs/whoosh/lib/python3.7/site-packages/whoosh/fields.py in from_bytes(self, bs)
843 def from_bytes(self, bs):
844 x = NUMERIC.from_bytes(self, bs)
--> 845 return long_to_datetime(x)
846
847 def _parse_datestring(self, qstring):
~/.conda/envs/whoosh/lib/python3.7/site-packages/whoosh/util/times.py in long_to_datetime(x)
87 x -= seconds * 1000000
88
---> 89 return datetime.min + timedelta(days=days, seconds=seconds, microseconds=x)
90
91
OverflowError: date value out of range
It seems that this is caused by a DATETIME field having extra terms added on by whoosh. For example, printing the result of read.all_terms() gives the following:
The terms of the content and doc_ID fields all make sense given the documents, but the m_date field has extra terms stored in the index. I printed the contents of the m_date field using the following code:
dt = ix.schema['m_date']
for x in read.lexicon('m_date'):
    print(repr(dt.from_bytes(x)))
This prints four datetime values (the values stored for the documents) before running into an overflow exception on the fifth term, b'\x08\x00\x80\xdd\x88\xed\x0fe\xa0'. Decoding that term by hand with the code used in the NUMERIC class gives:
In [4]: from whoosh.util.numeric import from_sortable
In [5]: dt._struct.unpack(b'\x08\x00\x80\xdd\x88\xed\x0fe\xa0'[1:])[0]
Out[5]: 36272377181463968
In [6]: from_sortable(dt.numtype, dt.bits, dt.signed, 36272377181463968)
Out[6]: -9187099659673311840
This raises an exception because the decoded value is a large negative number, and Python's datetime cannot represent anything earlier than datetime.min (year 1).
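To show why that negative value cannot be turned into a date, here is a minimal stdlib-only sketch. The helper micros_to_datetime is a hypothetical stand-in, not whoosh's long_to_datetime; it only assumes, based on the last line of the traceback, that the number is treated as a microsecond offset from datetime.min.

```python
from datetime import datetime, timedelta


def micros_to_datetime(x):
    """Hypothetical stand-in: treat x as microseconds since datetime.min."""
    return datetime.min + timedelta(microseconds=x)


# Value reported in Out[6] of the session above.
decoded = -9187099659673311840

try:
    micros_to_datetime(decoded)
    overflowed = False
except OverflowError:
    # A negative offset would land before datetime.min, which is not representable.
    overflowed = True

print(overflowed)
```

An offset of 0 maps to datetime.min itself, so any negative offset falls outside the representable range.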
Does anyone know what the extra terms added to the index for the DATETIME field are?
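In the meantime, something like the following could help separate the terms that decode from the ones that do not. It is only a sketch: decode_terms and fake_decode are hypothetical names of mine, and the demo uses a stand-in decoder rather than whoosh so that it runs on its own.

```python
from datetime import datetime, timedelta


def decode_terms(terms, decode):
    """Decode each term, collecting those that raise OverflowError separately."""
    decoded, failed = [], []
    for term in terms:
        try:
            decoded.append(decode(term))
        except OverflowError:
            failed.append(term)
    return decoded, failed


def fake_decode(term):
    """Stand-in decoder (not whoosh): signed big-endian microseconds since datetime.min."""
    return datetime.min + timedelta(microseconds=int.from_bytes(term, "big", signed=True))


good = (0).to_bytes(8, "big", signed=True)
bad = (-1).to_bytes(8, "big", signed=True)
decoded, failed = decode_terms([good, bad], fake_decode)
print(decoded, failed)
```

With the index from this report, the equivalent call would presumably be decode_terms(read.lexicon('m_date'), dt.from_bytes), which should return the four document dates plus the list of raw terms that fail to decode.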