kaykay-dv / pocketsearch

A simple full-text search library for Python using SQLite and its FTS5 extension
https://pocketsearch.readthedocs.io/en/latest/
MIT License

Document count in .tokens method #60

Closed kaykay-dv closed 6 months ago

kaykay-dv commented 6 months ago

It seems that the "num_documents" property returned by .tokens reports the wrong number of documents; it appears to be an overestimate. For example, when indexing a corpus of 160,000 documents, the most common token ("the") appears in 321942 documents according to the statistics. That is clearly wrong, since a token cannot occur in more documents than the corpus contains.
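As a sanity check, the per-token document count can never exceed the corpus size. The following is a minimal sketch of that invariant in plain Python. It does not use pocketsearch; the whitespace tokenizer and the helper name `document_frequencies` are assumptions made for illustration only.

```python
from collections import Counter


def document_frequencies(corpus):
    """Count, for each token, how many documents contain it at least once."""
    freq = Counter()
    for text in corpus:
        # set() ensures a token repeated within one document is counted once.
        freq.update(set(text.lower().split()))
    return freq


corpus = ["the cat saw the dog", "the dog slept", "a bird sang"]
freq = document_frequencies(corpus)

# Invariant: no token can appear in more documents than the corpus holds.
assert all(count <= len(corpus) for count in freq.values())
```

A reported count of 321942 for a corpus of 160,000 documents violates this invariant, which suggests that some documents are being counted more than once.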

kaykay-dv commented 6 months ago

The problem has been identified in the insert_or_update method of the PocketSearch class. insert_or_update should use the internal rowid identifier to update existing entries, not a unique ID field provided by the user. When a custom unique ID field is used, the token table is not updated correctly.
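The sketch below illustrates one way this kind of overcount can arise. It uses the standard sqlite3 module with FTS5 and an fts5vocab table directly. It is not pocketsearch's actual code: the schema, the `doc_id` column, and the functions `upsert_naive` and `upsert_by_rowid` are hypothetical, and it assumes the underlying SQLite build has FTS5 enabled.

```python
import sqlite3


def new_index():
    """Create an in-memory FTS5 table plus an fts5vocab table over its tokens."""
    conn = sqlite3.connect(":memory:")
    # doc_id stands in for the user-supplied unique ID; it is not tokenized.
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(doc_id UNINDEXED, body)")
    # The 'row' vocab type reports, per term, how many rows contain it.
    conn.execute("CREATE VIRTUAL TABLE docs_vocab USING fts5vocab('docs', 'row')")
    return conn


def upsert_naive(conn, doc_id, body):
    """Hypothetical buggy upsert: never resolves doc_id to an existing rowid."""
    conn.execute("INSERT INTO docs (doc_id, body) VALUES (?, ?)", (doc_id, body))


def upsert_by_rowid(conn, doc_id, body):
    """Resolve the user ID to the internal rowid, then update that row in place."""
    row = conn.execute(
        "SELECT rowid FROM docs WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        conn.execute("INSERT INTO docs (doc_id, body) VALUES (?, ?)", (doc_id, body))
    else:
        conn.execute("UPDATE docs SET body = ? WHERE rowid = ?", (body, row[0]))


def document_count(conn, term):
    """Return the number of rows containing `term` according to fts5vocab."""
    row = conn.execute(
        "SELECT doc FROM docs_vocab WHERE term = ?", (term,)
    ).fetchone()
    return row[0] if row else 0


def index_twice(upsert):
    """Index the same two documents twice and return the count for 'the'."""
    conn = new_index()
    for _ in range(2):
        for doc_id, body in [("a", "the cat"), ("b", "the dog")]:
            upsert(conn, doc_id, body)
    conn.commit()
    return document_count(conn, "the")


naive_count = index_twice(upsert_naive)
rowid_count = index_twice(upsert_by_rowid)
print(naive_count, rowid_count)
```

With only two distinct documents, the naive variant leaves stale rows behind, so the vocabulary table counts each document once per indexing pass. The rowid-based variant keeps one row per document, so the count stays at the true number of documents.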

kaykay-dv commented 6 months ago

Fixed in 0.30.0