Closed saaj closed 10 years ago
Thank you for pointing it out. I've just restarted working on this, what a coincidence :) I'll take a look.
Great :-) But I was too hasty with the report. The suggested tokenizer return of `(inputToken, normalizedToken)`
doesn't work with MATCH in general -- SQLite sends broken UTF-8 to the tokenizer. I'm currently debugging it.
Updated the gist and the report.
Thanks!
I've merged your suggested changes and committed them together with some changes of my own.
This broke the `Tokenizer.tokenize` interface, so I applied your patch and updated the existing test cases.
You're welcome.
I've looked at the changes. It looks good except for one thing: it's now Python 3.4-only, as there's no `enum` module in earlier versions. `enum` is mostly needed in advanced cases when you need to enumerate constants, translate them, and so on. When it's just a matter of attribute access, `enum` is overkill, so I suggest keeping a plain `class SQLiteResultCodes:`, as it gives wide compatibility and doesn't make the code less readable.
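For illustration, here's a minimal sketch of the two approaches side by side (the class and attribute names follow the snippet quoted later in this thread; the `enum` variant is my own illustration, not code from the gist):

```python
# Plain class as a namespace: works on any Python version,
# attribute access is a simple class-attribute lookup.
class SQLiteResultCodes:
    SQLITE_OK = 0
    SQLITE_DONE = 101

# The enum-based alternative requires Python 3.4+ (or the enum34
# backport). For plain access it behaves the same, but adds a
# dependency and may cost a little extra on each lookup.
import enum

class SQLiteResultCodesEnum(enum.IntEnum):
    SQLITE_OK = 0
    SQLITE_DONE = 101

# Both give the same integer values for comparisons against SQLite.
assert SQLiteResultCodes.SQLITE_OK == SQLiteResultCodesEnum.SQLITE_OK == 0
```

Since the codes are only ever compared against integers returned by the C API, the plain class already covers the use case.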
Also I'd like to share some experience. When I came to the changes from the gist, I started testing them with something close to a real-world scenario and data. I had a database with ~2.5k documents, ~32 MB of Unicode text converted from HTML. I added a real tokenizer and stemmer and tried to convert it to an SQLite database with the registered tokenizer.
At first it gave me some trouble, as I got serious SQLite errors (like `DatabaseError: database disk image is malformed`) and even Python process crashes ("double free or corruption" from libsqlite3). Initially I thought something was badly broken in the FFI `ctypes` code. But after some analysis and debugging I realized the problem was broken Unicode coming from HTML text-extraction issues. After I fixed the encoding and sanitized the resulting tokens, it was fine. So it's important to note that SQLite (3.7.15.2 in my case) is fragile with respect to badly formed tokenizer output.
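A minimal sketch of the kind of token sanitization I mean (the helper name is hypothetical, not from the gist; the idea is just to force a clean UTF-8 round trip before tokens reach SQLite):

```python
def sanitize_token(token: str) -> str:
    """Drop anything that cannot survive a UTF-8 round trip,
    e.g. lone surrogates left over from broken HTML extraction."""
    return token.encode('utf-8', 'ignore').decode('utf-8', 'ignore')

# A lone surrogate like '\ud800' would otherwise reach SQLite
# as malformed UTF-8; after sanitization it is simply dropped.
assert sanitize_token('caf\ud800e') == 'cafe'
```

Well-formed tokens pass through unchanged, so the check is cheap to apply unconditionally in the tokenizer.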
As for the benchmarking results, I didn't see any `ctypes` code in the profile. Most of the time was taken by morphology analysis.
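For reference, that kind of check can be sketched with the standard cProfile module; the `tokenize` function below is just a stand-in for the real tokenizer and morphology analysis:

```python
import cProfile
import io
import pstats

def tokenize(text):
    # Stand-in for the real tokenizer + stemmer pipeline.
    return [word.lower() for word in text.split()]

profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    tokenize('Some sample document text ' * 50)
profiler.disable()

# Print the top entries by cumulative time; if ctypes marshalling
# mattered, it would show up here alongside the tokenizer.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(10)
report = out.getvalue()
```

In my run the hot entries were all on the analysis side, which matches the observation above.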
So far so good. I'll open another issue with further suggestions.
Thank you for sharing your experience and the suggested note! I've never used this with any big data set yet.
About Enum: yes, I totally agree. I've removed the Enum bits. Using Enum here is overkill and doesn't bring any improvement. It also adds an unnecessary dependency for Python versions prior to 3.4 and may cause a performance penalty.
Absolutely. Good we have an easy consensus :-)
Also I'd like to note that

```python
class SQLiteResultCodes:
    SQLITE_OK = 0
    SQLITE_DONE = 101
```

is good practice if you want to encapsulate the status codes. It is a little more readable than plain variables, because the class name serves a self-documentation purpose. My point was only about `enum` usage. Though I leave it to your discretion.
I'll keep them as plain variables. From my point of view, SQLITE_OK and SQLITE_DONE are obviously SQLite status codes in this situation.
First of all, the proof of concept is good, as it handles all the hard (for a Pythonista) `ctypes` stuff behind the scenes. Performance-wise it is yet to be tested, but it is helpful anyway. Playing with it, I've stumbled over a couple of bugs.

Bugs
Infinite recursion in MATCH
The documentation comment in `fts3_tokenizer.h` for `xNext` says that the offsets refer to the original input token, not the normalized token. What consequence does this have?
The test case leads to infinite recursion when executing a SELECT query. It doesn't, however, with INSERT.
Empty normalized token
If the normalized token is an empty string, it should not be returned from `xNext`; rather, processing should advance to the next token [1]. The following fails with `Error: SQL logic error or missing database`.

Suggested changes

For the first bug, I suggest also returning the begin and end indices of the input (pre-normalized) token, i.e. `(normalizedToken, inputBeginIndex, inputEndIndex)`. For the second bug, I suggest implementing an empty-token skip in `xNext`. Here's the gist with the patch and test case.
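The two suggested changes can be sketched together. This is only an illustration under my assumptions (the function names are mine, not the gist's API): the tokenizer yields byte offsets into the original UTF-8 input, as the `xNext` contract requires, and skips tokens whose normalized form comes out empty instead of returning them:

```python
import re

def normalize(token: str) -> str:
    # Stand-in for a real stemmer; it may legitimately return ''
    # for some inputs (e.g. punctuation-only tokens).
    return token.lower().strip("'")

def tokenize(text: str):
    """Yield (normalizedToken, inputBeginIndex, inputEndIndex).

    The offsets are byte positions of the *original* input token in
    the UTF-8 buffer, not positions in the normalized token.
    """
    data = text.encode('utf-8')
    for match in re.finditer(rb"[\w']+", data):  # ASCII-only, for brevity
        normalized = normalize(match.group().decode('utf-8'))
        if not normalized:
            # Fix for the second bug: advance to the next token
            # instead of returning an empty string from xNext.
            continue
        yield normalized, match.start(), match.end()
```

For example, `list(tokenize("Don't STOP"))` yields `[("don't", 0, 5), ('stop', 6, 10)]`: the tokens are normalized, but the offsets still point at the original input.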