hideaki-t / sqlite-fts-python

A Python binding of SQLite Full Text Search Tokenizer
MIT License

Wrong xNext behaviour #1

Closed by saaj 10 years ago

saaj commented 10 years ago

First of all, the proof of concept is good: it handles all of the hard (for a Pythonista) ctypes stuff behind the scenes. Performance-wise it has yet to be tested, but it is helpful in any case. While playing with it I've stumbled over a couple of bugs.

Bugs

Infinite recursion in MATCH

The quote from fts3_tokenizer.h about xNext:

The input text that generated the token is identified by the byte offsets returned in piStartOffset and piEndOffset. piStartOffset should be set to the index of the first byte of the token in the input buffer. piEndOffset should be set to the index of the first byte just past the end of the token in the input buffer.

That says the offsets refer to the input text, not to the normalized token. What consequence does this have?

import sqlite3
import unittest

import sqlitefts as fts  # assumed import name for the binding


class Tokenizer(fts.Tokenizer):

  def tokenize(self, text):
    # drop the last character of each word; the normalized token differs from the input text
    return (w[0:-1] for w in text.split(' '))

class TestCase(unittest.TestCase):

  def setUp(self):
    name = 'test'
    conn = sqlite3.connect(':memory:')

    # register the custom tokenizer and create an FTS4 table that uses it
    fts.register_tokenizer(conn, name, fts.make_tokenizer_module(Tokenizer()))
    conn.execute('CREATE VIRTUAL TABLE fts USING FTS4(tokenize={})'.format(name))

    self.testee = conn

  def testInfiniteRecursion(self):
    contents = [('abc def',), ('abc xyz',)]
    result = self.testee.executemany('INSERT INTO fts VALUES(?)', contents)
    self.assertEqual(2, result.rowcount)

    # this MATCH query is the one that hangs
    result = self.testee.execute("SELECT * FROM fts WHERE fts MATCH 'abc'").fetchall()
    self.assertEqual(2, len(result))

The test case runs into infinite recursion when executing the SELECT query. It doesn't, however, during the INSERT.
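
To make the consequence concrete (a simplified reading on my part, not a verified trace): every normalized token produced by the example tokenizer is shorter than the input text it came from, so offsets derived from the normalized token cannot describe the input buffer.

# The MATCH query string 'abc' is normalized to 'ab' by the example tokenizer.
query = 'abc'
normalized = query[0:-1]  # 'ab'
# piEndOffset must be the index just past the *input* token, i.e. 3 here;
# deriving it from the normalized token would give 2, so the query string is
# presumably never fully consumed, which would explain the hang at query time.
assert len(query) == 3 and len(normalized) == 2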

Empty normalized token

If the normalized token is an empty string, it should not be returned from xNext; instead, processing should advance to the next token. The following fails with Error: SQL logic error or missing database.

class TestCase(unittest.TestCase):

  # setUp is the same as in the previous test case

  def testZeroLengthToken(self):
    # the trailing 'I' is normalized to an empty string by the example tokenizer
    result = self.testee.executemany('INSERT INTO fts VALUES(?)', [('Make things I',)])
    self.assertEqual(1, result.rowcount)

Suggested changes

For the first bug I suggest also returning the begin and end indices of the input (pre-normalized) token, i.e. (normalizedToken, inputBeginIndex, inputEndIndex). For the second bug I suggest skipping empty tokens in xNext. Here's the gist with the patch and test case.
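
To illustrate the first suggestion (a rough sketch only; OffsetTokenizer is a made-up name here, the actual patch is in the gist):

class OffsetTokenizer(fts.Tokenizer):

  def tokenize(self, text):
    # yield (normalizedToken, inputBeginIndex, inputEndIndex) so the offsets
    # always describe the original input, not the normalized token
    pos = 0
    for word in text.split(' '):
      begin = text.index(word, pos)   # locate the input token
      end = begin + len(word)         # one past its last character
      pos = end
      normalized = word[0:-1]         # same normalization as above
      if normalized:                  # empty tokens are skipped (the patch does this in xNext)
        yield normalized, begin, end

(These are character indices; per the header quoted above, xNext ultimately wants byte offsets into the input buffer.)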

hideaki-t commented 10 years ago

Thank you for pointing this out. I've just restarted working on this; what a coincidence :) I'll take a look.

saaj commented 10 years ago

Great :-) But I was too hasty with the report. The suggested tokenizer return of (inputToken, normalizedToken) doesn't work with MATCH in general: SQLite sends broken UTF-8 to the tokenizer. I'm currently debugging it.

saaj commented 10 years ago

Updated the gist and the report.

hideaki-t commented 10 years ago

Thanks!

I've merged your suggested changes and committed them together with some changes of my own.

It broke the Tokenizer.tokenize interface, so the existing test cases needed to be updated; I applied your patch and adjusted the tests accordingly.

saaj commented 10 years ago

You're welcome.

I've looked at the changes. They look good except for one thing: the code is now Python 3.4 only, since there is no enum module in earlier versions. enum is mostly needed in advanced cases where you need to enumerate constants, translate them, and so on. When it's just a matter of attribute access, using enum is overkill, so I suggest simply keeping the plain class SQLiteResultCodes:, which gives wide compatibility and doesn't make the code any less readable.

Also, I'd like to share some experience. Once I had the changes from the gist in place, I started testing with something close to a real-world scenario and data. I had a database of ~2.5k documents, about 32 MB of Unicode text converted from HTML. I added a real tokenizer and stemmer and tried to convert it into an SQLite database with the registered tokenizer.

At first it gave me some trouble: I got serious SQLite errors (like DatabaseError: database disk image is malformed) and even Python process crashes ("double free or corruption" from libsqlite3). Initially I thought something was badly broken in the FFI ctypes code, but after some analysis and debugging I realized the problem was broken Unicode coming from HTML text extraction issues. After I fixed the encoding and sanitized the resulting tokens it was fine. So it's important to note that SQLite (3.7.15.2 in my case) is fragile when the tokenizer output is badly formed.
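
For illustration, the sanitization can be as simple as a filter that drops anything that won't encode cleanly (a rough sketch, not my exact code; sanitize is just an illustrative helper):

def sanitize(tokens):
  # drop empty tokens and tokens that are not valid UTF-8, e.g. stray
  # surrogates left over from sloppy HTML text extraction
  for token in tokens:
    if not token:
      continue
    try:
      token.encode('utf-8')
    except UnicodeEncodeError:
      continue
    yield token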

As for the benchmarking results, I didn't see any ctypes code in the profiling output; most of the time was spent in morphological analysis.
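
Checking this is straightforward with the standard profiler, e.g. (populate here is just a stand-in for whatever code fills the FTS table with the documents):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
populate(conn, documents)  # hypothetical helper that inserts all documents, driving the tokenizer
profiler.disable()
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)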

So far so good. I'll open another issue with further suggestions.

hideaki-t commented 10 years ago

Thank you for sharing your experience and the suggested note! I've never used this with any big data set yet.

About Enum: yes, I totally agree. I've removed the Enum code. Using Enum for this is overkill and doesn't bring any improvement; it also adds an unnecessary dependency for Python versions before 3.4 and may carry a performance penalty.

saaj commented 10 years ago

Absolutely. Good that we reached consensus so easily :-)

Also I'd like to note that

class SQLiteResultCodes:
  SQLITE_OK = 0
  SQLITE_DONE = 101

is good practice if you want to encapsulate the status codes. It is a little more readable than plain variables, since the class name serves a self-documenting purpose. My remark was only about the enum usage. Still, I leave it to your discretion.

hideaki-t commented 10 years ago

I'll keep them as plain variables. From my point of view, SQLITE_OK and SQLITE_DONE are obviously SQLite status codes in this situation.
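
That is, just module-level constants with the same values as in the class above:

SQLITE_OK = 0
SQLITE_DONE = 101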