hideaki-t / sqlite-fts-python

A Python binding of SQLite Full Text Search Tokenizer
MIT License

Unicode Error for single word queries on FTS3 #10

Closed ramcharran closed 6 years ago

ramcharran commented 7 years ago

All the advanced queries in my application work, except for single-word searches like

SELECT title, book, author, link, snippet(text_idx) FROM text_idx WHERE text_idx MATCH 'possumus';

and OR searches like

SELECT title, book, author, link, snippet(text_idx) FROM text_idx WHERE text_idx MATCH 'quam OR Galliae';

The application exits at the line cursor.fetchall() with the following error when queries like the ones above are run:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 105: invalid continuation byte

hideaki-t commented 7 years ago

Hi,

I need more information; I could not reproduce it. As far as I've seen, there is no such byte (0xe2) in the MATCH clauses, and they are not that long (position 105).

Can you provide more details? For example, can you try another query that hits the same expected contents?

It can be caused by the snippet function if it returns unexpected text segments. Maybe you can try casting the snippet to a BLOB, like CAST(snippet(text_idx) AS BLOB).
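
For reference, a minimal sketch of that workaround with apsw (the DB name, table, and query are the ones from this issue; decoding with errors='replace' just makes any bad bytes visible instead of raising):

import apsw

connection = apsw.Connection('texts.db', flags=apsw.SQLITE_OPEN_READWRITE)
c = connection.cursor()
sql = ("SELECT title, CAST(snippet(text_idx) AS BLOB) "
       "FROM text_idx WHERE text_idx MATCH ?")
for title, raw in c.execute(sql, ('possumus',)):
    # raw is bytes; a lenient decode shows where the data goes wrong
    print(title, raw.decode('utf-8', errors='replace'))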

ramcharran commented 7 years ago

My code is pretty much the following:

import apsw
import sqlitefts as fts

connection = apsw.Connection('texts.db', flags=apsw.SQLITE_OPEN_READWRITE)
c = connection.cursor()
c.execute("SELECT title, book, author, link, snippet(text_idx) FROM text_idx WHERE text_idx MATCH 'possumus';")
r1 = c.fetchall()
# display the results as HTML

The stack trace is:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1997, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1985, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1540, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.5/dist-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.5/dist-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/ramcharran/phyllo/search/app.py", line 123, in search
    r = do_search(query)
  File "/home/ramcharran/phyllo/search/app.py", line 90, in do_search
    r1 = c.fetchall()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 105: invalid continuation byte

ramcharran commented 7 years ago

I tried CAST(snippet(text_idx) AS BLOB) and it gives me the following stack trace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/sqlitefts/tokenizer.py", line 146, in xopen
    tokens = tokenizer.tokenize(ffi.string(pInput).decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 4031: invalid start byte
hideaki-t commented 7 years ago

I am not sure it is the same issue as yours, but I could reproduce a similar problem by returning an invalid offset from a tokenizer.

I made a change in the debug_offsets branch to debug this issue. make_tokenizer_module has an extra argument debug; if it is True, it checks the offsets returned from a tokenizer, and if it detects a wrong offset, it aborts the query. See https://github.com/hideaki-t/sqlite-fts-python/blob/eeb7ed85d75bd73ffd87ef75a959847b07053ce4/tests/test_base.py#L290
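
A rough sketch of how that fits together, following the README-style tokenizer (the debug keyword only exists on the debug_offsets branch mentioned above; the key point is that the yielded offsets must be byte offsets into the UTF-8 encoding of the text):

import re
import apsw
import sqlitefts as fts

class SimpleTokenizer(fts.Tokenizer):
    _p = re.compile(r'\w+', re.UNICODE)

    def tokenize(self, text):
        for m in self._p.finditer(text):
            token = m.group()
            # FTS3 wants *byte* offsets into the UTF-8 text, not
            # character offsets -- wrong values here break snippet()
            start = len(text[:m.start()].encode('utf-8'))
            yield token, start, start + len(token.encode('utf-8'))

connection = apsw.Connection(':memory:')
# debug=True (debug_offsets branch) validates each offset and
# aborts the query when the tokenizer returns a bad one
tok = fts.make_tokenizer_module(SimpleTokenizer(), debug=True)
fts.register_tokenizer(connection, 'simple_tok', tok)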

ramcharran commented 7 years ago

So the decoding error can be related to invalid tokenizer offsets as well? Also, the query works just fine when the search term consists of two or more words.

hideaki-t commented 7 years ago

Could be, but I am not sure. At least, it can cause this if the text contains non-ASCII chars, but maybe there is something I am missing.

If an offset returned from a tokenizer is wrong, SQLite will use that wrong offset to build the text for snippet. A text is a byte string at the SQLite level, so a wrong offset does not crash it unless it exceeds the text length. When the byte slice that was cut at the wrong offset comes up to the Python level, it has to be decoded into a Unicode string, and the decoder (UTF-8 bytes → unicode str) complains that it cannot decode the given byte string.
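
A quick way to see this mechanism in isolation (plain Python, nothing from this library):

text = 'possumus — et cetera'   # the em dash (U+2014) is 3 bytes in UTF-8: e2 80 94
data = text.encode('utf-8')

# a slice on character boundaries decodes fine
print(data[0:8].decode('utf-8'))                 # possumus

# slices that cut into the middle of the em dash, like a bad offset would
for bad in (data[0:10], data[10:]):
    try:
        bad.decode('utf-8')
    except UnicodeDecodeError as e:
        print(e)   # complains about byte 0xe2 / an invalid start byte, as in the tracebacks above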

Yeah, I have no idea why searching by multiple words works fine; I cannot reproduce this issue any other way so far. Do the contents returned by a multiple-word query contain non-ASCII chars? If so, was the snippet what you expected?

hideaki-t commented 7 years ago

fts3tokenize can be used to check tokenizer output, see https://sqlite.org/fts3.html#querying_tokenizers. The offsets function can be used as well: https://sqlite.org/fts3.html#the_offsets_function

Using a pair of position and length from "offsets" against the text (in UTF-8) should point to a valid UTF-8 string. https://github.com/hideaki-t/sqlite-fts-python/blob/8568a348ff773a4c27b0dd5b2d8b3168ddd1381a/tests/test_base.py#L313-L320
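
A sketch of that kind of check with the built-in sqlite3 module and the built-in 'simple' tokenizer (assuming your SQLite build has FTS3/FTS4 enabled; substitute your registered tokenizer name to test your own):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE VIRTUAL TABLE tok USING fts3tokenize(simple)")

text = 'quam possumus — Galliae'
data = text.encode('utf-8')
for token, start, end in conn.execute(
        'SELECT token, start, end FROM tok WHERE input = ?', (text,)):
    # start/end are byte offsets; every slice must decode as valid UTF-8
    print(token, data[start:end].decode('utf-8'))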

ramcharran commented 7 years ago

Is there any other way to change the default decoding to UTF-16 or Latin-1?

hideaki-t commented 7 years ago

No. Since the standard sqlite3 module (and it seems like APSW as well) uses UTF-8 for strings, I've never had the idea of using an encoding other than UTF-8.

What should be decoded as a UTF-16 or Latin-1 string? Is your data stored in an encoding other than UTF-8? It could be done, but it may be tricky: assume a DB uses UTF-16 to store text; SQLite calls the tokenizer with that text, and this module has to decode it to a Unicode string and pass it to the user-defined tokenizer.

ramcharran commented 7 years ago

The query for a single word works without using snippet. But in my application I need it to highlight the search term in the result. Is there any alternative to snippet?

hideaki-t commented 7 years ago

You have access to the original text, so you can implement your own snippet function. For a single-word query it can be done easily, by just searching for the query word in the matched text column(s), as in the sketch below.
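
For instance, a naive Python-side snippet along those lines (simple_snippet and the <b> markers are just illustrative choices, not part of this library):

def simple_snippet(text, word, context=40):
    # return the first match of word in text with some surrounding context
    i = text.lower().find(word.lower())
    if i < 0:
        return text[:2 * context]
    s = max(0, i - context)
    e = min(len(text), i + len(word) + context)
    hit = text[i:i + len(word)]                  # preserve original casing
    return (('...' if s > 0 else '') + text[s:i]
            + '<b>' + hit + '</b>'
            + text[i + len(word):e] + ('...' if e < len(text) else ''))

print(simple_snippet('quid facere possumus nisi hoc', 'possumus'))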

If the tokenizer returns a valid position and offset, it should be fine. I know it is sometimes not straightforward when the underlying tokenizer does not return byte offsets. https://github.com/hideaki-t/sqlite-fts-python/blob/master/tests/test_igo.py#L17 https://github.com/hideaki-t/sqlite-fts-python/blob/master/tests/test_janome.py#L15
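
The usual fix when a tokenizer reports character offsets is to re-encode the prefix, along these lines (to_byte_offset is a hypothetical helper, not part of the library):

def to_byte_offset(text, char_index):
    # byte offset of a character position in the UTF-8 encoding of text
    return len(text[:char_index].encode('utf-8'))

text = 'quam — Galliae'
assert to_byte_offset(text, 5) == 5   # pure ASCII prefix: offsets agree
assert to_byte_offset(text, 7) == 9   # after the 3-byte em dash they diverge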

ramcharran commented 7 years ago

You mean I can edit tokenizer.py or something else?

hideaki-t commented 7 years ago

I meant that if your tokenizer yields wrong offsets, it should be fixed in order to get snippets. So, first of all, you need to make sure your tokenizer yields valid offsets (position and length). Have you tried the fts3tokenize virtual table or the offsets function? You have the original data and your tokenizer, and you know that something involving 'possumus' causes the issue, so it can be checked simply with fts3tokenize. I have some unit tests using fts3tokenize: https://github.com/hideaki-t/sqlite-fts-python/blob/531723139a9ac2ebbc38e0298e0b7d0a31ba068e/tests/test_apsw.py#L242-L249

And of course you can edit tokenizer.py; if you need to, you can fork it.