Hi,
I need more information; I could not reproduce it. As far as I've seen, there is no such byte (0xe2) in the MATCH clauses, and they are not that long (position 105).
Can you provide more details? For example, can you try another query that hits the same expected contents?
It can also be caused by the snippet function if it returns unexpected text segments. Maybe you can try casting the snippet to a BLOB, like CAST(snippet(text_idx) AS BLOB).
My code is pretty much the following:

```python
import apsw
import sqlitefts as fts

connection = apsw.Connection('texts.db', flags=apsw.SQLITE_OPEN_READWRITE)
c = connection.cursor()
c.execute("SELECT title, book, author, link, snippet(text_idx) "
          "FROM text_idx WHERE text_idx MATCH 'possumus';")
c.fetchall()
# display the results in HTML
```
The stack trace is:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1997, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1985, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1540, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.5/dist-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.5/dist-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/ramcharran/phyllo/search/app.py", line 123, in search
    r = do_search(query)
  File "/home/ramcharran/phyllo/search/app.py", line 90, in do_search
    r1 = c.fetchall()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 105: invalid continuation byte
```
I tried CAST(snippet(text_idx) AS BLOB) and it gives me the following stack trace:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/sqlitefts/tokenizer.py", line 146, in xopen
    tokens = tokenizer.tokenize(ffi.string(pInput).decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 4031: invalid start byte
```
I am not sure it is the same issue as yours, but I could reproduce a similar problem by returning an invalid offset from a tokenizer.
I made a change in the debug_offsets branch to debug this issue: make_tokenizer_module has an extra argument, debug. If it is True, it checks the offsets returned from a tokenizer, and if it detects a wrong offset, it aborts the query. A sketch of how to enable it is below.
See https://github.com/hideaki-t/sqlite-fts-python/blob/eeb7ed85d75bd73ffd87ef75a959847b07053ce4/tests/test_base.py#L290
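Roughly, enabling it would look like this (a sketch assuming the debug_offsets branch; MyTokenizer stands in for your own tokenizer class):

```python
import apsw
import sqlitefts as fts

class MyTokenizer(fts.Tokenizer):
    def tokenize(self, text):
        # must yield (token, begin, end) tuples, where begin/end are
        # byte offsets into the UTF-8 encoding of `text`
        ...

connection = apsw.Connection('texts.db')
# debug=True exists only on the debug_offsets branch; it validates every
# offset and aborts the query instead of handing SQLite a bad byte range
tk_module = fts.make_tokenizer_module(MyTokenizer(), debug=True)
fts.register_tokenizer(connection, 'my_tokenizer', tk_module)
```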
So the decoding error is related to an invalid tokenizer offset as well? Also, the query works just fine if the search term consists of two or more words.
Could be, but I am not sure. At least, it can cause this if the text contains non-ASCII chars, but maybe there is something I am missing.
If an offset returned from a tokenizer is wrong, SQLite will use the wrong offset to extract the text for the snippet. Text is a byte string at the SQLite level, so a wrong offset does not crash it unless it exceeds the text length. But when the byte slice cut at the wrong offset comes up to the Python level, it has to be decoded into a Unicode string, and the decoder (UTF-8 bytes → unicode str) complains that it cannot decode the given bytes.
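Here is a minimal illustration of that failure mode (plain Python, nothing library-specific; the sample text is made up):

```python
text = 'quā possumus'        # 'ā' is two bytes in UTF-8: 0xc4 0x81
data = text.encode('utf-8')

data[:3]                     # b'qu\xc4' -- the slice ends inside 'ā'
data[:3].decode('utf-8')     # UnicodeDecodeError: unexpected end of data

data[3:]                     # b'\x81 possumus' -- starts mid-character
data[3:].decode('utf-8')     # UnicodeDecodeError: invalid start byte
```

Any byte offset that falls inside a multi-byte character produces exactly this kind of error when the slice is decoded.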
Yeah, I have no idea why searching by multiple words works fine. I cannot reproduce this issue in any other way so far. Do the contents returned by a multiple-word query contain non-ASCII chars? If so, was the snippet what you expected?
fts3tokenize can be used to check tokenizer output; see https://sqlite.org/fts3.html#querying_tokenizers. The offsets function can be used as well: https://sqlite.org/fts3.html#the_offsets_function
Using a position/length pair from "offsets" against the text (in UTF-8) should point to a valid UTF-8 string; see https://github.com/hideaki-t/sqlite-fts-python/blob/8568a348ff773a4c27b0dd5b2d8b3168ddd1381a/tests/test_base.py#L313-L320 and the sketch below.
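For example, something like this should reveal bad offsets (the tokenizer name my_tokenizer and the sample text are placeholders; the column names come from the fts3tokenize docs):

```python
c.execute("CREATE VIRTUAL TABLE temp.tok USING fts3tokenize(my_tokenizer)")
sample = 'quā possumus'
data = sample.encode('utf-8')
for token, start, end, position in c.execute(
        "SELECT token, start, end, position FROM temp.tok WHERE input = ?",
        (sample,)):
    # start/end are byte offsets into the UTF-8 input; if they are
    # valid, every slice decodes cleanly
    print(token, data[start:end].decode('utf-8'))
```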
Is there any other way to change the default decoding to UTF-16 or Latin-1?
No. Since the standard sqlite3 module (and it seems APSW as well) uses UTF-8 for strings, I've never had a reason to use an encoding other than UTF-8.
What should be decoded as a UTF-16 or Latin-1 string? Is your data stored in an encoding other than UTF-8? It can be done, but it may be tricky: assume a DB uses UTF-16 to store text; SQLite calls the tokenizer with that text, and this module has to decode it to a Unicode string and pass it to the user-defined tokenizer.
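For what it's worth, the decode is hardcoded in one place; the xopen frame in your second traceback shows it. If you forked the module, that is the line you would have to change (purely illustrative; note that SQLite's FTS layer passes UTF-8 text to tokenizers, so this is probably not the right fix):

```python
# sqlitefts/tokenizer.py, inside xopen (the line from the traceback above)
tokens = tokenizer.tokenize(ffi.string(pInput).decode('utf-8'))
# the codec is fixed here; swapping 'utf-8' for 'latin1' or 'utf-16'
# would change what your tokenize() receives
```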
The query for a single word works without using snippet, but in my application I need it to highlight the search term in the result. Is there any alternative to snippet?
You have access to the original text, so you can implement your own snippet function. For a single-word query it can be done easily, just by searching for the query word in the matched text column(s); a rough sketch follows.
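For example (a minimal sketch; the highlight markup, context width, and case handling are all assumptions):

```python
def my_snippet(text, word, context=30, mark=('<b>', '</b>')):
    # naive single-word snippet: find the word, wrap it in markup,
    # and keep `context` characters on each side
    i = text.lower().find(word.lower())
    if i < 0:
        return text[:2 * context]
    j = i + len(word)
    return (text[max(0, i - context):i]
            + mark[0] + text[i:j] + mark[1]
            + text[j:j + context])

# usage: my_snippet(row_text, 'possumus')
```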
If the tokenizer returns valid positions and offsets, it should be fine. I know it is sometimes not straightforward when an underlying tokenizer does not return byte offsets: https://github.com/hideaki-t/sqlite-fts-python/blob/master/tests/test_igo.py#L17 https://github.com/hideaki-t/sqlite-fts-python/blob/master/tests/test_janome.py#L15
You mean I can edit tokenizer.py or something else?
I meant that if your tokenizer yields wrong offsets, it should be fixed in order to get snippets. So first you need to make sure your tokenizer yields valid offsets (position and length); a quick check is sketched below. Have you tried the fts3tokenize virtual table or the offsets function? You have the original data and your tokenizer, and you know that something containing 'possumus' causes the issue, so it can be checked simply with fts3tokenize. I have some unit tests using fts3tokenize: https://github.com/hideaki-t/sqlite-fts-python/blob/531723139a9ac2ebbc38e0298e0b7d0a31ba068e/tests/test_apsw.py#L242-L249
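You can also check offsets directly in Python, without SQLite in the loop (assuming the (token, begin, end) byte-offset convention; my_tokenizer and the sample text are placeholders):

```python
sample = 'aliquid possumus dīcere'
data = sample.encode('utf-8')
for token, begin, end in my_tokenizer.tokenize(sample):
    # every (begin, end) pair must land on UTF-8 character boundaries;
    # a pair that cuts a character raises the same UnicodeDecodeError
    data[begin:end].decode('utf-8')
```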
And of course you can edit tokenizer.py; if you need to, you can fork it.
All the advanced queries in my application work, except for single-word searches like

```sql
SELECT title, book, author, link, snippet(text_idx) FROM text_idx WHERE text_idx MATCH 'possumus';
```

and OR searches like

```sql
SELECT title, book, author, link, snippet(text_idx) FROM text_idx WHERE text_idx MATCH 'quam OR Galliae';
```
The application exits at the line cursor.fetchall() with the following error when queries similar to the ones above are run:

```
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 105: invalid continuation byte
```