hideaki-t / sqlite-fts-python

A Python binding of SQLite Full Text Search Tokenizer
MIT License
45 stars 11 forks source link

Doesn't work with SQLite 3.37 #30

Open macabrus opened 1 year ago

macabrus commented 1 year ago

I tried using it on mac with sqlite compiled with fts5 extension built in:

sqlite --version
3.37.0 2021-12-09 01:34:53 9ff244ce0739f8ee52a3e9671adb4ee54c83c640b02e3f9d185fd2f9a179aapl

sqlite> PRAGMA compile_options;
┌──────────────────────────────────┐
│         compile_options          │
├──────────────────────────────────┤
│ ATOMIC_INTRINSICS=1              │
│ BUG_COMPATIBLE_20160819          │
│ COMPILER=clang-13.1.6            │
│ DEFAULT_AUTOVACUUM               │
│ DEFAULT_CACHE_SIZE=2000          │
│ DEFAULT_CKPTFULLFSYNC            │
│ DEFAULT_FILE_FORMAT=4            │
│ DEFAULT_JOURNAL_SIZE_LIMIT=32768 │
│ DEFAULT_LOOKASIDE=1200,102       │
│ DEFAULT_MEMSTATUS=0              │
│ DEFAULT_MMAP_SIZE=0              │
│ DEFAULT_PAGE_SIZE=4096           │
│ DEFAULT_PCACHE_INITSZ=20         │
│ DEFAULT_RECURSIVE_TRIGGERS       │
│ DEFAULT_SECTOR_SIZE=4096         │
│ DEFAULT_SYNCHRONOUS=2            │
│ DEFAULT_WAL_AUTOCHECKPOINT=1000  │
│ DEFAULT_WAL_SYNCHRONOUS=1        │
│ DEFAULT_WORKER_THREADS=0         │
│ ENABLE_API_ARMOR                 │
│ ENABLE_BYTECODE_VTAB             │
│ ENABLE_COLUMN_METADATA           │
│ ENABLE_DBPAGE_VTAB               │
│ ENABLE_DBSTAT_VTAB               │
│ ENABLE_EXPLAIN_COMMENTS          │
│ ENABLE_FTS3                      │
│ ENABLE_FTS3_PARENTHESIS          │
│ ENABLE_FTS3_TOKENIZER            │
│ ENABLE_FTS4                      │
│ ENABLE_FTS5                      │
│ ENABLE_JSON1                     │
│ ENABLE_LOCKING_STYLE=1           │
│ ENABLE_NORMALIZE                 │
│ ENABLE_PREUPDATE_HOOK            │
│ ENABLE_RTREE                     │
│ ENABLE_SESSION                   │
│ ENABLE_SNAPSHOT                  │
│ ENABLE_SQLLOG                    │
│ ENABLE_STMT_SCANSTATUS           │
│ ENABLE_UNKNOWN_SQL_FUNCTION      │
│ ENABLE_UPDATE_DELETE_LIMIT       │
│ HAVE_ISNAN                       │
│ MALLOC_SOFT_LIMIT=1024           │
│ MAX_ATTACHED=10                  │
│ MAX_COLUMN=2000                  │
│ MAX_COMPOUND_SELECT=500          │
│ MAX_DEFAULT_PAGE_SIZE=8192       │
│ MAX_EXPR_DEPTH=1000              │
│ MAX_FUNCTION_ARG=127             │
│ MAX_LENGTH=2147483645            │
│ MAX_LIKE_PATTERN_LENGTH=50000    │
│ MAX_MMAP_SIZE=1073741824         │
│ MAX_PAGE_COUNT=1073741823        │
│ MAX_PAGE_SIZE=65536              │
│ MAX_SQL_LENGTH=1000000000        │
│ MAX_TRIGGER_DEPTH=1000           │
│ MAX_VARIABLE_NUMBER=500000       │
│ MAX_VDBE_OP=250000000            │
│ MAX_WORKER_THREADS=8             │
│ MUTEX_UNFAIR                     │
│ OMIT_AUTORESET                   │
│ OMIT_LOAD_EXTENSION              │
│ STMTJRNL_SPILL=131072            │
│ SYSTEM_MALLOC                    │
│ TEMP_STORE=1                     │
│ THREADSAFE=2                     │
│ USE_URI                          │
└──────────────────────────────────┘

I am registering custom tokenizer as follows:

@asynccontextmanager
async def make_conn() -> Connection:
    async with aiosqlite.connect('test.db',
                                 detect_types=sqlite3.PARSE_DECLTYPES | sqlite3.PARSE_COLNAMES,
                                 check_same_thread=False) as db:
        try:
            await db.enable_load_extension(True)
            pragmas = await db.execute_fetchall('pragma compile_options;')
            if ('ENABLE_FTS5',) not in pragmas:
                await db.load_extension('fts5')
            await db.load_extension('mod_spatialite')
            await db.execute('pragma journal_mode=WAL')
            # hacky... but shouldn't be used really...
            # we need to register it in async manner...
            fts5.register_tokenizer(db._connection, 'croatian_generic', custom_tokenizer_module)
            db.row_factory = aiosqlite.Row
            yield db
            await db.commit()
        except Exception as e:
            await db.execute('rollback')
            raise e

But I get following error:

Traceback (most recent call last):
  File "/Users/.../Projects/.../.../sqlite-testing/src/db.py", line 32, in make_conn
    fts5.register_tokenizer(db._connection, 'croatian_generic', cro_tokenizer_mod)
  File "/Users/.../Projects/.../.../sqlite-testing/.venv/lib/python3.10/site-packages/sqlitefts/fts5.py", line 171, in register_tokenizer
    r = fts5api.xCreateTokenizer(fts5api,
AttributeError: cdata 'fts5_api * *' has no attribute 'xCreateTokenizer'

Reproduced the same issue in full working example without aiosqlite module:

import sqlite3

from sqlitefts import fts5

class CroatianTokenizer(fts5.FTS5Tokenizer):
    pattern = re.compile(r'\w+', re.UNICODE)

    def tokenize(self, text, flags=None):
        for match in self.pattern.finditer(text):
            start, end = match.span()
            token = text[start:end]
            length = len(token.encode('utf-8'))
            position = len(text[:start].encode('utf-8'))
            print(f'token: {token} {position} {position + length}')
            yield token, position, position + length

cro_tokenizer_mod = fts5.make_fts5_tokenizer(CroatianTokenizer())

with sqlite3.connect('simple.db') as db:
    fts5.register_tokenizer(db, 'croatian_generic', cro_tokenizer_mod)
    db.execute("create virtual table test using fts5(a,b,c,tokenize='croatian_generic')")
    data = [
        ('a b c d e', 'f g h i j', 'k l m n o'),
        ('a b c d e', 'f g h i j', 'k l m n o'),
        ('a b c d e', 'f g h i j', 'k l m n o'),
        ('a b c d e', 'f g h i j', 'k l m n o')
    ]
    db.executemany('insert into test values(?, ?, ?)', data)
    with db.cursor() as cur:
        for row in cur.execute('select * from test where test match ?', ('a',)):
            print(row)

EDIT: I suspect it might have to do with breaking changes introduced in 3.20. (see: point 3.)

macabrus commented 1 year ago

I narrowed it down to:

rc = dll.sqlite3_prepare_v2(db, b'SELECT fts5(?1)', -1, pStmt, ffi.NULL)
assert rc == 21 # SQLITE_MISUSE?

here

hideaki-t commented 1 year ago

Thank you for the report, and information i will take a look.

hideaki-t commented 1 year ago

Ok, I confirmed this issue in GitHub Actions (https://github.com/actions/runner-images/blob/main/images/macos/macos-11-Readme.md macos-11/Python 3.10.8/SQLite 3.39.4, likely the latest official release for 3.10.x https://github.com/python/cpython/pull/98639).

I realized that maybe this is the issue I couldn't solve and the reason I switched using self build SQLite for macOS. (see I was strgging https://github.com/hideaki-t/sqlite-fts-python/pull/26/commits/6787748aaaa3e41b5e8547f7eae54db7721e4166, https://github.com/hideaki-t/sqlite-fts-python/pull/26/commits/88f57100ee26460f27c1efa0819083f47a86855c, https://github.com/hideaki-t/sqlite-fts-python/pull/26/commits/bdf9c531a91669f3716bff4bffa10bea4c43affe ...)

can you check the SQLite version and its compile options used by Python?

❯ python
Python 3.10.8 (main, Nov  1 2022, 14:18:21) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3, pprint
>>> sqlite3.connect(':memory:').execute('select sqlite_version()').fetchone()
('3.40.0',)
>>> pprint.pprint(sqlite3.connect(':memory:').execute('pragma compile_options').fetchall())
[('ATOMIC_INTRINSICS=1',),
 ('COMPILER=gcc-12.2.0',),
 ('DEFAULT_AUTOVACUUM',),
 ('DEFAULT_CACHE_SIZE=-2000',),
 ('DEFAULT_FILE_FORMAT=4',),
 ('DEFAULT_JOURNAL_SIZE_LIMIT=-1',),
 ('DEFAULT_MMAP_SIZE=0',),
 ('DEFAULT_PAGE_SIZE=4096',),
 ('DEFAULT_PCACHE_INITSZ=20',),
 ('DEFAULT_RECURSIVE_TRIGGERS',),
 ('DEFAULT_SECTOR_SIZE=4096',),
 ('DEFAULT_SYNCHRONOUS=2',),
 ('DEFAULT_WAL_AUTOCHECKPOINT=1000',),
 ('DEFAULT_WAL_SYNCHRONOUS=2',),
 ('DEFAULT_WORKER_THREADS=0',),
 ('ENABLE_COLUMN_METADATA',),
 ('ENABLE_DBSTAT_VTAB',),
 ('ENABLE_FTS3',),
 ('ENABLE_FTS3_TOKENIZER',),
 ('ENABLE_FTS4',),
 ('ENABLE_FTS5',),
 ('ENABLE_MATH_FUNCTIONS',),
 ('ENABLE_RTREE',),
 ('ENABLE_STMTVTAB',),
 ('ENABLE_UNLOCK_NOTIFY',),
 ('HAVE_ISNAN',),
 ('MALLOC_SOFT_LIMIT=1024',),
 ('MAX_ATTACHED=10',),
 ('MAX_COLUMN=2000',),
 ('MAX_COMPOUND_SELECT=500',),
 ('MAX_DEFAULT_PAGE_SIZE=8192',),
 ('MAX_EXPR_DEPTH=10000',),
 ('MAX_FUNCTION_ARG=127',),
 ('MAX_LENGTH=1000000000',),
 ('MAX_LIKE_PATTERN_LENGTH=50000',),
 ('MAX_MMAP_SIZE=0x7fff0000',),
 ('MAX_PAGE_COUNT=1073741823',),
 ('MAX_PAGE_SIZE=65536',),
 ('MAX_SQL_LENGTH=1000000000',),
 ('MAX_TRIGGER_DEPTH=1000',),
 ('MAX_VARIABLE_NUMBER=250000',),
 ('MAX_VDBE_OP=250000000',),
 ('MAX_WORKER_THREADS=8',),
 ('MUTEX_PTHREADS',),
 ('SECURE_DELETE',),
 ('SYSTEM_MALLOC',),
 ('TEMP_STORE=1',),
 ('THREADSAFE=1',)]

by the way, the breaking change should be checked at https://github.com/hideaki-t/sqlite-fts-python/blob/2808e9165d26e56e869fd633641fd29c2adce6f1/sqlitefts/fts5.py#L146-L148

hideaki-t commented 1 year ago

I could reproduce it on Linux too. it seems it happens when SQLite is static? linked to _sqlite3.so I will check what is wrong.

$ ldd ~/.pyenv/versions/3.10.8/lib/python3.10/lib-dynload/_sqlite3.cpython-310-x86_64-linux-gnu.so
        linux-vdso.so.1 (0x00007ffe2ad40000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007f92bd577000)
        /usr/lib64/ld-linux-x86-64.so.2 (0x00007f92bd8cd000)
$ ldd /usr/lib/python3.10/lib-dynload/_sqlite3.cpython-310-x86_64-linux-gnu.so
        linux-vdso.so.1 (0x00007ffdd8bf6000)
        libsqlite3.so.0 => /usr/lib/libsqlite3.so.0 (0x00007f012aa5e000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007f012a877000)
        libm.so.6 => /usr/lib/libm.so.6 (0x00007f012a78f000)
        /usr/lib64/ld-linux-x86-64.so.2 (0x00007f012abdd000)
macabrus commented 1 year ago

Sorry missed notifications. I will look into this asap.

>>> import sqlite3
>>> sqlite3.connect(':memory:').execute('select sqlite_version()').fetchone()
('3.39.4',)
>>> import pprint
>>> pprint.pprint(sqlite3.connect(':memory:').execute('pragma compile_options').fetchall())
[('ATOMIC_INTRINSICS=1',),
 ('COMPILER=clang-14.0.0',),
 ('DEFAULT_AUTOVACUUM',),
 ('DEFAULT_CACHE_SIZE=-2000',),
 ('DEFAULT_FILE_FORMAT=4',),
 ('DEFAULT_JOURNAL_SIZE_LIMIT=-1',),
 ('DEFAULT_MMAP_SIZE=0',),
 ('DEFAULT_PAGE_SIZE=4096',),
 ('DEFAULT_PCACHE_INITSZ=20',),
 ('DEFAULT_RECURSIVE_TRIGGERS',),
 ('DEFAULT_SECTOR_SIZE=4096',),
 ('DEFAULT_SYNCHRONOUS=2',),
 ('DEFAULT_WAL_AUTOCHECKPOINT=1000',),
 ('DEFAULT_WAL_SYNCHRONOUS=2',),
 ('DEFAULT_WORKER_THREADS=0',),
 ('ENABLE_COLUMN_METADATA',),
 ('ENABLE_FTS3',),
 ('ENABLE_FTS3_PARENTHESIS',),
 ('ENABLE_FTS4',),
 ('ENABLE_FTS5',),
 ('ENABLE_GEOPOLY',),
 ('ENABLE_MATH_FUNCTIONS',),
 ('ENABLE_PREUPDATE_HOOK',),
 ('ENABLE_RTREE',),
 ('ENABLE_SESSION',),
 ('MALLOC_SOFT_LIMIT=1024',),
 ('MAX_ATTACHED=10',),
 ('MAX_COLUMN=2000',),
 ('MAX_COMPOUND_SELECT=500',),
 ('MAX_DEFAULT_PAGE_SIZE=8192',),
 ('MAX_EXPR_DEPTH=1000',),
 ('MAX_FUNCTION_ARG=127',),
 ('MAX_LENGTH=1000000000',),
 ('MAX_LIKE_PATTERN_LENGTH=50000',),
 ('MAX_MMAP_SIZE=0x7fff0000',),
 ('MAX_PAGE_COUNT=1073741823',),
 ('MAX_PAGE_SIZE=65536',),
 ('MAX_SQL_LENGTH=1000000000',),
 ('MAX_TRIGGER_DEPTH=1000',),
 ('MAX_VARIABLE_NUMBER=250000',),
 ('MAX_VDBE_OP=250000000',),
 ('MAX_WORKER_THREADS=8',),
 ('MUTEX_PTHREADS',),
 ('SYSTEM_MALLOC',),
 ('TEMP_STORE=1',),
 ('THREADSAFE=1',)]

Since I posted this issue I tried upgrading system's sqlite package (brew's sqlite3) to latest published version but the issue still persists.

I also tried building pysqlite3 package from custom amalgamation source but couldn't make it work on MacOS. Do you also use this library to change sqlite build independent from python distribution? I see that following steps in README also builds statically linked binary which, as you mentioned, is maybe a problem.

EDIT: I see this commit now that you posted. Will try building sqlite from source with your options. I struggled to find good steps for building custom dynamically linked sqlite on mac os.

hideaki-t commented 1 year ago

Hi @macabrus,

I managed make the pipeline pass (see https://github.com/hideaki-t/sqlite-fts-python/actions/runs/3710604666 ) can you try https://github.com/hideaki-t/sqlite-fts-python/tree/GH-30 ?

I've never used pysqlite3. as you and I noticed, it seems static linking is the source of the problem. I found that when I was debbuging sqlitefts and APSW combination, but I totally forgot about it.

If your python was from Python.org, it seems sqlite3 is statically linked to _sqlite3 module. so separately installed/upgraded SQLite may not be used. (but I don't have Mac environment, not sure...)