hideaki-t / sqlite-fts-python

A Python binding of SQLite Full Text Search Tokenizer
MIT License
46 stars 11 forks

Using cffi seems to cause errors in a complex GC program. #22

Open svjack opened 3 years ago

svjack commented 3 years ago

Hi, I use this project together with sqlite_fts4 to register a custom tokenizer and ranking function in an engine map of SQLite databases. The main interaction between SQLite and Python is the `register_functions` defined in sqlite_fts4 and your `register_tokenizer`, plugged in as your example shows. I tried my Chinese tokenizer locally with one engine as follows:

```python
import jieba
import sqlitefts as fts

class JiebaTokenizer(fts.Tokenizer):
    def tokenize(self, text):
        # jieba yields character offsets; convert them to UTF-8 byte offsets
        for t, s, e in jieba.tokenize(text):
            l = len(t.encode("utf-8"))
            p = len(text[:s].encode("utf-8"))
            yield t, p, p + l

contents = [
    ("これは日本語で書かれています",),
    (" これは 日本語の文章を 全文検索するテストです",),
    ("新兴铸管",),
]

tkj = fts.make_tokenizer_module(JiebaTokenizer())
fts.register_tokenizer(conn, "jieba_tokenizer", tkj)

conn.execute("CREATE VIRTUAL TABLE fts USING FTS4(tokenize={})".format("jieba_tokenizer"))

c = conn
r = c.executemany("INSERT INTO fts VALUES(?)", contents)
r = c.execute("SELECT * FROM fts").fetchall()
r = c.execute("SELECT * FROM fts WHERE fts MATCH '新兴'").fetchall()
```

The last `r` produces the expected result.

My problem is that when I use it in a dictionary of engines (key = name, value = engine) with some more complex interactions (registrations), it produces the following error under gdb:

```
Program received signal SIGSEGV, Segmentation fault.
0x0000555555690253 in delete_garbage.isra.26 (old=0x5555558c7540 <_PyRuntime+416>, collectable=0x7fffffffda30)
    at /tmp/build/80754af9/python_1599203911753/work/Modules/gcmodule.c:948
948    /tmp/build/80754af9/python_1599203911753/work/Modules/gcmodule.c: No such file or directory.
```

This seems to be an error caused by cffi. Related questions:

- https://stackoverflow.com/questions/43079945/why-is-there-a-segmentation-fault-with-this-code
- https://stackoverflow.com/questions/41577144/how-to-solve-a-sqlite-fts-segmentation-fault-in-python

Some say cffi has problems with nested objects, and that this kind of problem can be solved by replacing cffi with pybind11. Can you give me some suggestions? If needed, I will upload the whole code so the error can be reproduced. Thank you.
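For reference, one common cause of FFI-related segfaults is the Python-side callback object being garbage-collected while C code still holds its function pointer. Whether that is what happens here is only a guess, but the lifetime rule can be sketched with a stdlib `ctypes` analogue (this is an analogy using libc's `qsort`, not sqlitefts or cffi code):

```python
import ctypes

# A callback object handed to C must stay referenced on the Python side,
# or the GC may free it while the C library still holds the pointer,
# which typically ends in a segfault. The same rule is assumed to apply
# to cffi callbacks such as those inside a tokenizer module.
libc = ctypes.CDLL(None)  # main-program handle (Linux/macOS)

CMPFUNC = ctypes.CFUNCTYPE(ctypes.c_int,
                           ctypes.POINTER(ctypes.c_int),
                           ctypes.POINTER(ctypes.c_int))

def py_cmp(a, b):
    return a[0] - b[0]

cmp_cb = CMPFUNC(py_cmp)  # keep this reference alive as long as C may call it

arr = (ctypes.c_int * 5)(5, 1, 7, 33, 99)
libc.qsort(arr, len(arr), ctypes.sizeof(ctypes.c_int), cmp_cb)
print(list(arr))  # sorted in place: [1, 5, 7, 33, 99]
```

If the tokenizer modules in the engine dictionary are not kept referenced somewhere for the lifetime of each connection, this would be the first thing to check.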

hideaki-t commented 3 years ago

Thank you for the report and sharing the idea.

You're right. Using cffi for nested objects and/or callbacks is not super easy.

I didn't know about pybind11, but it seems to need a C compiler, unlike cffi's ABI mode. I will research pybind11 more, and will also resume checking DragonFFI.

Thank you, Hideaki

hideaki-t commented 3 years ago

do you have a simple script to reproduce the issue?

before start making any change including switching FFI library, I'd like to reproduce the issue on my end to know what triggered your problem.

Thanks, Hideaki

svjack commented 3 years ago

> do you have a simple script to reproduce the issue?
>
> before start making any change including switching FFI library, I'd like to reproduce the issue on my end to know what triggered your problem.
>
> Thanks, Hideaki

```python
import sqlite3
import sqlite_utils
import sqlitefts as fts
import jieba
import pandas as pd

conn = sqlite3.connect("zvt-script/test.db")

class JiebaTokenizer(fts.Tokenizer):
    def tokenize(self, text):
        # note: jieba's character offsets are passed through unchanged here
        for t, p, p_l in jieba.tokenize(text):
            yield t, p, p_l

tk = fts.make_tokenizer_module(JiebaTokenizer())
fts.register_tokenizer(conn, "jieba_tokenizer", tk)

pd_dict = {"name": ["平安银行", "万科A", "国农科技"]}
pd.DataFrame.from_dict(pd_dict).to_sql("df0", conn, if_exists="replace")
db = sqlite_utils.Database(conn)
db["df0"].enable_fts(["name"], fts_version="FTS4", tokenize="jieba_tokenizer")
rows = list(db["df0"].search("银行"))
```
Above is a simple example that reproduces the error with sqlite_utils. Could you take a look when you see this reply?
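One detail worth double-checking in the repro above: its tokenizer yields jieba's character offsets directly, while the first script converted them to UTF-8 byte offsets, which is what SQLite's FTS tokenizer interface works in. A small standalone helper sketching that conversion (`to_byte_offsets` is a hypothetical name for illustration, not part of sqlitefts):

```python
def to_byte_offsets(text, tokens):
    """Convert (token, char_start, char_end) tuples to UTF-8 byte offsets."""
    for t, s, _e in tokens:
        p = len(text[:s].encode("utf-8"))   # bytes preceding the token
        l = len(t.encode("utf-8"))          # token length in bytes
        yield t, p, p + l

text = "新兴铸管"
# e.g. a tokenizer splits it into two two-character words (character offsets)
tokens = [("新兴", 0, 2), ("铸管", 2, 4)]
print(list(to_byte_offsets(text, tokens)))
# each CJK character is 3 bytes in UTF-8 -> [('新兴', 0, 6), ('铸管', 6, 12)]
```

Even if the offsets turn out not to be the cause of the crash, wrong byte offsets can corrupt FTS match/offset results, so it seems worth ruling out.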