Open ghost opened 3 years ago
So assuming you ran both insert into ftstable(ftstable) values ('optimize');
and Vacuum
I don't think there's any good solution to this sadly.
You can play with some indexing options such as adding stemming with tokenize='porter unicode61'
which should reduce the number of unique terms, and maybe stripping out stop words.
I've hit the same problem, and the issue is that without ordering by rank, SQLite will just return the first matching rows it finds, while with ranking it has to look at all of them (so it should be less worse if your query is more specific). Then it's made worse because the SQLite ranking algorithm uses the number of occurrences of a term in a document to do the ranking (like most FTS search), but as far as I understand from the documentation doesn't actually store that information (xInstCount
) but computes it by also storing all positions of a term per document and then counting through those.
You can't prevent it from storing all that position information in FTS5 without losing the ranking ability altogether (using detail=none
or =column
) and I don't know of a way to make it work any better without modifying SQLite itself.
I'm trying to make tantivy work in wasm the same way as SQLite right now which should be much more efficient for FTS.
Thank you for the thoughtful followup.
I did insert optimize
and ran vaccuum
but like you said, it appears SQlite has to iterate through the table in order to rank all the values. Unfortunate.
I tried to workaround the performance issue in another FTS5 table called article_title
by using the new "trigram" tokenizer instead "porter". This tokenizer enables LIKE
clauses / substring matching, which filters out rows that are missing keywords instead of including them like porter does. I still can't use "ORDER BY rank" but at least it only matches articles with all the keywords I've specified.
I'll try using "trigram" on the article
table and let you know how it goes.
Side note, have you seen other search-in-a-box solutions like https://github.com/typesense/typesense or https://github.com/meilisearch/MeiliSearch? They might interesting to look at if tantivy turns out to be a bit difficult.
I'm trying to make tantivy work in wasm the same way as SQLite right now which should be much more efficient for FTS.
I would absolutely love this. I'm about to abuse sql.js-httpvfs to add search to my gh-pages hosted log archive, and I would be delighted to switch that out for tantivy. Anywhere I can follow along and/or donate to help out?
Edit: I see there's https://github.com/phiresky/tantivy-wasm, which unfortunately has not quite enough information for me to get it working, but I'll keep poking at it.
Anywhere I can follow along and/or donate to help out?
News about that will probably appear here: https://github.com/tantivy-search/tantivy/pull/1067
It already works and doesn't require much special setup (running yarn; yarn build
in the repo you linked should give you the compiled version, which is usable with a normal tantivy database created from tantivy cli 0.14), but it's fairly hacky and depends on my outdated fork of tantivy.
Do you still see tantivy as the future for this type of feature? It doesn't look like you've touched it in over a year. Did tantivy end up not being as fast or reliable as sqlite?
SQLite now official supports WASM, so may be, it might be much more efficient than sql.js. You can read more about it here and here
I've switched to pagefind. Much simpler and it supports everything sql.js aims to someday support.
Thank you for writing this library. It's really neat!
I'm currently trying to debug a performance issue with querying a FTS5 virtual table. This table is populated with 6 million rows of articles that have title and text columns. The query I'm using on this table looks like:
select * from articles where articles match ? order by rank LIMIT 7
Right now, when that query is executed by the sql.js-httpvfs worker, it fires a continuous stream of slew network requests, taking 20 seconds plus to download tens of MB of data.
I found that removing the "order by rank" clause reduces the latency of the query to a few seconds and the size of the query to less than 1 MB of data. Unfortunately, the results are also much less useful; they're returned in alphabetical order and not in order of relevancy.
Here are the query plan comparisons between the two:
What can I do to speed this kind of query up without losing the relevancy that "ORDER BY rank" provides?