phiresky / sql.js-httpvfs

Hosting read-only SQLite databases on static file hosters like Github Pages
Apache License 2.0
3.47k stars 104 forks source link

How do you speed up a "ORDER BY rank" FTS5 query? #10

Open ghost opened 3 years ago

ghost commented 3 years ago

Thank you for writing this library. It's really neat!

I'm currently trying to debug a performance issue with querying a FTS5 virtual table. This table is populated with 6 million rows of articles that have title and text columns. The query I'm using on this table looks like: select * from articles where articles match ? order by rank LIMIT 7

Right now, when that query is executed by the sql.js-httpvfs worker, it fires a continuous stream of slew network requests, taking 20 seconds plus to download tens of MB of data.

I found that removing the "order by rank" clause reduces the latency of the query to a few seconds and the size of the query to less than 1 MB of data. Unfortunately, the results are also much less useful; they're returned in alphabetical order and not in order of relevancy.

Here are the query plan comparisons between the two:

sqlite> explain query plan select title from articles where articles match "bagel" limit 1;
QUERY PLAN
`--SCAN TABLE articles VIRTUAL TABLE INDEX 0:M2
sqlite> explain query plan select title from articles where articles match "bagel" order by rank limit 1;
QUERY PLAN
`--SCAN TABLE articles VIRTUAL TABLE INDEX 32:M2

What can I do to speed this kind of query up without losing the relevancy that "ORDER BY rank" provides?

phiresky commented 3 years ago

So assuming you ran both insert into ftstable(ftstable) values ('optimize'); and Vacuum I don't think there's any good solution to this sadly.

You can play with some indexing options such as adding stemming with tokenize='porter unicode61' which should reduce the number of unique terms, and maybe stripping out stop words.

I've hit the same problem, and the issue is that without ordering by rank, SQLite will just return the first matching rows it finds, while with ranking it has to look at all of them (so it should be less worse if your query is more specific). Then it's made worse because the SQLite ranking algorithm uses the number of occurrences of a term in a document to do the ranking (like most FTS search), but as far as I understand from the documentation doesn't actually store that information (xInstCount) but computes it by also storing all positions of a term per document and then counting through those.

You can't prevent it from storing all that position information in FTS5 without losing the ranking ability altogether (using detail=none or =column) and I don't know of a way to make it work any better without modifying SQLite itself.

I'm trying to make tantivy work in wasm the same way as SQLite right now which should be much more efficient for FTS.

ghost commented 3 years ago

Thank you for the thoughtful followup.

I did insert optimize and ran vaccuum but like you said, it appears SQlite has to iterate through the table in order to rank all the values. Unfortunate.

I tried to workaround the performance issue in another FTS5 table called article_title by using the new "trigram" tokenizer instead "porter". This tokenizer enables LIKE clauses / substring matching, which filters out rows that are missing keywords instead of including them like porter does. I still can't use "ORDER BY rank" but at least it only matches articles with all the keywords I've specified.

I'll try using "trigram" on the article table and let you know how it goes.

Side note, have you seen other search-in-a-box solutions like https://github.com/typesense/typesense or https://github.com/meilisearch/MeiliSearch? They might interesting to look at if tantivy turns out to be a bit difficult.

bakkot commented 3 years ago

I'm trying to make tantivy work in wasm the same way as SQLite right now which should be much more efficient for FTS.

I would absolutely love this. I'm about to abuse sql.js-httpvfs to add search to my gh-pages hosted log archive, and I would be delighted to switch that out for tantivy. Anywhere I can follow along and/or donate to help out?

Edit: I see there's https://github.com/phiresky/tantivy-wasm, which unfortunately has not quite enough information for me to get it working, but I'll keep poking at it.

phiresky commented 3 years ago

Anywhere I can follow along and/or donate to help out?

News about that will probably appear here: https://github.com/tantivy-search/tantivy/pull/1067

It already works and doesn't require much special setup (running yarn; yarn build in the repo you linked should give you the compiled version, which is usable with a normal tantivy database created from tantivy cli 0.14), but it's fairly hacky and depends on my outdated fork of tantivy.

chrisspen commented 2 years ago

Do you still see tantivy as the future for this type of feature? It doesn't look like you've touched it in over a year. Did tantivy end up not being as fast or reliable as sqlite?

fawazahmed0 commented 1 year ago

SQLite now official supports WASM, so may be, it might be much more efficient than sql.js. You can read more about it here and here

chrisspen commented 1 year ago

I've switched to pagefind. Much simpler and it supports everything sql.js aims to someday support.