paradedb / paradedb

Postgres for Search and Analytics
https://paradedb.com
GNU Affero General Public License v3.0

Issue with tokenization with `'` #1759

Open philippemnoel opened 1 week ago

philippemnoel commented 1 week ago

Discussed in https://github.com/orgs/paradedb/discussions/1752

Originally posted by **jankovicsandras** October 7, 2024

### What happens?

I've read these docs:

- https://docs.paradedb.com/documentation/advanced/overview
- https://docs.paradedb.com/documentation/full-text/term
- https://docs.paradedb.com/documentation/full-text/phrase

but it's unclear to me how one should search for a real life question with BM25 (Bag Of Words, not exact phrase matching). I'm testing on a [wordpress-forum related dataset](https://huggingface.co/datasets/mteb/cqadupstack-wordpress/resolve/main/corpus.jsonl), here are some examples.

Input question: `Add filename to attachment page url`

How should I turn this question to a BM25 search object? Because the straightforward

```sql
SELECT id, doctext FROM paradedbbm25.search(
  query => paradedb.parse('Add\ filename\ to\ attachment\ page\ url'),
  limit_rows => 5
);
```

will not find documents with words `filename` or `attachment`; this would only match the exact phrase `Add filename to attachment page url`.

So I need to tokenize the question, but I ran into issues with apostrophe escaping:

Input question: `Custom Menu in Admin doesn't change menu in browser`

I made a simple split-on-whitespace tokenizer that escapes the special characters https://docs.paradedb.com/documentation/full-text/term#special-characters (Note: apostrophe `'` is not on the list), but

```sql
SELECT id, doctext FROM paradedbbm25.search(
  'doctext:(Custom Menu in Admin doesn''t change menu in browser)',
  limit_rows => 5
);
```

results in ParseError/SyntaxError, and

```sql
SELECT id, doctext FROM paradedbbm25.search(
  'doctext:(Custom Menu in Admin doesn\'t change menu in browser)',
  limit_rows => 5
);
```

results in syntax error at or near "t" (in `doesn\'t`).

How should I tokenize a question containing apostrophe to be used with BM25 search? How could I use the same tokenizer that was used in `paradedb.create_bm25()`? (because if the question is tokenized with a different method than `create_bm25()`, then there's a risk of missing relevant words in the bag-of-words model and losing accuracy)

### To Reproduce

```
CALL paradedb.create_bm25(...
```

```sql
SELECT id, doctext FROM paradedbbm25.search(
  'doctext:(Custom Menu in Admin doesn''t change menu in browser)',
  limit_rows => 5
);
```

results in ParseError/SyntaxError, and

```sql
SELECT id, doctext FROM paradedbbm25.search(
  'doctext:(Custom Menu in Admin doesn\'t change menu in browser)',
  limit_rows => 5
);
```

results in syntax error at or near "t" (in `doesn\'t`).

### OS:

Ubuntu LTS in Colab

### ParadeDB Version:

releases/download/v0.10.2/postgresql-16-pg-search_0.10.2-1PARADEDB-jammy_amd64.deb

### Are you using ParadeDB Docker, Helm, or the extension(s) standalone?

ParadeDB pg_search Extension

### Full Name:

András Jankovics

### Affiliation:

András Jankovics

### Did you include all relevant data sets for reproducing the issue?

Yes

### Did you include the code required to reproduce the issue?

- [X] Yes, I have

### Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?

- [X] Yes, I have

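
For reference, the split-on-whitespace tokenizer described in the report can be sketched roughly as below. This is a minimal Python sketch, not ParadeDB's own tokenizer: the names `SPECIAL_CHARS`, `escape_term`, `build_query`, and `sql_literal` are hypothetical, and the character set is an assumption that should be checked against the special-characters list in the docs linked above. It only reproduces the query string from the report; it does not resolve the parse error.

```python
# Assumed set of characters that need a backslash inside a query term.
# Verify against the "special characters" section of the ParadeDB docs;
# the apostrophe is deliberately absent, matching the report.
SPECIAL_CHARS = set('+^`:{}"[]()~!\\*')


def escape_term(token, special=SPECIAL_CHARS):
    """Prefix every special character in a token with a backslash."""
    return "".join("\\" + ch if ch in special else ch for ch in token)


def build_query(field, question, special=SPECIAL_CHARS):
    """Split on whitespace, escape each token, and join as field:(t1 t2 ...)."""
    tokens = [escape_term(t, special) for t in question.split()]
    return f"{field}:({' '.join(tokens)})"


def sql_literal(text):
    """Quote text as a standard SQL string literal by doubling single quotes."""
    return "'" + text.replace("'", "''") + "'"


query = build_query("doctext", "Custom Menu in Admin doesn't change menu in browser")
print(query)
# doctext:(Custom Menu in Admin doesn't change menu in browser)
print(sql_literal(query))
# 'doctext:(Custom Menu in Admin doesn''t change menu in browser)'
```

The second printed line is the same literal as in the first failing query above. It is a valid SQL string under PostgreSQL's default `standard_conforming_strings = on`, which suggests the ParseError comes from the search query parser rather than from SQL quoting. The `\'` variant fails earlier, at the SQL level, because a backslash is not an escape character in a standard string literal.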
philippemnoel commented 1 week ago

check #1631