Originally posted by **jankovicsandras** October 7, 2024
### What happens?
I've read these docs:
https://docs.paradedb.com/documentation/advanced/overview
https://docs.paradedb.com/documentation/full-text/term
https://docs.paradedb.com/documentation/full-text/phrase
but it's still unclear to me how to run a real-life question as a BM25 search (bag of words, not exact phrase matching). I'm testing on a [WordPress forum dataset](https://huggingface.co/datasets/mteb/cqadupstack-wordpress/resolve/main/corpus.jsonl); here are some examples.
Input question: ```Add filename to attachment page url```
How should I turn this question into a BM25 search object?
The straightforward approach
```sql
SELECT id, doctext FROM paradedbbm25.search( query => paradedb.parse('Add\ filename\ to\ attachment\ page\ url'), limit_rows => 5 );
```
will not find documents containing the words ```filename``` or ```attachment```; it matches only the exact phrase ```Add filename to attachment page url```.
So I need to tokenize the question, but I ran into issues with apostrophe escaping:
Input question: ```Custom Menu in Admin doesn't change menu in browser```
I made a simple split-on-whitespace tokenizer that escapes the special characters listed at https://docs.paradedb.com/documentation/full-text/term#special-characters (note: the apostrophe ' is not on that list), but
```sql
SELECT id, doctext FROM paradedbbm25.search( 'doctext:(Custom Menu in Admin doesn''t change menu in browser)', limit_rows => 5 );
```
results in ParseError/SyntaxError, and
```sql
SELECT id, doctext FROM paradedbbm25.search( 'doctext:(Custom Menu in Admin doesn\'t change menu in browser)', limit_rows => 5 );
```
results in a syntax error at or near "t" (in ```doesn\'t```).
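For reference, here is a minimal Python sketch of the kind of split-on-whitespace tokenizer described above. The helper names (`escape_term`, `build_query`, `sql_literal`) are hypothetical, and the `SPECIAL_CHARS` set is an assumption that should be checked against the linked docs page. None of this is ParadeDB API. It only shows how the first failing query string is assembled.

```python
# Assumed set of characters the query parser treats as special; verify it
# against the ParadeDB "special characters" docs page before relying on it.
# The apostrophe is deliberately absent, matching the list as described above.
SPECIAL_CHARS = set('+^`:{}"[]()~!\\*')


def escape_term(term: str) -> str:
    """Prefix every special character in a single token with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in term)


def build_query(field: str, question: str) -> str:
    """Split on whitespace and join the escaped tokens as field:(t1 t2 ...)."""
    tokens = [escape_term(tok) for tok in question.split()]
    return f"{field}:({' '.join(tokens)})"


def sql_literal(text: str) -> str:
    """Quote text as a standard SQL string literal by doubling single quotes."""
    return "'" + text.replace("'", "''") + "'"


query = build_query("doctext", "Custom Menu in Admin doesn't change menu in browser")
print(sql_literal(query))
# prints: 'doctext:(Custom Menu in Admin doesn''t change menu in browser)'
```

The printed literal is the same string as in the first failing query, so the apostrophe reaches the query parser unescaped. In practice, passing the query as a bound parameter through the database driver would avoid hand-written SQL quoting altogether.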
How should I tokenize a question containing an apostrophe so that it can be used with BM25 search?
How can I use the same tokenizer that paradedb.create_bm25() used? If the question is tokenized differently from the index built by create_bm25(), there is a risk of missing relevant words in the bag-of-words model and losing accuracy.
### To Reproduce
```sql
CALL paradedb.create_bm25(...
```
```sql
SELECT id, doctext FROM paradedbbm25.search( 'doctext:(Custom Menu in Admin doesn''t change menu in browser)', limit_rows => 5 );
```
results in ParseError/SyntaxError, and
```sql
SELECT id, doctext FROM paradedbbm25.search( 'doctext:(Custom Menu in Admin doesn\'t change menu in browser)', limit_rows => 5 );
```
results in a syntax error at or near "t" (in ```doesn\'t```).
### OS:
Ubuntu LTS in Colab
### ParadeDB Version:
releases/download/v0.10.2/postgresql-16-pg-search_0.10.2-1PARADEDB-jammy_amd64.deb
### Are you using ParadeDB Docker, Helm, or the extension(s) standalone?
ParadeDB pg_search Extension
### Full Name:
András Jankovics
### Affiliation:
András Jankovics
### Did you include all relevant data sets for reproducing the issue?
Yes
### Did you include the code required to reproduce the issue?
- [X] Yes, I have
### Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?
- [X] Yes, I have
Discussed in https://github.com/orgs/paradedb/discussions/1752