persistory / browserparrot-issues

Public issues, suggestions, and support for BrowserParrot
https://browserparrot.com/

Unable to search Japanese (and possibly other non-English) text #16

Open aonsager opened 2 years ago

aonsager commented 2 years ago

When typing a search query in Japanese I get zero search results. The text is visible in search results when I find the item through English queries, so I thought there may be some filter it's not getting through when parsing either the query or the results.

I understand that this may be very low-priority, so please handle as you see fit. Thanks!

[Screenshots: en_query, ja_query]

iansinnott commented 2 years ago

Ah yes, thanks for pointing this out @aonsager. This is indeed the case and it's a problem. It's not just Japanese: CJK scripts in general do not work currently.

This has to do with how the FTS system's tokenizer works [1]. The tokenizer can be configured, and doing so is in the backlog.

[1] https://www.sqlite.org/fts5.html#tokenizers
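For readers who want to reproduce the tokenizer behaviour, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes the underlying SQLite build has FTS5 enabled, and the `trigram` tokenizer additionally needs SQLite 3.34 or newer. The table name `pages`, the helper `fts5_match_count`, and the sample text are made up for illustration; this is not BrowserParrot's actual schema or fix. The default `unicode61` tokenizer splits on whitespace and punctuation, so an unspaced run of Japanese text becomes one long token and a shorter query term never matches it.

```python
import sqlite3

# Hypothetical sample document: one English word plus unspaced Japanese text.
DOC = "BrowserParrot 日本語のテキストを検索する"


def fts5_match_count(doc: str, query: str, tokenizer: str = "unicode61") -> int:
    """Index `doc` in a throwaway FTS5 table and count rows matching `query`."""
    conn = sqlite3.connect(":memory:")
    try:
        # `tokenizer` is interpolated directly; only pass trusted names here.
        conn.execute(
            f"CREATE VIRTUAL TABLE pages USING fts5(body, tokenize = '{tokenizer}')"
        )
        conn.execute("INSERT INTO pages (body) VALUES (?)", (doc,))
        # Quote the query so FTS5 treats it as a phrase, not as query syntax.
        phrase = '"' + query.replace('"', '""') + '"'
        row = conn.execute(
            "SELECT count(*) FROM pages WHERE pages MATCH ?", (phrase,)
        ).fetchone()
        return row[0]
    finally:
        conn.close()


if __name__ == "__main__":
    print("unicode61, English :", fts5_match_count(DOC, "BrowserParrot"))
    print("unicode61, Japanese:", fts5_match_count(DOC, "テキスト"))
    if sqlite3.sqlite_version_info >= (3, 34, 0):
        print("trigram,   Japanese:", fts5_match_count(DOC, "テキスト", "trigram"))
```

The `trigram` tokenizer is only one possible configuration; it indexes overlapping three-character sequences, so queries shorter than three characters will not match with it.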

iansinnott commented 1 year ago

Have been rewriting the backend for this to use an alternate search system. In my limited testing CJK works roughly as expected. In the case of Chinese there's no special handling for word separation so individual characters are treated as terms. Could be improved but definitely better than the status quo.
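To illustrate what "individual characters are treated as terms" can mean in practice, here is a rough sketch in Python. It is not taken from the rewritten backend; the function names `tokenize` and `matches` and the chosen Unicode ranges are assumptions for illustration only. Non-CJK text is split into lowercase words, while each CJK character becomes its own term.

```python
import re

# Assumed CJK ranges: Hiragana and Katakana, CJK Unified Ideographs, Hangul syllables.
_CJK = r"\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af"

# Either a single CJK character, or a run of word characters that are not CJK.
_TERM = re.compile(rf"[{_CJK}]|[^\W{_CJK}]+")


def tokenize(text: str) -> list[str]:
    """Split text into lowercase words, with each CJK character as its own term."""
    return [t.lower() for t in _TERM.findall(text)]


def matches(doc: str, query: str) -> bool:
    """Return True if every query term occurs somewhere in the document."""
    return set(tokenize(query)) <= set(tokenize(doc))


if __name__ == "__main__":
    print(tokenize("SQLite 全文检索"))
    print(matches("SQLite 全文检索", "检索"))
```

Because this sketch ignores term order, a query can match a document that contains the same characters in unrelated positions. Phrase matching or proper word segmentation would tighten that, which is the kind of improvement the comment above alludes to.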

It's not exactly the same tool but here's the link: https://github.com/iansinnott/browser-gopher

[Screenshot: CleanShot 2022-10-20 at 17 01 10@2x]