jameslittle230 / stork

🔎 Impossibly fast web search, made for static sites.
https://stork-search.net
Apache License 2.0

Main thread panic in stork search "not a char boundary" #356

Open karlwilcox opened 1 year ago

karlwilcox commented 1 year ago

With the following input file (named gallery.toml):

[output]
    displayed_results_count = 31
[input]
    url_prefix = "/gallery/"
    frontmatter_handling = "Omit"
    stemming = "None"
    minimum_indexed_substring_length = 4
files = [
{ url = "010695", title = "(Untitled)", contents = "caption saint–aubin–fosse–louvain  then  gules a chevron argent between 3 eagles or. caption  saint–berthevin  then  (1999) per chief per pale gules and argent; and sable  an eagle arg beaked and membered or in dexter side  taller a lion sable crowned  gu armed and langued gu in sinister side  shorter shorter a demi  lion arg in chief taller taller  lower. caption saint–berthevin–la–tanniere  then  (1999) or a chevron gules 2 eagles in chief azure  a tree eradicated vert in middle  %base higher. caption  saint–charles–la–foret  then  gules a carbuncle or. caption 'saint–denis–d'anjou'  ", filetype="PlainText" },
]

We run the command:

stork build --input gallery.toml --output gallery.st

We then try a command-line search for a known hit, e.g.:

stork search --format json --index gallery.st --query "azure"

We get the message:

thread 'main' panicked at 'byte index 540 is not a char boundary; it is inside '–' (bytes 539..542) of `caption saint–aubin–fosse–louvain  then  gules a chevron argent between 3 eagles or. caption  saint–berthevin  then  (1999) per chief per pale gules and argent; and sable  an eagle arg beaked and membered or in dexter side  taller a lion sable crow`[...]', stork-lib/src/index_v4/search/excerpt_grouping.rs:158:19
stack backtrace:
   0: rust_begin_unwind
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:64:14
   2: core::str::slice_error_fail_rt
   3: core::str::slice_error_fail
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/str/mod.rs:86:9
   4: stork_lib::index_v4::search::render_search_values
   5: stork::main

(Byte 540 falls inside the en dash between "charles" and "la", shortly before the final "gules" in the content string.)
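For anyone reading along: the panic message indicates a Rust `&str` being sliced at a byte offset that lands inside a multi-byte UTF-8 character. Below is a minimal sketch of boundary-safe slicing using only the standard library. The helper names `floor_to_char_boundary` and `safe_prefix` are hypothetical and chosen for illustration; this is not Stork's actual code, and not necessarily how the project would choose to fix it.

```rust
/// Hypothetical helper (not part of Stork): move `idx` back to the nearest
/// UTF-8 character boundary at or before it, clamping to the string length.
fn floor_to_char_boundary(s: &str, mut idx: usize) -> usize {
    if idx >= s.len() {
        return s.len();
    }
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(idx) {
        idx -= 1;
    }
    idx
}

/// Hypothetical helper: take a prefix of at most `end` bytes without
/// panicking when `end` falls inside a multi-byte character.
fn safe_prefix(s: &str, end: usize) -> &str {
    &s[..floor_to_char_boundary(s, end)]
}
```

With `let s = "saint\u{2013}aubin";`, the en dash occupies bytes 5..8, so `&s[..6]` would panic with a "not a char boundary" message, whereas `safe_prefix(s, 6)` returns `"saint"`. The standard `s.get(..6)` is another option: it returns `None` instead of panicking.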

Other information:

karlw@DESKTOP-9DUHI21:~/Documents/ds-web/tools$ stork --version
Stork 2.0.0-beta.2
karlw@DESKTOP-9DUHI21:~/Documents/ds-web/tools$ file /usr/local/bin/stork
/usr/local/bin/stork: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=2b673e767e82fea952b7220d047e5fe187d91b27, for GNU/Linux 3.2.0, with debug_info, not stripped
karlw@DESKTOP-9DUHI21:~/Documents/ds-web/tools$ uname -a
Linux DESKTOP-9DUHI21 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

(So this is Ubuntu Linux running under WSL2, although I can also reproduce it on a native Ubuntu installation.)

Please let me know if you need anything else. Hope this is useful!

karlwilcox commented 1 year ago

Further investigation suggests that the characters that look like '-' are not ASCII. Removing them solves the problem, so this is likely related to character encoding.

karlwilcox commented 1 year ago

It is \u2013 (EN DASH) that seems to cause the problem.
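To locate such characters in an input string, something like the following sketch may help. It lists every non-ASCII character together with its byte offset and its encoded length in UTF-8. The function name `non_ascii_chars` is hypothetical and only used here for illustration.

```rust
/// Hypothetical helper: return (byte offset, character, UTF-8 length in bytes)
/// for every non-ASCII character in `s`.
fn non_ascii_chars(s: &str) -> Vec<(usize, char, usize)> {
    s.char_indices()
        .filter(|(_, c)| !c.is_ascii())
        .map(|(offset, c)| (offset, c, c.len_utf8()))
        .collect()
}
```

For `"saint\u{2013}aubin"` this reports a single entry: offset 5, the en dash, length 3. That matches the three-byte span (539..542) named in the panic message.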

karlwilcox commented 1 year ago

Actually, everything non-ASCII in the input file seems to cause a problem with the command-line search hits. In PHP,

$content = iconv("UTF-8", "ASCII//TRANSLIT", $content);

fixes the problem.

This may even be documented somewhere so I'll shut up now...
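For readers preparing input outside PHP, here is a rough sketch of the same workaround idea in Rust: reduce the content to ASCII before it goes into the input file. It is far less complete than `iconv` with `//TRANSLIT`. The function name `to_ascii_lossy`, the small mapping table, and the `?` placeholder for unmapped characters are all assumptions made for this example.

```rust
/// Hypothetical helper: map a few common punctuation characters to ASCII
/// look-alikes and replace any other non-ASCII character with '?'.
fn to_ascii_lossy(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            c if c.is_ascii() => c,
            // U+2010..U+2015: hyphens and dashes, including EN DASH (U+2013).
            '\u{2010}'..='\u{2015}' => '-',
            // Curly single quotes.
            '\u{2018}' | '\u{2019}' => '\'',
            // Curly double quotes.
            '\u{201C}' | '\u{201D}' => '"',
            // Anything else is replaced with a placeholder (arbitrary choice).
            _ => '?',
        })
        .collect()
}
```

For example, `to_ascii_lossy("saint\u{2013}denis")` yields `"saint-denis"`. Note that this only sidesteps the panic by changing the indexed text; accented letters are lost rather than transliterated.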