lnx-search / lnx

⚡ Insanely fast, 🌟 Feature-rich searching. lnx is the adaptable, typo tolerant deployment of the tantivy search engine.
https://lnx.rs
MIT License
1.23k stars 46 forks

400 status code / invalid query when using the `"` character #131

Open mosheduminer opened 1 year ago

mosheduminer commented 1 year ago

Hitting the indexes/{index}/search endpoint with a query with a " character inside:

{
  "query": {
    "normal": {
      "ctx": "test\""
    }
  }
}

results in the response

{"status":400,"data":"invalid query: SyntaxError(\"test\\\"\")"}

Maybe there's a decoding bug on my end? If so, it may be the HTTP library I'm using. I'm using the docker image.

ChillFish8 commented 1 year ago

Hello! Sorry for the slow response, I didn't see the notification :)

The issue is that your query contains an opening " with no closing ". The parser tries to treat a quoted section as a phrase query, so "hello world" matches exactly hello world, but a lone " on its own isn't valid query syntax (see https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html)
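As a client-side workaround, the query text could be cleaned before it is sent. The following is only a minimal sketch: `sanitize_query` is a hypothetical helper, not part of lnx or tantivy, and it assumes that an odd number of " characters means a phrase was never closed. It does not handle any other special characters of the query syntax.

```python
def sanitize_query(text: str) -> str:
    """Hypothetical helper: drop double quotes when they are unbalanced.

    Assumption: an odd number of '"' characters means at least one phrase
    was never closed, which the query parser rejects as a syntax error.
    """
    if text.count('"') % 2 == 0:
        # Quotes are balanced, so leave the phrase query untouched.
        return text
    # Replace every quote with a space, then collapse repeated whitespace.
    return " ".join(text.replace('"', " ").split())


if __name__ == "__main__":
    print(sanitize_query('test"'))
    print(sanitize_query('"hello world"'))
```

Replacing the quote with a space rather than deleting it keeps the two halves of a word such as test"ing apart.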

mosheduminer commented 1 year ago

Hi @ChillFish8! Thanks for the response. To clarify, does that mean there is no way to match text with quotes?

I'm asking because I have many texts where " is in the middle of a word, and this is expected for the texts I am dealing with (it is used to indicate that the word is a contraction of multiple words, similar to how ' is used in English for words like didn't).

mosheduminer commented 1 year ago

I guess I should open an issue requesting the ability to escape quotes in the tantivy repo?

ChillFish8 commented 1 year ago

> Thanks for the response. To clarify, does that mean there is no way to match text with quotes?

So technically you could support it in the parser, but it won't behave how you expect it to.

Under the hood, words like that will be split up. So, say I had didn't or test"ing, they'll be split into didn, t and test, ing. The tokenizer will remove any special characters like that.
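The splitting described above can be illustrated with a rough sketch. This is not tantivy's actual tokenizer code; `simple_tokenize` is a hypothetical stand-in that assumes tokens are runs of alphanumeric characters, lowercased, with everything else treated as a separator.

```python
def simple_tokenize(text: str) -> list[str]:
    """Hypothetical stand-in for a tokenizer that splits on non-alphanumerics."""
    tokens: list[str] = []
    current: list[str] = []
    for ch in text:
        if ch.isalnum():
            # Keep building the current token, lowercased.
            current.append(ch.lower())
        elif current:
            # Any other character (quote, apostrophe, space) ends the token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    print(simple_tokenize("didn't"))    # ['didn', 't']
    print(simple_tokenize('test"ing'))  # ['test', 'ing']
```

Under this assumption the " never reaches the index, so a search for test"ing could at best match the two tokens test and ing.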

ChillFish8 commented 1 year ago

If you're looking for a specific word and don't want that behaviour, you'd need to use the string field type, which doesn't do any tokenizing, and then match the entire value using a term query.