quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

Dynamic Mapping should not tokenize values #1829

Open machete-michael opened 2 years ago

machete-michael commented 2 years ago

I'm using dynamic mapping to ingest a JSON object that has a field containing an array of JSON objects. If an array element has a field whose value contains delimiters such as -, _, or # (e.g. a UUID), querying against this field results in the error:

"SplitSearchError { error: \"Invalid query: The field '_dynamic' does not have positions indexed\"…}"

Steps to reproduce:

  1. Set up the index config to use dynamic mode.
  2. Ingest an object with an array of objects with a field mapped to a UUID.
  3. Execute a term query that matches a UUID.

Expected behavior: I should get a matching document as a response.

Configuration:

  1. Output of quickwit --version: v0.3.1nightly
  2. The index_config.yaml:

version: 0
index_id: foo
doc_mapping:
  mode: dynamic
  field_mappings:
    - name: id
      type: text
      tokenizer: raw

fmassot commented 2 years ago

Thanks @machete-michael for the report.

More info on this issue.

Here is a request made on a default dynamic mapping (see docs example) that shows the same error:

curl -XGET http://localhost:7280/api/v1/my_dynamic_index/search\?query\=cart.product_description:cherry-pi
{
  "InvalidQuery": "The field '_dynamic' does not have positions indexed"
}

Without the - character, everything works well. Somehow, adding - triggers a phrase query. But the cause could be something entirely different (such as something happening in tantivy's generate_literals_for_json_object function). We need to investigate what's happening.

fulmicoton commented 2 years ago

The query parser identifies the string to search for correctly. The default tokenizer splits it into several tokens ([cherry, py]), which triggers the phrase query.

Probably the right fix would be to emit an intersection query here if positions are not available, instead of emitting an error.
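
To make the mechanism concrete, here is a minimal Python sketch of how a delimiter in the search value can turn a term query into a phrase query. It is only an approximation: `simple_tokenize` and `plan_query` are hypothetical names, the ASCII-only split rule is an assumption standing in for the default tokenizer, and none of this is Quickwit's or tantivy's actual code.

```python
import re


def simple_tokenize(text):
    """Rough stand-in for a default tokenizer (assumption):
    split on anything that is not an ASCII letter or digit, then lowercase."""
    return [tok.lower() for tok in re.split(r"[^0-9A-Za-z]+", text) if tok]


def plan_query(value, positions_indexed):
    """Describe the query a parser could build for `value` (hypothetical helper)."""
    tokens = simple_tokenize(value)
    if len(tokens) <= 1:
        # A single token needs no position information.
        return ("term", tokens)
    if not positions_indexed:
        # Several tokens imply a phrase query, which needs token positions.
        raise ValueError("field does not have positions indexed")
    return ("phrase", tokens)


print(plan_query("cherry", positions_indexed=False))  # ('term', ['cherry'])
try:
    plan_query("cherry-pi", positions_indexed=False)
except ValueError as err:
    print("error:", err)
```

Under these assumptions a UUID behaves the same way, because every dash yields an extra token.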

fmassot commented 2 years ago

@machete-michael sorry for the long silence. Taking a fresh look at this issue made me think that you may be interested in a UUID-friendly tokenizer.

We have opened an issue for this: https://github.com/quickwit-oss/quickwit/issues/1143

There is a PR that is almost mergeable here too: https://github.com/quickwit-oss/quickwit/pull/1598

Is this something you are interested in?

machete-michael commented 2 years ago

Hi @fmassot,

Thank you for looking into this issue.

A UUID-friendly tokenizer may solve the issue for values with dashes, but not the issues with the other delimiters.

In any case, I've moved on to other solutions and am not waiting for a fix.

Please feel free to close the issue.

PSeitz commented 1 year ago

> The query parser identifies the string to search for correctly. The default tokenizer splits it into several tokens ([cherry, py]), which triggers the phrase query.
>
> Probably the right fix would be to emit an intersection query here if positions are not available, instead of emitting an error.

Shouldn't we use the same tokenizer as set in the config for the field ("raw")?

The PhraseQuery issue would still persist for fields that are tokenized. I'm not sure about an intersection query, since it may silently return wrong results.
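
The concern about wrong results can be illustrated with a small Python sketch. The documents and the helpers `phrase_match` and `intersection_match` are made up for illustration and are not tantivy APIs; the point is only that an intersection ignores token order and adjacency.

```python
def phrase_match(doc_tokens, query_tokens):
    """True if query_tokens appear in doc_tokens as a contiguous run, in order."""
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def intersection_match(doc_tokens, query_tokens):
    """True if every query token occurs somewhere in the document."""
    return set(query_tokens) <= set(doc_tokens)


# Hypothetical, already-tokenized documents.
docs = {
    "doc_a": ["cherry", "pi", "recipe"],
    "doc_b": ["pi", "day", "cherry", "sale"],
}
query = ["cherry", "pi"]

phrase_hits = [d for d, toks in docs.items() if phrase_match(toks, query)]
intersection_hits = [d for d, toks in docs.items() if intersection_match(toks, query)]

print(phrase_hits)        # ['doc_a']
print(intersection_hits)  # ['doc_a', 'doc_b']
```

Here doc_b contains both tokens but never the sequence "cherry pi", so an intersection fallback would return it without any warning.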

peacand commented 1 year ago

Hi there,

I've just spun up a fresh install of Quickwit 0.5 and configured a very simple index with no field mappings (pure dynamic mode). I'm ingesting JSON logs from Vector. In that configuration, I cannot search for anything containing the characters "-", ",", ".", or a space. I get the error "Invalid query: The field '_dynamic' does not have positions indexed" 100% of the time. It does not depend on the field I'm searching on. Most of the fields I've tried are supposed to be simple string fields.

Examples:

If I search for a term without these special characters, it works with no problem. This issue is quite problematic because almost all the searches I would like to do fail 🙁 I've tried deleting my index and restarting from scratch, with no success.

My index configuration:

version: 0.5
index_id: suricata
doc_mapping:
  mode: dynamic
indexing_settings:
  commit_timeout_secs: 10

Did I miss something in the setup/configuration?

fulmicoton commented 1 year ago

The queries you are trying to run are so-called phrase queries (due to the quotation marks). They require token positions to be stored in order to run. This is something that is not enabled by default, but you can enable it as follows.

version: 0.5
index_id: suricata
doc_mapping:
  mode: dynamic
  dynamic_mapping:
    record: position # defaults to basic
indexing_settings:
  commit_timeout_secs: 10
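
As a rough illustration of why the `record` setting matters, the Python sketch below builds toy postings at three levels named after the `basic`, `freq`, and `position` options. `build_postings` and the data layout are assumptions made for this example and do not reflect how tantivy stores postings on disk.

```python
from collections import defaultdict


def build_postings(docs, record):
    """Build toy postings for docs ({doc_id: [tokens]}) at a given record level."""
    if record not in ("basic", "freq", "position"):
        raise ValueError(f"unknown record level: {record}")

    # Collect the positions of every token in every document.
    positions = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, tok in enumerate(tokens):
            positions[tok].setdefault(doc_id, []).append(pos)

    postings = {}
    for tok, by_doc in positions.items():
        if record == "basic":
            # Only which documents contain the token.
            postings[tok] = sorted(by_doc)
        elif record == "freq":
            # Documents plus how often the token occurs in each.
            postings[tok] = {d: len(p) for d, p in by_doc.items()}
        else:
            # Documents plus token positions (frequency is the list length).
            postings[tok] = dict(by_doc)
    return postings


docs = {1: ["cherry", "pi"], 2: ["pi", "cherry", "cherry"]}
for level in ("basic", "freq", "position"):
    print(level, build_postings(docs, level))
```

Only the last level keeps enough information to check that "cherry" is immediately followed by "pi", which is what a phrase query has to verify.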

peacand commented 1 year ago

Thank you @fulmicoton! I confirm it works perfectly!