quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

Dynamic fields term set query not working #4945

Open camerondavison opened 2 months ago

camerondavison commented 2 months ago

I can get the following query to work:

data.event_type:StandardMgaQuoteIssued OR data.event_type:QuoteIssued

but I cannot get the equivalent term set query to work:

data.event_type:IN [StandardMgaQuoteIssued QuoteIssued]

This is my Kafka ingestion configuration:

version: 0.6
index_id: events

doc_mapping:
  field_mappings:
    - name: id
      type: text
      tokenizer: raw
...
    - name: data
      type: json
      tokenizer: default
    - name: occurred_at
      type: datetime
      fast: true
      input_formats:
        - rfc3339
        - "%Y-%m-%dT%H:%M:%S.%f"
      precision: seconds
  timestamp_field: occurred_at

indexing_settings:
  commit_timeout_secs: 10

{
  "build": {
    "build_date": "2024-03-29T16:35:13Z",
    "build_profile": "release",
    "build_target": "x86_64-unknown-linux-gnu",
    "cargo_pkg_version": "0.8.1",
    "commit_date": "2024-03-29T14:09:41Z",
    "commit_hash": "e6c53967f8e57401d93bcc555d361dad69bd4ece",
    "commit_short_hash": "e6c5396",
    "commit_tags": [
      "v0.8.1"
    ],
    "version": "v0.8.1"
  },
  "runtime": {
    "num_cpus_logical": 4,
    "num_cpus_physical": 2,
    "num_threads_blocking": 3,
    "num_threads_non_blocking": 1
  }
}

I'm not sure whether this is a feature or a bug, given that the query targets dynamic JSON fields.

fmassot commented 2 months ago

Thanks for the report @camerondavison

@trinity-1686a, can you look at this?

trinity-1686a commented 1 month ago

It looks like the issue is that TermSetQuery does not go through the tokenizer, so the values are not lowercased. @camerondavison, can you confirm that you use the default tokenizer, and that data.event_type:IN [standardmgaquoteissued quoteissued] returns the results you expect?

Note that a term set query with only a few elements is often less efficient than ORing a couple of term queries, so using a term set query for a set of 2 elements is not advisable.
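For anyone hitting the same problem, the lowercasing workaround can be sketched on the client side as below. This is a minimal sketch, not Quickwit code: `term_set_query` is a hypothetical helper, and it assumes each value is a single alphanumeric token, so lowercasing is the only normalization needed to match what the default tokenizer indexed. Values containing whitespace or punctuation would need more care.

```python
def term_set_query(field: str, terms: list[str]) -> str:
    """Build a `field:IN [t1 t2 ...]` query string with hand-lowercased terms.

    Hypothetical helper: it only mimics the lowercasing step of the default
    tokenizer and assumes every term is a single alphanumeric token.
    """
    normalized = [term.lower() for term in terms]
    return f"{field}:IN [{' '.join(normalized)}]"


query = term_set_query("data.event_type", ["StandardMgaQuoteIssued", "QuoteIssued"])
print(query)
# prints: data.event_type:IN [standardmgaquoteissued quoteissued]
```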

camerondavison commented 1 month ago

Yes, that worked.

Good to know about OR vs. term set queries. That seems a little counterintuitive, TBH. Thanks.

trinity-1686a commented 1 month ago

Sets become efficient once you have many terms. Computation-wise, both approaches should be close, but network-wise, multiple term queries cause multiple small downloads, while a term set query downloads one big chunk of data. When you have many terms in your set, one large fetch is more efficient than thousands of small fetches, but when you need only a few terms, doing the small fetches is faster. In the future, we may improve term set queries so they are more efficient when only a few terms are requested.
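The trade-off described above could be applied on the client side roughly as follows. This is only a sketch under stated assumptions: `build_query` and `TERM_SET_THRESHOLD` are hypothetical names, and the cutoff of 16 is an arbitrary placeholder rather than a number from Quickwit; the real break-even point depends on the storage backend and would have to be measured.

```python
# Hypothetical cutoff: NOT a Quickwit constant, tune it by measurement.
TERM_SET_THRESHOLD = 16


def build_query(field: str, terms: list[str], threshold: int = TERM_SET_THRESHOLD) -> str:
    """Pick ORed term queries for small sets and a term set query for large ones."""
    if not terms:
        raise ValueError("at least one term is required")
    if len(terms) < threshold:
        # Few terms: several small fetches are cheaper than one big one.
        # Plain term queries matched with the original casing in this thread.
        return " OR ".join(f"{field}:{term}" for term in terms)
    # Many terms: one large fetch wins. Lowercase by hand because the term set
    # query was reported not to go through the tokenizer (v0.8.1).
    return f"{field}:IN [{' '.join(term.lower() for term in terms)}]"


print(build_query("data.event_type", ["StandardMgaQuoteIssued", "QuoteIssued"]))
# prints: data.event_type:StandardMgaQuoteIssued OR data.event_type:QuoteIssued
```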