quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

Provide alternatives to tokenizer drops based on string length. #5112

Open · AndyMcVey opened this issue 1 month ago

Some JSON log files embed relatively long values in key:value pairs; for example, Ansible Tower, Azure DevOps Server, and GitHub can all produce data where a key:value pair exceeds 255 characters.

All current tokenizers drop tokens longer than 255 characters. This makes it difficult to import JSON data with key:value pairs where the value is large relative to the key.
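For context, Quickwit's tokenizers are built on tantivy, whose `RemoveLongFilter` silently discards over-long tokens rather than truncating them. A minimal sketch of the drop behaviour (the builder-style API below matches recent tantivy releases; exact names may differ by version):

```rust
use tantivy::tokenizer::{LowerCaser, RemoveLongFilter, SimpleTokenizer, TextAnalyzer};

fn main() {
    // Mirrors the limit described above: tokens longer than 255 bytes
    // are dropped entirely, not truncated.
    let mut analyzer = TextAnalyzer::builder(SimpleTokenizer::default())
        .filter(RemoveLongFilter::limit(255))
        .filter(LowerCaser)
        .build();

    let long_value = "x".repeat(300); // stand-in for a long CI log value
    let text = format!("short_key {long_value}");

    let mut stream = analyzer.token_stream(&text);
    while stream.advance() {
        println!("indexed token: {}", stream.token().text);
    }
    // Prints only "short_key": the 300-byte value never reaches the index.
}
```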

It would be great if there were an option to do one or more of the following:

  1. Truncate the token instead of omitting it. This could result in some odd behaviour, since only part of the token would be indexed.
  2. If the dataset is JSON and the token is a key:value pair, tokenize only the key (assuming the key is under 255 characters). This would allow existence tests on the key, and perhaps let other search terms in the document still work.
  3. Index the key of a key:value pair, calculate a checksum of the value, and store the checksum as the value (see the sketch after this list). This does change the data somewhat, but it would permit retrieval of a specific value if the searcher searches for the checksum instead of the raw string.
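Option 3 can be approximated client-side today by rewriting documents before ingestion. Here is a minimal sketch assuming `serde_json`, with std's `DefaultHasher` as a stand-in for a real checksum such as SHA-256; `checksum_long_values` and the `cksum:` prefix are hypothetical names, not an existing Quickwit feature:

```rust
use serde_json::{json, Value};
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Values longer than this are dropped by the current tokenizers.
const MAX_TOKEN_BYTES: usize = 255;

/// Recursively replace string values that exceed the token limit with a
/// short, searchable digest. Keys are left untouched, so existence tests
/// on the key keep working.
fn checksum_long_values(value: &mut Value) {
    match value {
        Value::String(s) if s.len() > MAX_TOKEN_BYTES => {
            let mut hasher = DefaultHasher::new();
            s.hash(&mut hasher);
            *value = Value::String(format!("cksum:{:016x}", hasher.finish()));
        }
        Value::Object(map) => map.values_mut().for_each(checksum_long_values),
        Value::Array(items) => items.iter_mut().for_each(checksum_long_values),
        _ => {}
    }
}

fn main() {
    let mut doc = json!({
        "pipeline": "deploy",
        "stdout": "x".repeat(300), // e.g. verbose Azure DevOps output
    });
    checksum_long_values(&mut doc);
    // "stdout" is now "cksum:<16 hex digits>", so the key:value pair
    // survives tokenization and the value stays exact-match searchable.
    println!("{doc}");
}
```

A searcher would then hash the value they are looking for with the same function and query for the resulting `cksum:` token.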