tenzir / public-roadmap

The public roadmap of Tenzir
https://docs.tenzir.com/roadmap

String Dissection #41

Status: Closed (mavam closed 9 months ago)

mavam commented 1 year ago

Tenzir has no good mechanism for handling opaque strings that contain a lot of structure, such as URLs, domains, and user agents. We need functionality that transforms such strings into a record of values.

Elastic's Logstash has a dissect filter for this purpose. (There's also grok, which is regex-based and slower, but can accommodate more input variation.) Here are some details on the dissect filter:

A new dissect pipeline operator takes a "dissect expression" and uses it to add a new record field or replace an existing one. Here's an example using Elastic syntax:

%{name},%{addr1},%{addr2},%{addr3},%{city},%{zip}

Let this be the string value:

Jane Doe,4321 Fifth Avenue,,,New York,87432

The dissection should then transform it into the following record:

{
  "name": "Jane Doe",
  "addr1": "4321 Fifth Avenue",
  "addr2": "",
  "addr3": "",
  "city": "New York",
  "zip": "87432"
}
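
To make the expected behavior concrete, here is a minimal Python sketch of the transformation above. It is not Tenzir's or Logstash's implementation: the function name `dissect` is hypothetical, and the sketch translates the expression into a regular expression for brevity, whereas a real dissect filter would scan for the literal delimiters directly. It only supports plain `%{field}` markers, not the modifiers that Elastic's syntax also allows.

```python
import re
from typing import Dict, Optional

# Matches a %{field} marker; the field name must be a valid identifier.
FIELD = re.compile(r"%\{([A-Za-z_]\w*)\}")


def dissect(expression: str, text: str) -> Optional[Dict[str, str]]:
    """Hypothetical helper: split `text` according to a dissect expression.

    Literal text between %{field} markers is treated as a delimiter.
    Returns a dict of field values, or None if `text` does not fit.
    """
    markers = list(FIELD.finditer(expression))
    parts = []
    pos = 0
    for i, marker in enumerate(markers):
        # Literal delimiter text that precedes this field.
        parts.append(re.escape(expression[pos:marker.start()]))
        # Every field but the last stops at the next delimiter (non-greedy);
        # the last field takes whatever remains.
        quantifier = "*" if i == len(markers) - 1 else "*?"
        parts.append(f"(?P<{marker.group(1)}>.{quantifier})")
        pos = marker.end()
    # Trailing literal text after the last field.
    parts.append(re.escape(expression[pos:]))
    match = re.fullmatch("".join(parts), text, flags=re.DOTALL)
    return match.groupdict() if match else None


expression = "%{name},%{addr1},%{addr2},%{addr3},%{city},%{zip}"
value = "Jane Doe,4321 Fifth Avenue,,,New York,87432"
record = dissect(expression, value)
```

With the expression and string value from this issue, `record` holds the six fields shown in the record above, including the empty strings for `addr2` and `addr3`. Input that lacks the expected delimiters yields `None` rather than a partial record, which is one possible design choice among several.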

### Definition of Done
- [x] Define the UX of the operator.
- [x] Validate that this addresses our URL normalization use case.
- [x] Implement the `parse` operator.
- [x] Implement the `kv` parser.
- [x] Implement the `grok` parser.
- [x] Implement the `time` parser.

mavam commented 12 months ago

It's worth taking a look at what the Tremor folks did with their extractor abstraction. They have dissect, grok, kv, re, and others.

A generic tremor operator would also be interesting, but it is out of scope for this issue.

mavam commented 11 months ago

We had another prospect ask for these capabilities today.

mavam commented 11 months ago

Someone asked me today how we can dissect Apache logs, so +1'ing this item.
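
For illustration, here is a rough Python sketch of what dissecting an Apache access log line could produce. It uses a hand-written regular expression with named groups, which is closer in spirit to grok than to dissect. The function name `parse_clf` and the field names are assumptions made for this example, not part of Tenzir; the pattern targets Apache's Common Log Format only.

```python
import re

# Common Log Format: host ident user [time] "method path protocol" status size
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf(line):
    """Hypothetical helper: turn one Common Log Format line into a dict.

    Returns None when the line does not match the expected layout.
    """
    match = CLF.fullmatch(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Convert the numeric fields; Apache writes "-" when no bytes were sent.
    record["status"] = int(record["status"])
    record["size"] = None if record["size"] == "-" else int(record["size"])
    return record


sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
entry = parse_clf(sample)
```

Here `entry["host"]` is `"127.0.0.1"`, `entry["path"]` is `"/apache_pb.gif"`, and `entry["status"]` is the integer `200`. The timestamp is left as a string, since converting it is the job of a separate time parser.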