vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

New `coercer` transform #405

binarylogic closed 5 years ago

binarylogic commented 5 years ago

We need the ability to coerce values into specified types. This is, in a way, a schema definition: it guarantees that incoming data is coerced into the specified types by the time it is output. This is a precursor to supporting columnar sinks, such as BigQuery, Parquet, or ORC, where typed columns are required when writing data.

That said, we've already laid the groundwork for coercion within the regex_parser, grok_parser, and tokenizer transforms. We simply need to take that same functionality and expose it as a standalone transform.

Example

[sources.in]
  type = "stdin"

[transforms.json]
  inputs = ["in"]
  type = "json_parser"

[transforms.coercer]
  inputs = ["json"]
  type = "coercer"

  [transforms.coercer.types]
    timestamp = "timestamp|%F"
    message = "string"
    status = "int"
    bytes = "int"
    duration = "float"
    is_log = "bool"

[sinks.out]
  inputs = ["coercer"]
  type = "console"

Then, if I send the following data:

{"timestamp": "2019-06-12T21:40:43.833286Z", "message": "Hello world", "status": "200", "bytes": "100", "duration": "54.2", "is_log": "true"}

I should get the following coerced result:

{
  "timestamp": timestamp<2019-06-12T21:40:43.833286Z>,
  "message": "Hello world",
  "status": 200,
  "bytes": 100,
  "duration": 54.2,
  "is_log": true
}

Here, timestamp is coerced into an actual timestamp type. The console sink should output it as ISO 8601.
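To make the intended behavior concrete, here is a minimal Python sketch of the coercion step. It is not Vector's implementation. The function names (`coerce`, `to_bool`, `to_timestamp`), the accepted boolean spellings, skipping fields that are missing from the event, and raising on a failed conversion are all assumptions made for illustration. The spec strings follow the `type` or `type|format` shape used in the config above. Because the sample event carries a full ISO 8601 value, the sketch parses a bare `timestamp` spec as ISO 8601 and only uses `strptime` when a format is given.

```python
from datetime import datetime

# Assumed boolean spellings; the issue does not specify which are accepted.
TRUE_WORDS = {"true", "t", "yes", "y", "1"}
FALSE_WORDS = {"false", "f", "no", "n", "0"}


def to_bool(value, fmt=None):
    """Parse common boolean spellings, case-insensitively."""
    text = str(value).strip().lower()
    if text in TRUE_WORDS:
        return True
    if text in FALSE_WORDS:
        return False
    raise ValueError(f"cannot coerce {value!r} to bool")


def to_timestamp(value, fmt=None):
    """Parse with an explicit strptime format, or fall back to ISO 8601."""
    text = str(value)
    if fmt:
        return datetime.strptime(text, fmt)
    # Older Python versions reject a trailing "Z", so spell out the UTC offset.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text)


# Maps the type name in a spec to a converter taking (value, optional format).
CONVERTERS = {
    "string": lambda value, fmt=None: str(value),
    "int": lambda value, fmt=None: int(value),
    "float": lambda value, fmt=None: float(value),
    "bool": to_bool,
    "timestamp": to_timestamp,
}


def coerce(event, types):
    """Return a copy of `event` with the fields named in `types` coerced.

    Each spec is "type" or "type|format". Fields absent from the event are
    skipped (an assumption); conversion errors propagate to the caller.
    """
    out = dict(event)
    for field, spec in types.items():
        if field not in out:
            continue
        kind, _, fmt = spec.partition("|")
        if kind not in CONVERTERS:
            raise ValueError(f"unknown type {kind!r} for field {field!r}")
        out[field] = CONVERTERS[kind](out[field], fmt or None)
    return out
```

Calling `coerce` on the sample event above with `status = "int"`, `duration = "float"`, `is_log = "bool"`, and `timestamp = "timestamp"` would yield an integer, a float, a boolean, and a timezone-aware datetime respectively, leaving the original event untouched.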

Requirements

These requirements were taken directly from https://github.com/timberio/vector/issues/406

binarylogic commented 5 years ago

@bruceg I noticed you pushed up the branch for a docs fix. Where did you end up? Curious what's left on this.