We need the ability to coerce values into specified types. This is, in a way, a schema definition, guaranteeing that incoming data will be coerced into the specified types when it is output. This is a precursor to supporting columnar sinks, like BigQuery, Parquet, or ORC, where typed columns are required when writing data.
That said, we've already laid the groundwork for coercion within the regex_parser, grok_parser, and tokenizer transforms. For this, we simply need to take that same functionality and allow it to be specified as a standalone transform.
Example
[sources.in]
type = "stdin"
[transforms.json]
inputs = ["in"]
type = "json_parser"
[transforms.coercer]
inputs = ["json"]
type = "coercer"
[transforms.coercer.types]
timestamp = "timestamp|%F"
message = "string"
status = "int"
bytes = "int"
duration = "float"
is_log = "bool"
[sinks.out]
inputs = ["coercer"]
type = "console"
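To make the shape of the types table concrete, here is a minimal sketch of how each value (for example "timestamp|%F") could be split into a type name and an optional format. It is written in Python for illustration only and is not Vector's implementation. The names `TypeSpec`, `parse_type_spec`, and `SUPPORTED_TYPES` are hypothetical, and the set of supported types is an assumption based on the types used in the example above.

```python
from typing import NamedTuple, Optional

# Assumption: the type names used in the example config. The issue leaves
# the full list open ("etc."), so this set is illustrative only.
SUPPORTED_TYPES = ("string", "int", "float", "bool", "timestamp")


class TypeSpec(NamedTuple):
    kind: str                  # type name, e.g. "int" or "timestamp"
    fmt: Optional[str] = None  # optional format after the "|" separator


def parse_type_spec(field: str, spec: str) -> TypeSpec:
    """Split a value such as "timestamp|%F" into a type name and a format."""
    kind, sep, fmt = spec.partition("|")
    if kind not in SUPPORTED_TYPES:
        # Mirrors the wording of the error message requested in the issue.
        raise ValueError(
            f"Type {kind} is not supported for the {field} field, "
            f"it must be one of {', '.join(SUPPORTED_TYPES)}"
        )
    return TypeSpec(kind, fmt if sep else None)


# A subset of the types table from the example config.
types = {"timestamp": "timestamp|%F", "status": "int", "duration": "float"}
parsed = {field: parse_type_spec(field, spec) for field, spec in types.items()}
```

Validating every entry once, when the config is loaded, would let an unsupported type stop startup instead of surfacing per event.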
[ ] Support a new types TOML table that is nested under the root coercer table. Ex: transforms.<transform-id>.types
[ ] If possible, log a warning if there are fields specified in the config that are not captured in the regex.
[ ] Log an error level message and exit if a field type is specified and is not supported. Ex: "Type foo is not supported for the field_name field, it must be one of string, int, bool, float, etc."
[ ] Log a debug level message if a field value cannot be coerced into the specified type. This includes timestamp parsing as well.
[ ] Any field that is not specified in the types table, but is captured in the regex, should follow the default behavior of being extracted as a string.
[ ] Drop the value if the value cannot be coerced into the specified type.
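The per-event behavior in the checklist above could look roughly like the following sketch, written in Python for illustration and not taken from Vector's code. The function names, the accepted boolean spellings, and the sample event are all hypothetical. The sketch assumes the type names were already validated when the config was loaded, and that timestamp types always carry a format.

```python
import logging
from datetime import datetime

logger = logging.getLogger("coercer")

# Assumption: accepted boolean spellings; the issue does not define them.
TRUE_VALUES = {"true", "t", "yes", "y", "1"}
FALSE_VALUES = {"false", "f", "no", "n", "0"}


def to_bool(raw):
    if isinstance(raw, bool):
        return raw
    lowered = str(raw).strip().lower()
    if lowered in TRUE_VALUES:
        return True
    if lowered in FALSE_VALUES:
        return False
    raise ValueError(f"not a boolean: {raw!r}")


def to_timestamp(raw, fmt):
    # Python's strptime does not accept %F on every version, so it is
    # expanded by hand to the equivalent %Y-%m-%d.
    return datetime.strptime(str(raw), fmt.replace("%F", "%Y-%m-%d"))


CONVERTERS = {
    "string": lambda raw, fmt: str(raw),
    "int": lambda raw, fmt: int(raw),
    "float": lambda raw, fmt: float(raw),
    "bool": lambda raw, fmt: to_bool(raw),
    "timestamp": to_timestamp,
}


def coerce_event(event, types):
    """Return a copy of event with configured fields coerced or dropped."""
    out = dict(event)
    for field, spec in types.items():
        if field not in out:
            continue  # nothing to coerce for this event
        kind, _, fmt = spec.partition("|")
        convert = CONVERTERS[kind]  # assumes kinds were validated at startup
        try:
            out[field] = convert(out[field], fmt)
        except (ValueError, TypeError):
            # Requirement: log at debug level and drop the value.
            logger.debug("could not coerce field %s into %s", field, kind)
            del out[field]
    return out


types = {
    "timestamp": "timestamp|%F", "message": "string", "status": "int",
    "bytes": "int", "duration": "float", "is_log": "bool",
}
# Hypothetical input event; "bytes" is deliberately not an integer.
event = {
    "timestamp": "2019-05-23", "message": "hello", "status": "200",
    "bytes": "abc", "duration": "1.5", "is_log": "true", "host": "localhost",
}
result = coerce_event(event, types)
```

In this sketch the `bytes` value is dropped because it cannot be parsed as an integer, and `host` passes through untouched because it has no entry in the types table. Calling `isoformat()` on the coerced timestamp yields an ISO 8601 string.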
Then, if I send the following data:
I should get the following coerced result:
Where timestamp is coerced into an actual timestamp type. It should be output as ISO8601 in the console sink.
Requirements
These requirements were taken directly from https://github.com/timberio/vector/issues/406