As a first step toward supporting various schemas, we want to shift event metadata into a Vector-specific namespace. This will resolve a number of awkward situations where Vector's metadata clashes with the user's schema.
Examples
Transitioning from Logstash
Let's look at a simple example where a user is replacing Logstash with Vector. Vector would receive data from upstream beats over TCP in the following format:

{
  "message": "127.0.0.1 - - [11/Dec/2013:00:01:45 -0800] \"GET /xampp/status.php HTTP/1.1\" 200 3891 \"http://cadenza/xampp/navi.php\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0\"",
  "@timestamp": "2013-12-11T08:01:45.000Z",
  "@version": "1",
  "host": "cadenza",
  "clientip": "127.0.0.1",
  "ident": "-",
  "auth": "-",
  "timestamp": "11/Dec/2013:00:01:45 -0800",
  "verb": "GET",
  "request": "/xampp/status.php",
  "httpversion": "1.1",
  "response": "200",
  "bytes": "3891",
  "referrer": "\"http://cadenza/xampp/navi.php\"",
  "agent": "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0\""
}

Here we can see that the event is already enriched with metadata. When the event leaves Vector's socket source, it'll be formatted as such:
{
  "timestamp": "...",
  "host": "...",
  "message": "{\n\"message\": \"127.0.0.1 - - [11/Dec/2013:00:01:45 -0800] \\\"GET /xampp/status.php HTTP/1.1\\\" 200 3891 \\\"http://cadenza/xampp/navi.php\\\" \\\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0\\\"\",\n\"@timestamp\": \"2013-12-11T08:01:45.000Z\",\n\"@version\": \"1\",\n\"host\": \"cadenza\",\n\"clientip\": \"127.0.0.1\",\n\"ident\": \"-\",\n\"auth\": \"-\",\n\"timestamp\": \"11/Dec/2013:00:01:45 -0800\",\n\"verb\": \"GET\",\n\"request\": \"/xampp/status.php\",\n\"httpversion\": \"1.1\",\n\"response\": \"200\",\n\"bytes\": \"3891\",\n\"referrer\": \"\\\"http://cadenza/xampp/navi.php\\\"\",\n\"agent\": \"\\\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0\\\"\"\n}\n"
}
Right off the bat this is awkward, since we've constructed an event that does not resemble what the user expects. To solve this, the user must run the event through the json_parser transform, which would result in:

{
  "timestamp": "11/Dec/2013:00:01:45 -0800",
  "host": "cadenza",
  "message": "127.0.0.1 - - [11/Dec/2013:00:01:45 -0800] \"GET /xampp/status.php HTTP/1.1\" 200 3891 \"http://cadenza/xampp/navi.php\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0\"",
  "@timestamp": "2013-12-11T08:01:45.000Z",
  "@version": "1",
  "clientip": "127.0.0.1",
  "ident": "-",
  "auth": "-",
  "verb": "GET",
  "request": "/xampp/status.php",
  "httpversion": "1.1",
  "response": "200",
  "bytes": "3891",
  "referrer": "\"http://cadenza/xampp/navi.php\"",
  "agent": "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0\""
}

This is looking better, but notice that we now have both a timestamp and an @timestamp field with different values. When this event is encoded within a sink, we'll use the wrong timestamp.
This should be enough to demonstrate the awkwardness of this approach.
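The clash described above can be sketched in a few lines. This is a hypothetical illustration, not Vector's actual implementation; the ingest timestamp and host values are made up, and the merge semantics (parsed fields overwriting root fields) are an assumption for the sake of the example.

```python
import json

def socket_source(raw_line: str) -> dict:
    # The source wraps the raw line and adds its own metadata at the root.
    return {
        "timestamp": "2020-01-01T00:00:00Z",  # hypothetical ingest time
        "host": "vector-host",                # hypothetical Vector host
        "message": raw_line,
    }

def json_parser(event: dict) -> dict:
    # Parse "message" as JSON and merge the fields into the event root,
    # letting parsed fields overwrite existing ones.
    parsed = json.loads(event.pop("message"))
    return {**event, **parsed}

raw = ('{"@timestamp": "2013-12-11T08:01:45.000Z", "host": "cadenza", '
       '"timestamp": "11/Dec/2013:00:01:45 -0800"}')
event = json_parser(socket_source(raw))

# The user's fields silently clobbered Vector's metadata, and the event
# now carries two competing timestamps at the root:
print(event["timestamp"])   # "11/Dec/2013:00:01:45 -0800"
print(event["@timestamp"])  # "2013-12-11T08:01:45.000Z"
print(event["host"])        # "cadenza"
```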
Proposals
Below are a couple of proposals that have been discussed. As part of the RFC you'll need to choose the best approach and propose it.
Vector metadata namespace
Instead of polluting the user's namespace with Vector-specific fields (timestamp and message in our example above), we could move them into a Vector-specific namespace:

This solves the key clashing issue, in that we are no longer polluting the user's namespace.

raw metadata key

In addition to the above, we could also move the raw data into a dedicated raw key:

This strikes me as cleaner and much more flexible:

We retain the raw data, which I'm sure is useful in some use cases.
We know whether the event is explicitly structured or not. In a tcp -> tcp pipeline we can cleanly pass this data through.
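A minimal sketch of what this could look like, assuming a root-level _vector key holding a raw sub-key; neither name is final, and the parsing semantics shown are illustrative only:

```python
import json

def socket_source(raw_line: str) -> dict:
    # All of Vector's metadata lives in its own namespace; the user's
    # root namespace starts out empty.
    return {
        "_vector": {
            "timestamp": "2020-01-01T00:00:00Z",  # hypothetical ingest time
            "host": "vector-host",                # hypothetical Vector host
            "raw": raw_line,                      # original payload, preserved
        }
    }

def json_parser(event: dict) -> dict:
    # Parsing populates the user's root namespace from the preserved raw
    # payload; Vector's metadata cannot clash with the parsed fields.
    parsed = json.loads(event["_vector"]["raw"])
    return {**event, **parsed}

raw = '{"@timestamp": "2013-12-11T08:01:45.000Z", "host": "cadenza"}'
event = json_parser(socket_source(raw))
print(event["host"])                  # "cadenza": the user's field, intact
print(event["_vector"]["timestamp"])  # Vector's ingest time, intact
```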
Alternatives & Prior Art
It is worth exploring the Splunk forwarder's approach to this problem. From my understanding, it uses a root-level _raw key that gets removed once the data is parsed. You can see this in their docs:
If events do not have a _raw field, they'll be serialized to JSON prior to being sent.
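That behavior, as described above, can be sketched roughly as follows (this is a paraphrase of the described semantics, not Splunk's actual code):

```python
import json

def forward(event: dict) -> str:
    # Unparsed events carry their payload in a root-level "_raw" key and
    # are shipped verbatim; parsed events (no "_raw") are serialized to
    # JSON before being sent.
    if "_raw" in event:
        return event["_raw"]
    return json.dumps(event, sort_keys=True)

print(forward({"_raw": '127.0.0.1 - - "GET /xampp/status.php"'}))
print(forward({"clientip": "127.0.0.1", "verb": "GET"}))
```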
Outstanding questions
Even though we solved the problem of polluting the user's namespace, how would Vector know which timestamp key to use within each sink? Could the user tell us this through configuration? Hint: we'll need some sort of schema knowledge that tells us where to look for fields.
How do we prevent the _vector metadata key from being encoded? For example, we could default the sink-level encoding.except_fields option to ["_vector"], or we could hard-code this exclusion when encoding events.
What should we do with all of the *_key options across our sources and sinks?
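The encoding question above could be sketched as follows; the option name mirrors the discussion and is not a final API:

```python
import json

def encode(event: dict, except_fields=("_vector",)) -> str:
    # Drop the excluded fields (defaulting to the "_vector" namespace)
    # before serializing the event for the sink.
    return json.dumps(
        {k: v for k, v in event.items() if k not in except_fields},
        sort_keys=True,
    )

event = {
    "message": "hello world",
    "_vector": {"timestamp": "2020-01-01T00:00:00Z", "host": "vector-host"},
}
print(encode(event))                    # {"message": "hello world"}
print(encode(event, except_fields=()))  # metadata included
```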