vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Schema metadata RFC #4599

Closed binarylogic closed 2 years ago

binarylogic commented 3 years ago

As a first step towards supporting various schemas, we want to shift the event metadata into a Vector-specific namespace. This will resolve a number of awkward situations where Vector's metadata clashes with the user's schema.

Examples

Transitioning from Logstash

Let's look at a simple example where a user is replacing Logstash with Vector. Vector would receive data from upstream beats over TCP in the following format:

{
        "message": "127.0.0.1 - - [11/Dec/2013:00:01:45 -0800] \"GET /xampp/status.php HTTP/1.1\" 200 3891 \"http://cadenza/xampp/navi.php\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0\"",
     "@timestamp": "2013-12-11T08:01:45.000Z",
       "@version": "1",
           "host": "cadenza",
       "clientip": "127.0.0.1",
          "ident": "-",
           "auth": "-",
      "timestamp": "11/Dec/2013:00:01:45 -0800",
           "verb": "GET",
        "request": "/xampp/status.php",
    "httpversion": "1.1",
       "response": "200",
          "bytes": "3891",
       "referrer": "\"http://cadenza/xampp/navi.php\"",
          "agent": "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0\""
}

Here we can see that the event is already enriched with metadata. When the event leaves the Vector socket source, it will be formatted as follows:

{
  "timestamp": "...",
  "host": "...",
  "message": "{\n\"message\": \"127.0.0.1 - - [11/Dec/2013:00:01:45 -0800] \\\"GET /xampp/status.php HTTP/1.1\\\" 200 3891 \\\"http://cadenza/xampp/navi.php\\\" \\\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0\\\"\",\n\"@timestamp\": \"2013-12-11T08:01:45.000Z\",\n\"@version\": \"1\",\n\"host\": \"cadenza\",\n\"clientip\": \"127.0.0.1\",\n\"ident\": \"-\",\n\"auth\": \"-\",\n\"timestamp\": \"11/Dec/2013:00:01:45 -0800\",\n\"verb\": \"GET\",\n\"request\": \"/xampp/status.php\",\n\"httpversion\": \"1.1\",\n\"response\": \"200\",\n\"bytes\": \"3891\",\n\"referrer\": \"\\\"http://cadenza/xampp/navi.php\\\"\",\n\"agent\": \"\\\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0\\\"\"\n}\n"
}
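
To make the wrapping concrete, here is a minimal Python sketch. It is not Vector's implementation: the function name `socket_source_event` and the placeholder host and timestamp values are hypothetical. It only shows that the whole upstream payload ends up as an opaque string under message, next to Vector's own timestamp and host.

```python
import json


def socket_source_event(raw_line: str, peer_host: str, received_at: str) -> dict:
    # Hypothetical model of a socket source: the payload is not parsed,
    # it is stored verbatim as a string under "message".
    return {"timestamp": received_at, "host": peer_host, "message": raw_line}


# A trimmed-down version of the upstream event from the example above.
upstream = json.dumps({
    "message": "127.0.0.1 - - \"GET /xampp/status.php HTTP/1.1\" 200 3891",
    "@timestamp": "2013-12-11T08:01:45.000Z",
    "host": "cadenza",
    "timestamp": "11/Dec/2013:00:01:45 -0800",
})

event = socket_source_event(
    upstream, peer_host="10.0.0.5", received_at="2020-01-01T00:00:00Z"
)
print(json.dumps(event, indent=2))
```

Note that the user's own host ("cadenza") is now buried inside the message string, while the root-level host is whatever Vector recorded.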

Right off the bat this is awkward, since we've constructed an event that does not resemble what the user expects. To solve this, the user must run the event through the json_parser transform, which would result in:

{
        "timestamp": "...",
        "message": "127.0.0.1 - - [11/Dec/2013:00:01:45 -0800] \"GET /xampp/status.php HTTP/1.1\" 200 3891 \"http://cadenza/xampp/navi.php\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0\"",
     "@timestamp": "2013-12-11T08:01:45.000Z",
       "@version": "1",
           "host": "cadenza",
       "clientip": "127.0.0.1",
          "ident": "-",
           "auth": "-",
      "timestamp": "11/Dec/2013:00:01:45 -0800",
           "verb": "GET",
        "request": "/xampp/status.php",
    "httpversion": "1.1",
       "response": "200",
          "bytes": "3891",
       "referrer": "\"http://cadenza/xampp/navi.php\"",
          "agent": "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0\""
}

This is looking better, but notice that we now have both a timestamp and an @timestamp field with different values. When this event is encoded within a sink, we'll use the wrong timestamp.
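
The clash can be made explicit with a small Python sketch. The helper `colliding_keys` and the placeholder values are hypothetical. The sketch makes no claim about which value wins during parsing. It only lists the root-level keys that both Vector and the user's payload want to own.

```python
def colliding_keys(event: dict, parsed: dict) -> list:
    # Keys that exist both in Vector's event and in the user's parsed payload.
    return sorted(set(event) & set(parsed))


# Fields Vector adds at the root (values are placeholders).
vector_event = {
    "timestamp": "2020-01-01T00:00:00Z",
    "host": "10.0.0.5",
    "message": "<raw payload>",
}

# A subset of the user's fields from the Logstash example.
user_payload = {
    "message": "127.0.0.1 - - ...",
    "@timestamp": "2013-12-11T08:01:45.000Z",
    "host": "cadenza",
    "timestamp": "11/Dec/2013:00:01:45 -0800",
    "verb": "GET",
}

print(colliding_keys(vector_event, user_payload))
# ['host', 'message', 'timestamp']
```

Every field Vector adds collides with a field the user already had, and the one field the user treats as the canonical time, @timestamp, is not one Vector looks at.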

This should be enough to demonstrate the awkwardness of this approach.

Proposals

Below are a couple of proposals that have been discussed. As part of the RFC you'll need to choose the best approach and propose it.

Vector metadata namespace

Instead of polluting the user's namespace with Vector-specific fields (timestamp and host in our example above), we could shove them in a Vector-specific namespace:

{
   "_vector": {
     "timestamp": "...",
     "host": "..."
   },
   "message": "..."
}

This solves the key clashing issues in that we are no longer polluting the user's namespace.
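
As a rough illustration, assuming the metadata keys are timestamp and host and the namespace is `_vector` as in the example, the move could be sketched in Python like this. The function `namespace_metadata` is hypothetical and not part of Vector.

```python
def namespace_metadata(event, metadata_keys=("timestamp", "host"), namespace="_vector"):
    # Everything that is not Vector metadata stays at the root.
    out = {k: v for k, v in event.items() if k not in metadata_keys}
    # Vector's own fields move under the namespace key.
    out[namespace] = {k: event[k] for k in metadata_keys if k in event}
    return out


flat = {"timestamp": "2020-01-01T00:00:00Z", "host": "10.0.0.5", "message": "..."}
namespaced = namespace_metadata(flat)
print(namespaced)

# The user's fields can now sit at the root without overwriting Vector's.
user_fields = {"timestamp": "11/Dec/2013:00:01:45 -0800", "host": "cadenza"}
merged = {**namespaced, **user_fields}
print(merged["timestamp"], "|", merged["_vector"]["timestamp"])
```

After the merge, both timestamps survive: the user's at the root and Vector's under `_vector`.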

raw metadata key

In addition to the above, we could also move raw data into a specific raw key:

{
   "_vector": {
     "timestamp": "...",
     "host": "...",
     "raw": "..."
   }
}

This strikes me as cleaner and much more flexible:

  1. We retain the raw data which I'm sure is useful in some use cases.
  2. We know if the event is explicitly structured or not. In the tcp -> tcp pipeline we can cleanly pass this data through.
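
Both points can be sketched in Python. This is one possible reading of the proposal, not a settled design: the helpers `is_structured`, `parse_raw_json`, and `encode_for_sink` are hypothetical, and the rule "an event is structured when it has any root key besides `_vector`" is an assumption.

```python
import json

NAMESPACE = "_vector"


def is_structured(event: dict) -> bool:
    # Assumption: any root key other than the namespace means the event was parsed.
    return any(key != NAMESPACE for key in event)


def parse_raw_json(event: dict) -> dict:
    # Parsed fields go to the root; the raw data is retained under the namespace.
    parsed = json.loads(event[NAMESPACE]["raw"])
    return {**parsed, NAMESPACE: event[NAMESPACE]}


def encode_for_sink(event: dict) -> str:
    # Unstructured events (e.g. tcp -> tcp) pass the raw data through untouched.
    if not is_structured(event):
        return event[NAMESPACE]["raw"]
    return json.dumps({k: v for k, v in event.items() if k != NAMESPACE})


raw_line = '{"verb": "GET", "response": "200"}'
event = {NAMESPACE: {"timestamp": "...", "host": "...", "raw": raw_line}}

print(is_structured(event))        # False
print(encode_for_sink(event))      # the original line, byte for byte

parsed_event = parse_raw_json(event)
print(is_structured(parsed_event))  # True
print(encode_for_sink(parsed_event))
```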

Alternatives & Prior Art

It is worth exploring Splunk forwarder's approach to this problem. From my understanding, they use a root-level _raw key that gets removed once the data is parsed. You can see this in their docs:

If events do not have a _raw field, they'll be serialized to JSON prior to being sent.

Outstanding questions

  1. Even though we solved the problem of polluting the user's namespace, how would Vector know which timestamp key to use within each sink? Could the user tell us this through configuration? Hint: we'll need some sort of schema knowledge that tells us where to look for fields.
  2. How do we prevent the _vector metadata key from being encoded? For example, we could default the sink-level encoding.except_fields to ["_vector"], or we could hard-code ignoring this key when we encode events.
  3. What should we do with all of the *_key options across our sources and sinks?
  4. How do we preserve backward compatibility?
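
To help frame questions 1 and 2, here is a purely illustrative Python sketch of one possible shape for the answers. Nothing here is decided: `encode_with_exclusions`, `resolve_timestamp`, and the `timestamp_key` parameter are hypothetical names, and the fallback to `_vector.timestamp` is an assumption.

```python
import json


def encode_with_exclusions(event: dict, except_fields=("_vector",)) -> str:
    # Question 2: drop the metadata namespace by default when encoding.
    return json.dumps({k: v for k, v in event.items() if k not in except_fields})


def resolve_timestamp(event: dict, timestamp_key=None):
    # Question 1: a user-configured key wins when present in the event;
    # otherwise fall back to the timestamp Vector recorded.
    if timestamp_key is not None and timestamp_key in event:
        return event[timestamp_key]
    return event["_vector"]["timestamp"]


event = {
    "_vector": {"timestamp": "2020-01-01T00:00:00Z", "host": "10.0.0.5"},
    "@timestamp": "2013-12-11T08:01:45.000Z",
    "verb": "GET",
}

print(encode_with_exclusions(event))
print(resolve_timestamp(event, timestamp_key="@timestamp"))
print(resolve_timestamp(event))
```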

jszwedko commented 2 years ago

Closing in lieu of https://github.com/vectordotdev/vector/issues/12187