vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Log schemas #3910

Closed: vector-vic closed this issue 1 year ago

vector-vic commented 4 years ago

A common need for Vector users is the ability to map data according to different schemas. This is a key requirement for Vector since it aims to be schema, standard, and vendor-neutral. In order to deliver on this claim, Vector must not only support a variety of schemas independently, but it must also assist in the interchange between them.

Use Cases

Transitioning to Vector

Schemas create heavy lock-in because most downstream systems depend on them. To name a few:

  1. Alerts.

  2. Graphs and dashboards.

  3. Storage systems.

  4. Humans.

Changing a schema can break all of these things, which is usually not acceptable. To prevent this, Vector must adopt the user's current schema in a way that downstream dependencies do not notice.

Transitioning Vendors

The use case above illustrates the need for Vector to support a single schema at a time, but there are also cases where a user needs to support multiple schemas at once, such as when transitioning vendors. Vector must not only support the "read" schema but also transform the data to an entirely new "write" schema.

For example, if a user is transitioning from Splunk to Elasticsearch, Vector must ingest the data under the Splunk Common Information Model and transform it to the Elastic Common Schema.
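As a rough illustration of what that interchange could look like with today's remap transform and VRL (which postdate this issue), the sketch below renames a handful of fields on the way from a Splunk HEC source to an Elasticsearch sink. The specific field names are illustrative assumptions, not a complete CIM-to-ECS mapping.

```toml
# Sketch only: the field renames below are illustrative, not a full mapping.
[sources.splunk_in]
type    = "splunk_hec"
address = "0.0.0.0:8088"

[transforms.cim_to_ecs]
type   = "remap"
inputs = ["splunk_in"]
source = '''
  # Rename a few CIM-style fields to their rough ECS equivalents.
  .source.ip      = del(.src)
  .destination.ip = del(.dest)
  .user.name      = del(.user)
  .event.action   = del(.action)
'''

[sinks.es_out]
type      = "elasticsearch"
inputs    = ["cim_to_ecs"]
endpoints = ["http://localhost:9200"]
```

A real migration would likely want these mappings expressed as a maintained, schema-aware transform rather than hand-written renames, which is exactly the gap this issue describes.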

Automatic Dashboards, Alerts, & Insights

A benefit of using a vendor's agent is that it unlocks automatic dashboards, alerts, and other features. This not only saves a considerable amount of time and effort, but it also lets you effectively delegate the management of these assets to your chosen vendor. For example, I assume that DataDog and its community continually improve their dashboards. In that case, it's very important that Vector can transparently adopt the DataDog schema so that DataDog users of Vector receive the same benefit. It also relieves us of having to maintain these assets ourselves.

Schemas

  1. Elastic Common Schema

  2. Splunk Common Information Model (CIM)

  3. OpenTelemetry Log Data Model

  4. GELF

  5. DataDog's reserved log attributes

  6. ...and more

Proposal

In short, I'm proposing that we attach the known schema to each Vector event during ingestion. This would allow us to look up fields and map them across schemas. There are a lot of little details to discuss, which we can cover in an RFC; a purely hypothetical configuration sketch follows the list below. To name a few:

  1. How would Vector detect the schema?

  2. Should Vector strictly enforce the schema? Ex: Not allowing users to add fields that would violate the schema.

  3. Should Vector reject data at the source-level that does not conform to the chosen schema?

  4. Should Vector adopt a default schema? Ex: OpenTelemetry.

  5. What happens when the user has a custom schema that we know nothing about? Ex: require them to manually map data when necessary.
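To make the proposal slightly more concrete, here is one purely hypothetical shape it could take in configuration: a `schema` option on sources to tag events with their known "read" schema, and a `schema` option on sinks to convert to the desired "write" schema. Neither option exists in Vector today; the names and values are placeholders for discussion.

```toml
# HYPOTHETICAL sketch for discussion only: the `schema` keys are not real
# Vector options. They illustrate tagging events with a known schema at
# ingestion and converting to a different schema at egress.
[sources.datadog_in]
type    = "datadog_agent"
address = "0.0.0.0:8080"
schema  = "datadog"   # hypothetical: attach the known "read" schema to each event

[sinks.es_out]
type      = "elasticsearch"
inputs    = ["datadog_in"]
endpoints = ["http://localhost:9200"]
schema    = "ecs"     # hypothetical: map from the attached schema to ECS on write
```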

vector-vic commented 4 years ago

Link to feature: https://timber.productboard.com/feature-board/planning/features/5154387

jszwedko commented 1 year ago

Closing since this was just used for tracking.

polarathene commented 10 months ago

@jszwedko you may want to update this docs page which links here to track progress?:


jszwedko commented 10 months ago

> @jszwedko you may want to update this docs page which links here to track progress?

Thanks for pointing that out! I opened https://github.com/vectordotdev/vector/pull/19256