vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Log schemas #3910

Closed: vector-vic closed this issue 1 year ago

vector-vic commented 4 years ago

A common need for Vector users is the ability to map data according to different schemas. This is a key requirement for Vector since it aims to be schema, standard, and vendor-neutral. In order to deliver on this claim, Vector must not only support a variety of schemas independently, but it must also assist in the interchange between them.

Use Cases

Transitioning to Vector

Schemas create heavy lock-in because most downstream systems depend on them. To name a few:

  1. Alerts.

  2. Graphs and dashboards.

  3. Storage systems.

  4. Humans.

Changing a schema can break all of these things, which is usually not acceptable. To prevent this, Vector must adopt the user's current schema in a way that downstream dependencies do not notice.

Transitioning Vendors

The use case above illustrates the need for Vector to support a single schema at a time, but there are also cases where a user needs to support multiple schemas at once, such as when transitioning vendors. Vector must not only support the "read" schema but also transform the data to an entirely new "write" schema.

For example, if a user is transitioning from Splunk to Elasticsearch, Vector must ingest the data under the Splunk Common Information Model and transform it to the Elastic Common Schema.
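As a rough illustration of what that interchange could look like with today's remap transform and VRL (which postdate this issue), the sketch below renames a handful of fields on the way from a Splunk HEC source to an Elasticsearch sink. The specific field names are illustrative assumptions, not a complete CIM-to-ECS mapping.

```toml
# Sketch only: the field renames below are illustrative, not a full mapping.
[sources.splunk_in]
type    = "splunk_hec"
address = "0.0.0.0:8088"

[transforms.cim_to_ecs]
type   = "remap"
inputs = ["splunk_in"]
source = '''
  # Rename a few CIM-style fields to their rough ECS equivalents.
  .source.ip      = del(.src)
  .destination.ip = del(.dest)
  .user.name      = del(.user)
  .event.action   = del(.action)
'''

[sinks.es_out]
type      = "elasticsearch"
inputs    = ["cim_to_ecs"]
endpoints = ["http://localhost:9200"]
```

A real migration would likely want these mappings expressed as a maintained, schema-aware transform rather than hand-written renames, which is exactly the gap this issue describes.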

Automatic Dashboards, Alerts, & Insights

A benefit of using a vendor's agent is that it unlocks automatic dashboards, alerts, and other features. This not only saves a considerable amount of time and effort, but it also lets you effectively delegate the management of these assets to your chosen vendor. For example, I assume that DataDog and its community continually improve their dashboards. In that case, it's very important that Vector can transparently adopt the DataDog schema so that DataDog users of Vector receive the same benefit. It also relieves us of having to maintain these assets ourselves.

Schemas

  1. Elastic Common Schema

  2. Splunk Common Information Model (CIM)

  3. OpenTelemetry Log Data Model

  4. GELF

  5. DataDog's reserved log attributes

  6. ...and more

Proposal

In short, I'm proposing that we attach the known schema to each Vector event during ingestion. This would allow us to look up fields and map them across schemas. There are a lot of little details to discuss, which we can cover in an RFC; a purely hypothetical configuration sketch follows the list below. To name a few:

  1. How would Vector detect the schema?

  2. Should Vector strictly enforce the schema? Ex: Not allowing users to add fields that would violate the schema.

  3. Should Vector reject data at the source-level that does not conform to the chosen schema?

  4. Should Vector adopt a default schema? Ex: OpenTelemetry.

  5. What happens when the user has a custom schema that we know nothing about? Ex: require them to manually map data when necessary.
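To make the proposal slightly more concrete, here is one purely hypothetical shape it could take in configuration: a `schema` option on sources to tag events with their known "read" schema, and a `schema` option on sinks to convert to the desired "write" schema. Neither option exists in Vector today; the names and values are placeholders for discussion.

```toml
# HYPOTHETICAL sketch for discussion only: the `schema` keys are not real
# Vector options. They illustrate tagging events with a known schema at
# ingestion and converting to a different schema at egress.
[sources.datadog_in]
type    = "datadog_agent"
address = "0.0.0.0:8080"
schema  = "datadog"   # hypothetical: attach the known "read" schema to each event

[sinks.es_out]
type      = "elasticsearch"
inputs    = ["datadog_in"]
endpoints = ["http://localhost:9200"]
schema    = "ecs"     # hypothetical: map from the attached schema to ECS on write
```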

vector-vic commented 4 years ago

Link to feature: https://timber.productboard.com/feature-board/planning/features/5154387

jszwedko commented 1 year ago

Closing since this was just used for tracking.

polarathene commented 10 months ago

@jszwedko you may want to update this docs page which links here to track progress?:


jszwedko commented 10 months ago

> @jszwedko you may want to update this docs page which links here to track progress?

Thanks for pointing that out! I opened https://github.com/vectordotdev/vector/pull/19256