Add database resource specification

portertech commented 1 year ago

What are you trying to achieve?

Add a database resource specification for attributes that are consistent and identifying across all database receivers in the collector.

Additional context.

We (Sumo Logic) are looking to create contrib and spec issues that propose adding resource attribute(s) to mysqlreceiver; please see resource_attributes. Specifically, we want a human-friendly identifier of the database that can be used for filtering etc. Tracing uses db.name, but it has a caveat: “In some SQL databases, the database name to be used is called “schema name”. In case there are multiple layers that could be considered for database name (e.g. Oracle instance name and schema name), the database name to be used is the more specific layer (e.g. Oracle schema name)”. I suspect this is fine in the case of MySQL, since “In MySQL, physically, a schema is synonymous with a database”. This exploration led to the realization that there's a need for a database resource specification.

Related mysqlreceiver metrics effort.

Possible Database Resource Attributes

Considering the following attributes at this time:

`db.cluster.name`

Name of the database cluster, configured manually by the user. It serves as a human-friendly database identifier.

`db.cluster.address`

Network address used by end users to connect to the database. There can be several addresses used to connect to a single database and they may be different than the address used to collect metrics. Tracing has net.peer.addr, but this attribute doesn’t make much sense in the metrics context.

`db.cluster.port`

Network port used by end users to connect to the database. Tracing has net.peer.port, but this attribute doesn’t make much sense in the metrics context.
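To make the proposal concrete, here is a minimal sketch of how the three attributes might be assembled into a resource attribute map. The `DbClusterResource` class, its method, and the example values are hypothetical and exist only for illustration; they are not part of any receiver or SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DbClusterResource:
    """Hypothetical container for the three proposed attributes."""

    name: str                      # db.cluster.name, configured manually by the user
    address: Optional[str] = None  # db.cluster.address, the address end users connect to
    port: Optional[int] = None     # db.cluster.port, the port end users connect to

    def to_attributes(self) -> dict:
        """Return only the attributes that were actually configured."""
        candidates = {
            "db.cluster.name": self.name,
            "db.cluster.address": self.address,
            "db.cluster.port": self.port,
        }
        return {key: value for key, value in candidates.items() if value is not None}


# Illustrative values only.
attrs = DbClusterResource("orders-prod", "orders.db.example.com", 3306).to_attributes()
```

Leaving out unset values keeps the address and port optional, which matters because the address used to collect metrics may differ from the one end users connect to.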

Current Concerns

Prometheus Exporter

Raised by @jsuereth:

Specification of export of resource attributes: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/data-model.md#resource-attributes-1

Implications:

A database metric, e.g. database.commits, would not carry the cluster name, cluster address, cluster port, etc.
Those attributes would instead live on a target_info metric, and you would have to link the two metrics together.
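A rough sketch of what that linking involves, assuming the two metrics share `job` and `instance` labels as the join keys. The metric names, label names, and values below are made up for illustration, and `enrich` is a hypothetical helper, not a Prometheus or collector API.

```python
# Samples as (metric name, labels, value); everything here is illustrative.
samples = [
    ("database_commits", {"job": "mysql", "instance": "db-1:9104"}, 1250.0),
    (
        "target_info",
        {
            "job": "mysql",
            "instance": "db-1:9104",
            "db_cluster_name": "orders-prod",
            "db_cluster_address": "orders.db.example.com",
            "db_cluster_port": "3306",
        },
        1.0,
    ),
]


def enrich(samples, join_keys=("job", "instance")):
    """Copy target_info labels onto every other sample with matching join keys."""
    info = {}
    for name, labels, _ in samples:
        if name == "target_info":
            key = tuple(labels.get(k) for k in join_keys)
            info[key] = {k: v for k, v in labels.items() if k not in join_keys}

    enriched = []
    for name, labels, value in samples:
        if name == "target_info":
            continue
        key = tuple(labels.get(k) for k in join_keys)
        # Samples without a matching target_info are passed through unchanged.
        enriched.append((name, {**labels, **info.get(key, {})}, value))
    return enriched


result = enrich(samples)
```

The point is that the cluster attributes are reachable only through this extra join step, which every query or dashboard that filters by cluster would need to repeat.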

portertech commented 1 year ago

@djaglowski I'm curious to hear your thoughts on this and get your take on the possible attributes that I've proposed above.

djaglowski commented 1 year ago

I may be in the minority on this but I am very skeptical that any broad set of database technologies can share much of a metadata model without requiring an unreasonable number of caveats and disclaimers. The architectures don't quite line up, so the attributes won't line up either. There are enough similarities to make this attractive, but I would argue that the model will not generalize well, that the exercise of developing it will be unnecessarily problematic, and that the utility of the result will be low.

In my opinion, we'd be better off defining a set of attributes per technology, each highly accurate to that one specific technology.

but it has a caveat: “In some SQL databases, the database name to be used is called “schema name”. In case there are multiple layers that could be considered for database name (e.g. Oracle instance name and schema name), the database name to be used is the more specific layer (e.g. Oracle schema name)”

This is a great example of the type of problem I would expect to see many times over with the "unified" approach. Seemingly similar architectures quickly diverge when you get into the details. The model becomes less useful as you add more technologies, because you end up having to add more and more caveats and special accommodations.

All that said, I don't want to stand in the way if others in the community want to pursue this approach. My suggestion then would be to work through a reasonable cross section of databases and the architectural elements we may wish to represent in a unified model.

What are the broad architectural categories that should be included or excluded from the model? (e.g. relational, columnar, nosql, cloud, graph, in-memory, etc)
What are the cross-cutting elements that we expect to include in a unified data model? (e.g. clusters, nodes, instances, schemas, databases, tables, indices, etc)

We don't have to pin everything down up front, but I think we should prove to ourselves that this approach is tractable. To me, that means we can do the following: 1) Define a list of representative database technologies which we believe could share meaningful parts of this unified model. 2) Identify a list of architectural elements that apply to at least a proportion of these databases. 3) Articulate how each database in list 1 would map its architectural elements into those defined in list 2.

What is primarily important in my opinion is the degree to which we find ourselves trying to fit square pegs into round holes. This should help clarify the extent to which a unified model may be useful.

Database Type	Cluster	Node	Instance	Database	Schema	Table	Index
mysql
postgresql
oracle
sqlserver
cassandra
couchbase
mongodb
elasticsearch
hbase
redis
aerospike
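One way to work through a matrix like this is to record, per database, which unified element maps to which native concept, then list what is still unmapped. The sketch below is only an illustration: the two mappings reflect one reading of the db.name caveat and the MySQL note quoted earlier in this thread, every other cell is deliberately left empty, and `unmapped` is a hypothetical helper.

```python
ELEMENTS = ["cluster", "node", "instance", "database", "schema", "table", "index"]

# unified element -> native concept. Only relationships mentioned in this
# thread are filled in; everything else still has to be worked out.
MAPPINGS = {
    # "In MySQL, physically, a schema is synonymous with a database"
    "mysql": {"database": "database", "schema": "database"},
    # db.name caveat: use the more specific layer, e.g. the Oracle schema name
    "oracle": {"database": "schema"},
    "postgresql": {},
}


def unmapped(mappings, elements=ELEMENTS):
    """Return, per database, the unified elements that have no mapping yet."""
    return {
        db: [element for element in elements if element not in mapping]
        for db, mapping in mappings.items()
    }


gaps = unmapped(MAPPINGS)
```

Long gap lists, or cells that need a caveat to fill in, would be a signal of how many square pegs the unified model is being asked to accept.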

open-telemetry / opentelemetry-specification