Database general-purpose target attribute

alanwest commented 12 months ago

Proposal for new `db.target` attribute

The type of targets available for a database operation is dependent on a database system. For example, the target of a SQL database operation may be a table, function, or stored procedure. Other database systems may refer to their targets as collections or indexes, etc.

Currently, there exists database system specific attributes for describing this. Examples include:

Introducing a general-purpose attribute would enable backends to provide a generic experience for faceting database operations by their target without requiring the backend have knowledge of all the database system specific attributes. It would also enable end users to more easily explore their data at a high-level without having to know the specific database systems in use.

I propose a new call-level attribute db.target for describing the target of a database operation. The term "target" may not be the greatest, but I'm using it for lack of a better term for the moment. I considered db.operation.target but of course db.operation is already an attribute. Maybe db.operation could become db.operation.type thereby allowing db.operation.target.

Issues to consider

Similar to db.operation, the value of the db.target may not be readily available without parsing the value of db.statement, e.g. SELECT * FROM Users the db.operation is SELECT and the db.target is Users. The specification advises against parsing db.statement to derive other attributes. I consider this a limitation particularly once we have a database client operation metric (see #512). The db.statement attribute is fine for spans but because of its high-cardinality will not work for metrics. Therefore, having db.operation and db.target attributes on metrics will be useful.
There may be many targets for a single database operation, e.g. SELECT * FROM Users U JOIN Company C ON ....

Perhaps parsing db.statement can be an opt-in feature?

Various backends have prior art for addressing these issues. For example:

Elastic APM performs a limited parsing to extract the operation and outer-most table name from a SQL statement. This is described in their agent specifications.
New Relic APM agents perform a similar limited parsing and includes operation/target on metric data. New Relic's agent specs are unfortunately not publicly available, so I do not have a link to share. Though here is New Relic Java agent source code that performs the parsing.
New Relic agents offer a feature for providing a "hint" of the database target. The hint can be added as a comment prior to the database statement to simplify parsing, e.g., /* nrhint:Company */ SELECT * FROM Users U JOIN Company C ON ....

pyohannes commented 12 months ago

In the messaging space we have a similar scenario, there we settled on the more or less generic term destination, which covers both queues and topics.

We also ended up making messaging.destination a namespace. There's messaging.destination.name, and then there are further attributes to describe the destination. Maybe something similar makes sense if one decides to go with the db.target approach.

mateuszrzeszutek commented 12 months ago

Perhaps parsing db.statement can be an opt-in feature?

FWIW the javaagent already does that. We have a query sanitization feature that replaces all the query variables with ?. It uses a performant tokenizer that scans the query looking for the interesting parts: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation-api-semconv/src/main/jflex/SqlSanitizer.jflex Since we're already iterating over the query once, we figured we might as well try to extract the table name while we're at it.

jcocchi commented 9 months ago

Currently, there exists database system specific attributes for describing this. Examples include:

Cosmos DB also has a custom attribute for this called db.cosmosdb.container. Standardizing on a term for the semantic convention makes sense to me.

Perhaps parsing db.statement can be an opt-in feature?

Wouldn't the source library be responsible for populating the db.target, so each library would presumably only need to know how to parse its own db.statement (vs. the APM tool doing it)? Cosmos DB currently doesn't populate db.statement at all, and it doesn't apply to most of our operations.

dicko2 commented 9 months ago

This is one of our biggest barriers from moving from app insights to otel. App insights includes the whole statement, we use mainly stored procs though, and where we don't its parameterised query strings. It would cause cardinality issues if people are injecting values into the query. But I think people shouldn't be doing this anyway.

I don't see the value of the target begin the table alone. Think about what you need to debug a problem. If you put only a table then these two queries will be grouped together

SELECT * FROM users

SELECT name FROM users WHERE userid = @userid

They will have drastically different performance right? So why group them?

I think adding the whole query is the way to go imo.

trask commented 9 months ago

It would cause cardinality issues if people are injecting values into the query. But I think people shouldn't be doing this anyway.

I believe even properly parameterized statements are too high-cardinality for metrics, e.g. it's not uncommon to have 100's of tables

dicko2 commented 9 months ago

If you don't have the query. I would question the use of the metric. What are you measuring?

trask commented 9 months ago

Naming idea from semconv meeting: db.entity

Related: "entity" term may be used for identifying resource attributes

jack-berg commented 9 months ago

Compiling the query syntax docs from a sampling of the different db systems we reference:

In postgres you select from a from_item
In mysql you select from table_references
In redis you search an index
In mongodb you query a collection
In DynamoDb you query a TableName
In Cassandra you query a table_name
In Elasticsearch you query against an index
In cosmosdb you query a container

db.target seems like a decent choice.

I also like db.collection, since it implies a collections of rows, columns, records, documents, entries in an index, etc.

dicko2 commented 8 months ago

You could also be calling a stored procedure too

joaopgrassi commented 8 months ago

If you don't have the query. I would question the use of the metric. What are you measuring?

One could measure all db.operation in a certain db.target - say "Give me a count of all SELECT in target MYTABLE. Or "Give me a count of all EXEC in target MY_SPROC". If that's useful, that's another question.

The statement in a metric makes not much sense, but makes total sense in spans.

lmolkova commented 8 months ago

Assuming I do cross-table query, would db.operation and db.target be arrays? Also, it'd be great to have it as opt-in on metrics, but array attribute on metrics is questionable.

I.e.

the default experience (for SQL) is not super-helpful (no table name, no statement, no operation)
opt-in statement parsing can add db.target and db.operation, but it's still far from perfect, especially if there is more than one operation/target

An 'ideal' solution as I see it would be if users could optionally give queries a nick-name which would appear on spans and metrics. We don't have a common way to achieve it, but if there was one and we could get this nick-name, we would put it into attribute like db.query.name.

The propagation mechanism could look like this:

Context dbScope = Context.current();
try(Scope scope = dbScope.with(ContextKey.named("db.query.name"), "get user by id").makeCurrent()) {
    // SELECT name FROM users WHERE userid = @userid
}

It could be fragile in some languages/cases and probably needs a spec change to make it possible in all languages.

What we can do now is to define such attribute, add it to metrics as conditionally_required: when available (even though it never is). This way semconv can go stable, default experience would not be great, but we'll have means in the semconv to create a good one.

We have similar cases (messaging.destination.template, http.route on client)

lmolkova commented 8 months ago

Another option could be that: assuming db.statement can be templated, user can opt-in into collecting a limited number of db.statement on metrics.

We could potentially extend metrics advisory params api to configure an explicit allow-list for db.statement values.

this way cardinality would be fully in the user hands.

trask commented 6 months ago

I think this is resolved by #870?

open-telemetry / semantic-conventions