open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0

Cosmos DB: Network level Metrics #1495

Open sourabh1007 opened 1 month ago

sourabh1007 commented 1 month ago

Area(s)

area:db

Is your change request related to a problem? Please describe.

In the Cosmos DB SDK, a single operation involves several network calls. Currently, if something goes wrong (e.g., high latency) with these network calls, customers rely solely on the logs they have implemented in their applications. When investigating such issues, we are dependent on the information provided by the customer and backend telemetry. To improve monitoring and make it more aligned with potential errors, I am proposing a set of metrics that the SDK should collect to enhance observability.

Describe the solution you'd like

I propose the following metrics for network calls. The SDK makes two kinds of network calls: 1) Gateway (i.e., HTTP) and 2) TCP (i.e., RNTBD, a protocol proprietary to Microsoft).

Gateway (Meter name: Azure.Cosmos.Client.Request)

We cannot use the default HTTP metrics because these metrics need our own custom dimensions. The proposed dimensions and metrics are listed below:

Dimensions

| Tag/dimension name | Sample value |
| --- | --- |
| db.system | cosmosdb |
| db.collection.name | myCollectionName |
| db.namespace | myDatabaseName |
| server.address | myaccountname.documents.azure.com |
| server.port | 443 |
| db.operation.name | query_items |
| db.response.status_code | 200 or 429 etc. |
| db.cosmosdb.sub_status_code | 1002 etc. |
| db.cosmosdb.consistency_level | Eventual, ConsistentPrefix, BoundedStaleness, Strong or Session |
| network.protocol.name | http for gateway mode, rntbd for direct mode |
| network.protocol.host | host from `http://<host>:<port>` |
| network.protocol.port | port from `http://<host>:<port>` |
| cloud.region | region name, where request was sent |
| db.cosmosdb.network.response.status_code | 200 or 429 etc. |
| db.cosmosdb.network.response.sub_status_code | 1002 etc. |
| db.cosmosdb.network.routingid (opt-in) | pkrangeid (gateway mode), partitionid/replicaid (direct mode) |

Metrics

| Name | Unit | Type | Description |
| --- | --- | --- | --- |
| db.client.cosmosdb.request.duration | {seconds} | Histogram | Duration of client requests. |
| db.client.cosmosdb.request.count | {requests} | Histogram | Number of requests made. |
| db.client.cosmosdb.request.body.size | By | Histogram | Size of client request bodies. |
| db.client.cosmosdb.response.body.size | By | Histogram | Size of client response bodies. |
| db.client.cosmosdb.request.channel_aquisition.duration | {seconds} | Histogram | Duration of successfully established outbound TCP connections, i.e., channel acquisition time (for direct mode) |
| db.server.cosmosdb.request.duration | {seconds} | Histogram | Backend latency (for direct mode) |
| db.client.cosmosdb.request.pipelined.duration | {seconds} | Histogram | Time spent on "pipelined" stage (for direct mode) |
| db.client.cosmosdb.request.transit.duration | {seconds} | Histogram | Time spent on the wire (for direct mode) |
| db.client.cosmosdb.request.received.duration | {seconds} | Histogram | Time spent on "Received" stage (for direct mode) |
| db.client.cosmosdb.request.completed.duration | {seconds} | Histogram | Time spent on "Completed" stage (for direct mode) |
| db.client.cosmosdb.request.failed.duration | {seconds} | Histogram | Time spent on "Failed" stage (for direct mode) |
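
To make the proposal more concrete, here is a rough sketch of how one network call could be tagged and recorded. It is an illustration only, not SDK code: `build_dimensions` and `InMemoryHistogram` are hypothetical names, the histogram is a plain in-memory stand-in rather than a real metrics API, and only a subset of the proposed dimensions is filled in. The attribute keys, the metric name, and the sample values are taken from the tables above.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Mapping taken from the proposal: http for gateway mode, rntbd for direct mode.
PROTOCOL_BY_MODE = {"gateway": "http", "direct": "rntbd"}


def build_dimensions(account_endpoint, request_url, mode, database, collection,
                     operation, status_code, sub_status_code, region):
    """Hypothetical helper: assemble a subset of the proposed dimensions."""
    account = urlsplit(account_endpoint)
    target = urlsplit(request_url)
    return {
        "db.system": "cosmosdb",
        "db.namespace": database,
        "db.collection.name": collection,
        "db.operation.name": operation,
        "server.address": account.hostname,
        "server.port": account.port,
        "network.protocol.name": PROTOCOL_BY_MODE[mode],
        # Host and port are read from the URL of the individual network call.
        "network.protocol.host": target.hostname,
        "network.protocol.port": target.port,
        "cloud.region": region,
        "db.cosmosdb.network.response.status_code": status_code,
        "db.cosmosdb.network.response.sub_status_code": sub_status_code,
    }


class InMemoryHistogram:
    """Stand-in for a real histogram: keeps raw values per dimension set."""

    def __init__(self, name):
        self.name = name
        self.series = defaultdict(list)

    def record(self, value, attributes):
        # Each distinct combination of dimension values forms its own series.
        key = tuple(sorted(attributes.items()))
        self.series[key].append(value)


request_duration = InMemoryHistogram("db.client.cosmosdb.request.duration")

dims = build_dimensions(
    account_endpoint="https://myaccountname.documents.azure.com:443/",
    request_url="https://myaccountname.documents.azure.com:443/",
    mode="gateway",
    database="myDatabaseName",
    collection="myCollectionName",
    operation="query_items",
    status_code=200,
    sub_status_code=1002,
    region="West US",  # illustrative region name
)

# Two calls with identical dimensions end up in the same series.
request_duration.record(0.042, dims)
request_duration.record(0.051, dims)
```

Because every distinct combination of dimension values creates a separate series, a high-cardinality dimension such as `db.cosmosdb.network.routingid` multiplies the number of series, which is presumably why it is marked opt-in.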

Describe alternatives you've considered

No response

Additional context

Ref. Java SDK metrics: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos/docs/Metrics.md

joaopgrassi commented 1 month ago

CC @open-telemetry/semconv-db-approvers