Is your change request related to a problem? Please describe.
In the Cosmos DB SDK, a single operation involves several network calls. Currently, if something goes wrong (e.g., high latency) with these network calls, customers rely solely on the logs they have implemented in their applications. When investigating such issues, we are dependent on the information provided by the customer and backend telemetry. To improve monitoring and make it more aligned with potential errors, I am proposing a set of metrics that the SDK should collect to enhance observability.
Describe the solution you'd like
The metrics proposed below cover the SDK's network calls. The SDK makes two kinds of network calls:
1) Gateway (i.e. HTTP)
2) TCP (i.e. RNTBD, proprietary to Microsoft)
Gateway (Meter name: Azure.Cosmos.Client.Request)
We cannot use the default HTTP metrics because these metrics need our own custom dimensions. The proposed metrics and their dimensions are listed below:
Dimensions
| Tag/dimension name | Sample value |
| --- | --- |
| db.system | cosmosdb |
| db.collection.name | myCollectionName |
| db.namespace | myDatabaseName |
| server.address | myaccountname.documents.azure.com |
| server.port | 443 |
| db.operation.name | query_items |
| db.response.status_code | 200 or 429 etc. |
| db.cosmosdb.sub_status_code | 1002 etc. |
| db.cosmosdb.consistency_level | Eventual, ConsistentPrefix, BoundedStaleness, Strong or Session |
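As a rough illustration of how these dimensions could be attached to a single network call, here is a minimal Python sketch. The helper name `build_request_dimensions` and its parameters are hypothetical and not part of any SDK; the tag names and sample values come from the list above, and the `db.system` value is written as `cosmosdb` here.

```python
from typing import Dict, Optional, Union

AttributeValue = Union[str, int]

# Consistency levels listed as sample values in the proposal.
CONSISTENCY_LEVELS = {
    "Eventual", "ConsistentPrefix", "BoundedStaleness", "Strong", "Session",
}


def build_request_dimensions(
    account_host: str,
    database: str,
    collection: str,
    operation: str,
    status_code: int,
    sub_status_code: Optional[int] = None,
    consistency_level: str = "Session",
    port: int = 443,
) -> Dict[str, AttributeValue]:
    """Assemble the proposed tag set for one network call (hypothetical helper)."""
    if consistency_level not in CONSISTENCY_LEVELS:
        raise ValueError(f"unknown consistency level: {consistency_level}")
    dims: Dict[str, AttributeValue] = {
        "db.system": "cosmosdb",
        "db.namespace": database,
        "db.collection.name": collection,
        "server.address": account_host,
        "server.port": port,
        "db.operation.name": operation,
        "db.response.status_code": status_code,
        "db.cosmosdb.consistency_level": consistency_level,
    }
    # Assumption: the sub-status tag is only attached when the service returned one.
    if sub_status_code is not None:
        dims["db.cosmosdb.sub_status_code"] = sub_status_code
    return dims
```

For example, a throttled query could be tagged with `build_request_dimensions("myaccountname.documents.azure.com", "myDatabaseName", "myCollectionName", "query_items", 429)`. Whether the sub-status tag should be omitted or set to a default when absent is an open design choice that this sketch does not settle.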
Area(s)
area:db
http for gateway mode, rntbd for direct mode
http://<host>:<port>
http://<host>:<port>
| Metric name | Unit |
| --- | --- |
| db.client.cosmosdb.request.duration | {seconds} |
| db.client.cosmosdb.request.count | {requests} |
| db.client.cosmosdb.request.body.size | By |
| db.client.cosmosdb.response.body.size | By |
| db.client.cosmosdb.request.channel_acquisition.duration | {seconds} |
| db.server.cosmosdb.request.duration | {seconds} |
| db.client.cosmosdb.request.pipelined.duration | {seconds} |
| db.client.cosmosdb.request.transit.duration | {seconds} |
| db.client.cosmosdb.request.received.duration | {seconds} |
| db.client.cosmosdb.request.completed.duration | {seconds} |
| db.client.cosmosdb.request.failed.duration | {seconds} |
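To make the intent concrete, the following Python sketch shows one way the basic request metrics could be recorded per network call, grouped by their dimensions. It is only an illustration under stated assumptions: `InMemoryMetrics` and `record_request` are hypothetical stand-ins for a real metrics API, and only four of the proposed instruments are modelled, with names and units taken from the list above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple, Union

AttributeValue = Union[str, int]
Number = Union[int, float]

# Instrument names and units copied from the proposal.
INSTRUMENT_UNITS = {
    "db.client.cosmosdb.request.duration": "{seconds}",
    "db.client.cosmosdb.request.count": "{requests}",
    "db.client.cosmosdb.request.body.size": "By",
    "db.client.cosmosdb.response.body.size": "By",
}


class InMemoryMetrics:
    """Hypothetical stand-in for a metrics backend: stores raw measurements."""

    def __init__(self) -> None:
        self._points: Dict[Tuple, List[Number]] = defaultdict(list)

    @staticmethod
    def _key(name: str, dimensions: Dict[str, AttributeValue]) -> Tuple:
        # Sort the tags so the same dimension set always maps to the same series.
        return (name, tuple(sorted(dimensions.items())))

    def record(self, name: str, value: Number,
               dimensions: Dict[str, AttributeValue]) -> None:
        if name not in INSTRUMENT_UNITS:
            raise KeyError(f"unknown instrument: {name}")
        self._points[self._key(name, dimensions)].append(value)

    def values(self, name: str,
               dimensions: Dict[str, AttributeValue]) -> List[Number]:
        return list(self._points.get(self._key(name, dimensions), []))


def record_request(metrics: InMemoryMetrics,
                   dimensions: Dict[str, AttributeValue],
                   duration_seconds: float,
                   request_bytes: int,
                   response_bytes: int) -> None:
    """Record one network call against the four basic instruments."""
    metrics.record("db.client.cosmosdb.request.duration", duration_seconds, dimensions)
    metrics.record("db.client.cosmosdb.request.count", 1, dimensions)
    metrics.record("db.client.cosmosdb.request.body.size", request_bytes, dimensions)
    metrics.record("db.client.cosmosdb.response.body.size", response_bytes, dimensions)
```

A caller would invoke `record_request` once per network call, passing the dimension set for that call. In a real SDK the storage class would be replaced by the platform's metrics API, and the remaining duration instruments would be recorded from the corresponding request timeline events.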
Describe alternatives you've considered
No response
Additional context
Reference: Java SDK metrics: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos/docs/Metrics.md