vaticle / typedb

TypeDB: the polymorphic database powered by types
https://typedb.com
Mozilla Public License 2.0

Update diagnostics service to version 1 #7059

Closed farost closed 1 month ago

farost commented 1 month ago

Usage and product changes

We introduce an updated version of diagnostics sent from a TypeDB server.

  1. config.yml gets a new field, deploymentID, in the diagnostics section. This field should be used to collect the data from multiple servers of a single TypeDB Cloud deployment under one identifier.
  2. The updated diagnostics data contains more information about the server resources and details for each separate database. More details can be found in the examples below.
  3. For JSON reporting, we calculate the diffs of the counters between the sinceTimestamp and the current timestamp. The sinceTimestamp is the previous hour mark at which the data was due to be sent; for simplicity, it is updated even if sending the data failed. For Prometheus, we send raw counts, because Prometheus calculates diffs in its own queries and expects raw diagnostics from our side.
  4. For JSON monitoring, we show only the counters as incremented since the start of the server, just as for the Prometheus diagnostics data (also available through the monitoring page). As a result, its content differs from the reporting data.
  5. The schema and data diagnostics for each database are sent only from the server that is the primary replica of a deployment at the moment of the diagnostics collection. The connection peak values for a database are still reported by a non-primary replica if the database exists, or if transactions were established within the last hour before the database was deleted.
  6. If statistics reporting is turned off in the config, we send a minimal, safe part of the diagnostics data once to notify the receiving side of the moment when the diagnostics were turned off. No user data is shared in this snapshot (see examples below). The snapshot is sent only if the server has been up for 1 hour (so that our CI tests do not report data), and only if the server has not already successfully sent such a snapshot since statistics reporting was last turned off. If sending this snapshot fails, the server tries again after a restart (there is no extra retry logic here).
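
The difference between the reporting diffs (item 3) and the raw monitoring counters (item 4) can be sketched in Python as follows. This is only an illustration, not the server's implementation: the function name `reporting_diff`, the snapshot layout keyed by `(database, action)`, the definition of `failed` as attempted minus successful, and the sample numbers are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical key: (database name, or None for server-wide actions; action name).
CounterKey = Tuple[Optional[str], str]
# Hypothetical value: cumulative (attempted, successful) counts since server start.
Counters = Dict[CounterKey, Tuple[int, int]]


def reporting_diff(previous: Counters, current: Counters) -> Dict[CounterKey, Dict[str, int]]:
    """Turn two cumulative snapshots into per-period successful/failed counts."""
    diff = {}
    for key, (attempted, successful) in current.items():
        # An action that was absent at sinceTimestamp is treated as starting at zero.
        prev_attempted, prev_successful = previous.get(key, (0, 0))
        delta_attempted = attempted - prev_attempted
        delta_successful = successful - prev_successful
        diff[key] = {
            "successful": delta_successful,
            # Assumption: every attempt that did not succeed counts as failed.
            "failed": delta_attempted - delta_successful,
        }
    return diff


# Made-up cumulative counters at sinceTimestamp and at the current timestamp.
since = {("212487319", "TRANSACTION_EXECUTE"): (36, 35)}
now = {("212487319", "TRANSACTION_EXECUTE"): (70, 67)}
print(reporting_diff(since, now))
```

Monitoring and Prometheus would expose the values in `now` unchanged, while the hourly report would carry only the difference.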

Example diagnostics data for Prometheus (http://localhost:4104/metrics?format=prometheus):

# distribution: TypeDB Core
# version: 2.28.0
# os: Mac OS X x86_64 14.2.1

# TYPE server_resources_count gauge
server_resources_count{kind="memoryUsedInBytes"} 68160245760
server_resources_count{kind="memoryAvailableInBytes"} 559230976
server_resources_count{kind="diskUsedInBytes"} 175619862528
server_resources_count{kind="diskAvailableInBytes"} 1819598303232

# TYPE typedb_schema_data_count gauge
typedb_schema_data_count{database="212487319", kind="typeCount"} 74
typedb_schema_data_count{database="212487319", kind="entityCount"} 2891
typedb_schema_data_count{database="212487319", kind="relationCount"} 2466
typedb_schema_data_count{database="212487319", kind="attributeCount"} 5832
typedb_schema_data_count{database="212487319", kind="hasCount"} 13325
typedb_schema_data_count{database="212487319", kind="roleCount"} 7984
typedb_schema_data_count{database="212487319", kind="storageInBytes"} 2164793
typedb_schema_data_count{database="212487319", kind="storageKeyCount"} 94028
typedb_schema_data_count{database="3717486", kind="typeCount"} 5
typedb_schema_data_count{database="3717486", kind="entityCount"} 0
typedb_schema_data_count{database="3717486", kind="relationCount"} 0
typedb_schema_data_count{database="3717486", kind="attributeCount"} 0
typedb_schema_data_count{database="3717486", kind="hasCount"} 0
typedb_schema_data_count{database="3717486", kind="roleCount"} 0
typedb_schema_data_count{database="3717486", kind="storageInBytes"} 0
typedb_schema_data_count{database="3717486", kind="storageKeyCount"} 0

# TYPE typedb_attempted_requests_total counter
typedb_attempted_requests_total{kind="CONNECTION_OPEN"} 4
typedb_attempted_requests_total{kind="DATABASES_ALL"} 4
typedb_attempted_requests_total{kind="DATABASES_GET"} 4
typedb_attempted_requests_total{kind="SERVERS_ALL"} 4
typedb_attempted_requests_total{database="212487319", kind="DATABASES_CONTAINS"} 2
typedb_attempted_requests_total{database="212487319", kind="SESSION_OPEN"} 2
typedb_attempted_requests_total{database="212487319", kind="TRANSACTION_EXECUTE"} 70
typedb_attempted_requests_total{database="212487319", kind="SESSION_CLOSE"} 1
typedb_attempted_requests_total{database="3717486", kind="DATABASES_CONTAINS"} 2
typedb_attempted_requests_total{database="3717486", kind="SESSION_OPEN"} 2
typedb_attempted_requests_total{database="3717486", kind="TRANSACTION_EXECUTE"} 54
typedb_attempted_requests_total{database="3717486", kind="SESSION_CLOSE"} 1

# TYPE typedb_successful_requests_total counter
typedb_successful_requests_total{kind="CONNECTION_OPEN"} 4
typedb_successful_requests_total{kind="DATABASES_ALL"} 4
typedb_successful_requests_total{kind="DATABASES_GET"} 4
typedb_successful_requests_total{kind="SERVERS_ALL"} 4
typedb_successful_requests_total{kind="USER_TOKEN"} 8
typedb_successful_requests_total{database="212487319", kind="DATABASES_CONTAINS"} 2
typedb_successful_requests_total{database="212487319", kind="SESSION_OPEN"} 2
typedb_successful_requests_total{database="212487319", kind="TRANSACTION_EXECUTE"} 67
typedb_successful_requests_total{database="212487319", kind="SESSION_CLOSE"} 1
typedb_successful_requests_total{database="3717486", kind="DATABASES_CONTAINS"} 2
typedb_successful_requests_total{database="3717486", kind="SESSION_OPEN"} 2
typedb_successful_requests_total{database="3717486", kind="TRANSACTION_EXECUTE"} 47
typedb_successful_requests_total{database="3717486", kind="SESSION_CLOSE"} 1

# TYPE typedb_error_total counter
typedb_error_total{database="3717486", code="TYR03"} 5
typedb_error_total{database="3717486", code="TXN08"} 2
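
For readers who want to inspect this output without a Prometheus server, the sample lines above can be read with a few lines of Python. This is a simplified sketch, not a full parser for the Prometheus text format: it assumes every sample has a label set and that label values contain no `}` or escaped quotes, which holds for the example above. The names `parse_samples` and `errors_by_code` are made up for the illustration.

```python
import re
from typing import Dict, Iterator, Tuple

# Matches lines such as: metric_name{label="value", other="x"} 42
# Groups: 1 = metric name, 2 = raw label list, 3 = value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)\{([^}]*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text: str) -> Iterator[Tuple[str, Dict[str, str], float]]:
    """Yield (metric name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and "# TYPE ..." / header comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a labelled sample; ignored in this sketch
        labels = dict(LABEL_RE.findall(match.group(2)))
        yield match.group(1), labels, float(match.group(3))


example = '''
# TYPE typedb_error_total counter
typedb_error_total{database="3717486", code="TYR03"} 5
typedb_error_total{database="3717486", code="TXN08"} 2
'''

errors_by_code = {labels["code"]: value for _, labels, value in parse_samples(example)}
print(errors_by_code)
```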

Example diagnostics JSON data from monitoring (http://localhost:4104/metrics?format=JSON):

{
  "version": 1,
  "deploymentID": "HTAOYJNSRYY2WOUR",
  "serverID": "HTAOYJNSRYY2WOUR",
  "distribution": "TypeDB Core",
  "timestamp": "2024-05-14T09:50:46",
  "server": {
    "version": "2.28.0",
    "uptimeInSeconds": 134,
    "os": {
      "name": "Mac OS X",
      "arch": "x86_64",
      "version": "14.2.1"
    },
    "memoryUsedInBytes": 68151644160,
    "memoryAvailableInBytes": 567832576,
    "diskUsedInBytes": 175619862528,
    "diskAvailableInBytes": 1819598303232
  },
  "load": [
    {
      "database": "212487319",
      "schema": {
        "typeCount": 74
      },
      "data": {
        "entityCount": 2891,
        "relationCount": 2466,
        "attributeCount": 5832,
        "hasCount": 13325,
        "roleCount": 7984,
        "storageInBytes": 2164793,
        "storageKeyCount": 94028
      }
    },
    {
      "database": "3717486",
      "schema": {
        "typeCount": 5
      },
      "data": {
        "entityCount": 0,
        "relationCount": 0,
        "attributeCount": 0,
        "hasCount": 0,
        "roleCount": 0,
        "storageInBytes": 0,
        "storageKeyCount": 0
      }
    }
  ],
  "actions": [
    {
      "name": "CONNECTION_OPEN",
      "attempted": 4,
      "successful": 4
    },
    {
      "name": "DATABASES_ALL",
      "attempted": 4,
      "successful": 4
    },
    {
      "name": "DATABASES_GET",
      "attempted": 4,
      "successful": 4
    },
    {
      "name": "SERVERS_ALL",
      "attempted": 4,
      "successful": 4
    },
    {
      "name": "DATABASES_CONTAINS",
      "database": "212487319",
      "attempted": 2,
      "successful": 2
    },
    {
      "name": "SESSION_OPEN",
      "database": "212487319",
      "attempted": 2,
      "successful": 2
    },
    {
      "name": "TRANSACTION_EXECUTE",
      "database": "212487319",
      "attempted": 70,
      "successful": 67
    },
    {
      "name": "SESSION_CLOSE",
      "database": "212487319",
      "attempted": 1,
      "successful": 1
    },
    {
      "name": "DATABASES_CONTAINS",
      "database": "3717486",
      "attempted": 2,
      "successful": 2
    },
    {
      "name": "SESSION_OPEN",
      "database": "3717486",
      "attempted": 2,
      "successful": 2
    },
    {
      "name": "TRANSACTION_EXECUTE",
      "database": "3717486",
      "attempted": 54,
      "successful": 47
    },
    {
      "name": "SESSION_CLOSE",
      "database": "3717486",
      "attempted": 1,
      "successful": 1
    }
  ],
  "errors": [
    {
      "code": "TYR03",
      "database": "3717486",
      "count": 5
    },
    {
      "code": "TXN08",
      "database": "3717486",
      "count": 2
    }
  ]
}
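
Because the monitoring JSON carries `attempted` and `successful` counters rather than a `failed` field, a consumer has to derive the number of unsuccessful requests itself. A minimal Python sketch, assuming that attempted minus successful is a reasonable approximation (requests still in flight at collection time would also be counted); the function name `unsuccessful_actions` is hypothetical and the payload is a shortened copy of the example above:

```python
import json

monitoring_json = '''
{
  "actions": [
    {"name": "CONNECTION_OPEN", "attempted": 4, "successful": 4},
    {"name": "TRANSACTION_EXECUTE", "database": "212487319", "attempted": 70, "successful": 67},
    {"name": "TRANSACTION_EXECUTE", "database": "3717486", "attempted": 54, "successful": 47}
  ]
}
'''


def unsuccessful_actions(payload: dict) -> list:
    """Return (database, action name, unsuccessful count) for actions with a gap."""
    result = []
    for action in payload.get("actions", []):
        unsuccessful = action["attempted"] - action["successful"]
        if unsuccessful > 0:
            # Server-wide actions carry no "database" field, so this may be None.
            result.append((action.get("database"), action["name"], unsuccessful))
    return result


print(unsuccessful_actions(json.loads(monitoring_json)))
```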

Example of diagnostics JSON data sent when the reporting flag is turned on:

{
  "version":1,
  "deploymentID":"HTAOYJNSRYY2WOUR",
  "serverID":"HTAOYJNSRYY2WOUR",
  "distribution":"TypeDB Core",
  "timestamp":"2024-05-14T09:50:36",
  "periodInSeconds":3600,
  "enabled":true,
  "server":{
    "version":"2.28.0",
    "uptimeInSeconds":124,
    "os":{
      "name":"Mac OS X",
      "arch":"x86_64",
      "version":"14.2.1"
    },
    "memoryUsedInBytes":68097245184,
    "memoryAvailableInBytes":622231552,
    "diskUsedInBytes":175624044544,
    "diskAvailableInBytes":1819594121216
  },
  "load":[
    {
      "database":"212487319",
      "schema":{
        "typeCount":74
      },
      "data":{
        "entityCount":2868,
        "relationCount":2449,
        "attributeCount":5816,
        "hasCount":13247,
        "roleCount":7927,
        "storageInBytes":2164793,
        "storageKeyCount":93379
      },
      "connection":{
        "schemaTransactionPeakCount":0,
        "readTransactionPeakCount":1,
        "writeTransactionPeakCount":1
      }
    },
    {
      "database":"3717486",
      "schema":{
        "typeCount":5
      },
      "data":{
        "entityCount":0,
        "relationCount":0,
        "attributeCount":0,
        "hasCount":0,
        "roleCount":0,
        "storageInBytes":0,
        "storageKeyCount":0
      },
      "connection":{
        "schemaTransactionPeakCount":0,
        "readTransactionPeakCount":2,
        "writeTransactionPeakCount":1
      }
    }
  ],
  "actions":[
    {
      "name":"CONNECTION_OPEN",
      "successful":2,
      "failed":0
    },
    {
      "name":"DATABASES_ALL",
      "successful":2,
      "failed":0
    },
    {
      "name":"DATABASES_GET",
      "successful":2,
      "failed":0
    },
    {
      "name":"SERVERS_ALL",
      "successful":2,
      "failed":0
    },
    {
      "name":"DATABASES_CONTAINS",
      "database":"212487319",
      "successful":1,
      "failed":0
    },
    {
      "name":"SESSION_OPEN",
      "database":"212487319",
      "successful":1,
      "failed":0
    },
    {
      "name":"TRANSACTION_EXECUTE",
      "database":"212487319",
      "successful":32,
      "failed":2
    },
    {
      "name":"SESSION_CLOSE",
      "database":"212487319",
      "successful":1,
      "failed":0
    },
    {
      "name":"DATABASES_CONTAINS",
      "database":"3717486",
      "successful":1,
      "failed":0
    },
    {
      "name":"SESSION_OPEN",
      "database":"3717486",
      "successful":1,
      "failed":0
    },
    {
      "name":"TRANSACTION_EXECUTE",
      "database":"3717486",
      "successful":27,
      "failed":4
    },
    {
      "name":"SESSION_CLOSE",
      "database":"3717486",
      "successful":1,
      "failed":0
    }
  ],
  "errors":[
    {
      "code":"TYR03",
      "database":"3717486",
      "count":3
    },
    {
      "code":"TXN08",
      "database":"3717486",
      "count":1
    }
  ]
}

Example of diagnostics JSON data sent once when the reporting flag is turned off:

{
  "version":1,
  "deploymentID":"HTAOYJNSRYY2WOUR",
  "serverID":"HTAOYJNSRYY2WOUR",
  "distribution":"TypeDB Core",
  "timestamp":"2024-05-14T10:03:53",
  "periodInSeconds":3600,
  "enabled":false,
  "server":{
    "version":"2.28.0"
  }
}
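
The conditions under which this one-off snapshot is sent (item 6 of the list above) can be sketched as follows. This is an illustration in Python under stated assumptions, not the server's code: the `ReportingState` class, its fields, and both function names are hypothetical, and only the payload fields are taken from the example above.

```python
from dataclasses import dataclass

MIN_UPTIME_SECONDS = 3600  # "up for 1 hour", per the description above


@dataclass
class ReportingState:
    """Hypothetical state; the real server's fields are not shown in this description."""
    reporting_enabled: bool
    uptime_seconds: int
    disabled_snapshot_sent: bool  # True once the snapshot was delivered successfully


def should_send_disabled_snapshot(state: ReportingState) -> bool:
    """Send once: reporting is off, the server has run long enough, nothing sent yet."""
    return (
        not state.reporting_enabled
        and state.uptime_seconds >= MIN_UPTIME_SECONDS
        and not state.disabled_snapshot_sent
    )


def build_disabled_snapshot(deployment_id: str, server_id: str, distribution: str,
                            server_version: str, timestamp: str) -> dict:
    """Build the minimal payload: identifiers and version only, no load or actions."""
    return {
        "version": 1,
        "deploymentID": deployment_id,
        "serverID": server_id,
        "distribution": distribution,
        "timestamp": timestamp,
        "periodInSeconds": 3600,
        "enabled": False,
        "server": {"version": server_version},
    }


state = ReportingState(reporting_enabled=False, uptime_seconds=4000, disabled_snapshot_sent=False)
if should_send_disabled_snapshot(state):
    print(build_disabled_snapshot("HTAOYJNSRYY2WOUR", "HTAOYJNSRYY2WOUR",
                                  "TypeDB Core", "2.28.0", "2024-05-14T10:03:53"))
```

In this sketch, `disabled_snapshot_sent` would be set only after a successful send, so a failed attempt is naturally retried the next time the check runs after a restart.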

Implementation

No major refactoring is done here, as a cleaner implementation of this feature is planned for the upcoming 3.0.