Adding extra features / schema definition breakdown & sync usage feature

Hello @tot-ra ,

I am opening this issue to share the features that the teams at @ManoManoTech have built over the last few months, following issues #123 & #124. When we started, the repository was at v3; it is now at v4 and has diverged a lot from what we have. We would like to know whether it is worth the effort to join the features, so I will explain what we have done.

1 - Schema Breakdown

To know whether a breaking change can be allowed, we need to store the individual fields of every query/mutation/subscription. When a new schema is pushed, it is broken down into these parts.

For this we created tables whose names start with type_def_*; their relationships are explained below.

1 service can contain n operations, stored in type_def_operations, each with an operation_id.

1 operation stored in type_def_operations can have n parameters, stored in type_def_operation_parameters.

A parameter can be an input field or the response of the operation. This is represented by the is_output field: 0 means it is an input, and 1 means it is the response type.

The type_def_types table stores the name and the kind (SCALAR, ENUM, DIRECTIVE, or OBJECT) of every type in the schemas, and the definition of each type is stored in type_def_fields.
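
As a rough illustration of these relationships, here is a minimal Python sketch. Only the table names and the is_output flag come from the description above; the other column names of type_def_operation_parameters (name, type_id) and the contract helper are assumptions made for the example, not the actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Operation:
    """One row of type_def_operations."""
    id: int
    name: str
    type: str          # e.g. "QUERY"
    service_id: int


@dataclass
class OperationParameter:
    """One row of type_def_operation_parameters (column names assumed)."""
    operation_id: int
    name: str
    type_id: int       # assumed reference to type_def_types.id
    is_output: int     # 0 = input, 1 = response type


def contract(operation, parameters):
    """Split the parameters of one operation into inputs and response."""
    own = [p for p in parameters if p.operation_id == operation.id]
    inputs = [p for p in own if p.is_output == 0]
    outputs = [p for p in own if p.is_output == 1]
    return inputs, outputs


# brand(brandId: Int!, market: String!, platform: String!): Brand!
brand = Operation(id=3, name="brand", type="QUERY", service_id=4)
params = [
    OperationParameter(3, "brandId", 2, 0),
    OperationParameter(3, "market", 3, 0),
    OperationParameter(3, "platform", 3, 0),
    OperationParameter(3, "Brand", 10, 1),
]
inputs, outputs = contract(brand, params)
print([p.name for p in inputs], [p.name for p in outputs])
```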

So, let's walk through an example, using the data of a request to the push endpoint with a schema for the brands service:

{
  "name": "brands",
  "version": "latest",
  "type_defs": "\n  schema {\n  query: Query\n}\n\ndirective @extends on INTERFACE | OBJECT\n\ndirective @external on FIELD_DEFINITION | OBJECT\n\ndirective @key(fields: String!) on INTERFACE | OBJECT\n\ndirective @provides(fields: String!) on FIELD_DEFINITION\n\ndirective @requires(fields: String!) on FIELD_DEFINITION\n\ntype Brand @key(fields: \"id\") @key(fields: \"brandId\") @key(fields: \"id\") @key(fields: \"brandId\") {\n  brandId: Int!\n  description: String\n  id: ID!\n  logo: String\n  market: String!\n  platform: String!\n  slug: String!\n  title: String!\n}\n\n\n\n\ntype Query {\n  _entities(representations: [_Any!]!): [_Entity]!\n  _service: _Service!\n  brand(brandId: Int!, market: String!, platform: String!): Brand!\n  brands(brandIds: [Int!]!, market: String!, platform: String!): [Brand!]!\n}\n\nscalar _Any\n\nunion _Entity = Brand\n\ntype _Service {\n  \"\"\"The sdl representing the federated service capabilities. Includes federation directives, removes federation types, and includes rest of full schema after schema directives have been applied\"\"\"\n  sdl: String\n}\n\n",
  "url": "http://127.0.0.1:4003/api/graphql/brands"
}

So the values stored in the different tables are:

Services:

id	name	is_active	updated_time	added_time	url
4	brands	1	NULL	2022-09-01 15:00:20	http://127.0.0.1:4003/api/graphql/brands

type_def_operations

id	name	description	type	service_id
1	_entities	NULL	QUERY	4
2	_service	NULL	QUERY	4
3	brand	NULL	QUERY	4
4	brands	NULL	QUERY	4

type_def_types

id	name	description	type
1	_Any	NULL	SCALAR
2	Int	NULL	SCALAR
3	String	NULL	SCALAR
4	ID	NULL	SCALAR
5	extends	NULL	DIRECTIVE
6	external	NULL	DIRECTIVE
7	key	NULL	DIRECTIVE
8	provides	NULL	DIRECTIVE
9	requires	NULL	DIRECTIVE
10	Brand	NULL	OBJECT
11	_Service	NULL	OBJECT
12	_Entity	NULL	OBJECT

type_def_fields

id	name	description	is_nullable	is_array_nullable	parent_type_id	children_type_id
1	fields	NULL	0	1	7	3
2	fields	NULL	0	1	8	3
3	fields	NULL	0	1	9	3
4	brandId	NULL	0	1	10	2
5	description	NULL	1	1	10	3
6	id	NULL	0	1	10	4
7	logo	NULL	1	1	10	3
8	market	NULL	0	1	10	3
9	platform	NULL	0	1	10	3
10	slug	NULL	0	1	10	3
11	title	NULL	0	1	10	3
12	sdl	The sdl representing the federated service capabilities. Includes federation directives, removes federation types, and includes rest of full schema after schema directives have been applied	1	1	11	3
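
The is_nullable and is_array_nullable values in the rows above can be derived from the type reference in the SDL. The sketch below is an assumption about how that could work, not the registry's code: it treats is_nullable as describing the named type and is_array_nullable as describing the list wrapper, defaulting to 1 for non-list types as the rows above do. How list-typed fields are really stored is not shown in this issue, and nested lists are not handled.

```python
def parse_type_ref(type_ref: str) -> dict:
    """Break a GraphQL type reference such as "[Brand!]!" into flags."""
    ref = type_ref.strip()
    is_array_nullable = 1  # default used for non-list types in the rows above
    if ref.startswith("["):
        # A trailing "!" after the closing bracket makes the list non-null.
        is_array_nullable = 0 if ref.endswith("]!") else 1
        ref = ref.rstrip("!")[1:-1]  # drop the list wrapper
    # A trailing "!" on the named type makes the value itself non-null.
    is_nullable = 0 if ref.endswith("!") else 1
    return {
        "type_name": ref.rstrip("!"),
        "is_nullable": is_nullable,
        "is_array_nullable": is_array_nullable,
    }


for ref in ("Int!", "String", "[Brand!]!", "[_Entity]!"):
    print(ref, parse_type_ref(ref))
```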

This is reflected in the UI like this:

If someone wants to know the "contract" of the brands query, clicking on it shows its definition:

To give some numbers: our organization has around 30 queries and 60 objects provided by 14 subgraphs, and these numbers will increase in the coming months.

2 - Client awareness

The objective of this feature is to show in the UI which client (name and version) is using a query or an object in the supergraph.

The architecture is summarized in this diagram:

Apollo Gateway receives all the requests the clients perform. Apollo provides a usage reporting plugin that collects the requests received within a period of time; when that period elapses, or the collected data grows larger than the configured value, the report is sent to the schema registry at the /api/ingress/traces endpoint.

So, as you can see, no custom gateway is needed, because this plugs into the tools already available from Apollo.

When the usage report reaches the schema registry and is decoded, the payload looks similar to this:

{
  "header": {
    "hostname": "host-name",
    "agentVersion": "apollo-server-core@3.6.3",
    "runtimeVersion": "node v14.18.3",
    "uname": "darwin, Darwin, 21.2.0, x64)",
    "executableSchemaId": "1",
    "graphRef": "current"
  },
  "endTime": {
    "seconds": "1644511107",
    "nanos": 397000000
  },
  "tracesPerQuery": {
    "# homeBrands\nfragment HomeBrands on Brand{__typename brandId id logo title}query homeBrands($platform:Platform!){homepageB2cBrands(platform:$platform){__typename...HomeBrands}}": {
      "trace": [
        {
          "endTime": {
            "seconds": "1644511102",
            "nanos": 737000000
          },
          "startTime": {
            "seconds": "1644511102",
            "nanos": 690000000
          },
          "details": {
            "variablesJson": {
              "platform": ""
            }
          },
          "clientName": "test-gateway-client",
          "clientVersion": "0.0.1",
          "http": {
            "method": "GET"
          },
          "durationNs": "46907164",
          "root": {
            "error": [
              {
                "message": "request to http://127.0.0.1:4002/api/graphql failed, reason: connect ECONNREFUSED 127.0.0.1:4002",
                "json": "{\"message\":\"request to http://127.0.0.1:4002/api/graphql failed, reason: connect ECONNREFUSED 127.0.0.1:4002\",\"type\":\"system\",\"errno\":\"ECONNREFUSED\",\"code\":\"ECONNREFUSED\"}"
              }
            ]
          },
          "fullQueryCacheHit": false,
          "registeredOperation": false,
          "forbiddenOperation": false,
          "queryPlan": {
            "sequence": {
              "nodes": [
                {
                  "fetch": {
                    "serviceName": "graphql",
                    "sentTimeOffset": "13365436",
                    "sentTime": {
                      "seconds": "1644511102",
                      "nanos": 703000000
                    }
                  }
                },
                {
                  "flatten": {
                    "responsePath": [
                      {
                        "fieldName": "brands"
                      },
                      {
                        "fieldName": "@"
                      }
                    ],
                    "node": {}
                  }
                }
              ]
            }
          },
          "fieldExecutionWeight": 1
        }
      ],
      "referencedFieldsByType": {
        "Brand": {
          "fieldNames": [
            "id",
            "brandId",
            "title",
            "logo",
            "__typename"
          ],
          "isInterface": false
        },
        "Query": {
          "fieldNames": [
            "brands"
          ],
          "isInterface": false
        }
      }
    }
  }
}

As you can see, the payload contains all the information we need for client tracking ("clientName": "test-gateway-client", "clientVersion": "0.0.1"), as well as the query performed.
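
As a sketch of how those values can be pulled out of a decoded report shaped like the payload above (the summarize_report helper and its output keys are hypothetical, and only errors attached to the root node are counted, as in the sample):

```python
def summarize_report(report: dict) -> list:
    """Return one row per trace: client name, client version, query, error flag."""
    rows = []
    for query, data in report.get("tracesPerQuery", {}).items():
        for trace in data.get("trace", []):
            errors = trace.get("root", {}).get("error", [])
            rows.append({
                "client_name": trace.get("clientName"),
                "client_version": trace.get("clientVersion"),
                "query": query,
                "has_errors": len(errors) > 0,
            })
    return rows


# Trimmed-down version of the decoded payload shown above.
sample_report = {
    "tracesPerQuery": {
        "# homeBrands\nquery homeBrands{brands{id}}": {
            "trace": [
                {
                    "clientName": "test-gateway-client",
                    "clientVersion": "0.0.1",
                    "root": {"error": [{"message": "connect ECONNREFUSED"}]},
                }
            ]
        }
    }
}
rows = summarize_report(sample_report)
print(rows[0]["client_name"], rows[0]["client_version"], rows[0]["has_errors"])
```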

When the request is received, the schema registry performs these actions:

Check if the client and the version exist in the database
Calculate a hash of the query performed
Calculate the Redis key, which is grouped by:

client id
hash of the query
timestamp ( by hour )
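
A minimal sketch of such a key, assuming SHA-256 for the query hash and a colon-separated key layout. Neither is stated in this issue, and the function names are hypothetical.

```python
import hashlib


def query_hash(query: str) -> str:
    """Hash the query text (SHA-256 is an assumption)."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


def hour_bucket(epoch_seconds: int) -> int:
    """Truncate a Unix timestamp to the start of its hour."""
    return epoch_seconds - (epoch_seconds % 3600)


def usage_key(client_id: int, query: str, epoch_seconds: int) -> str:
    """Build a key from client id, query hash and hourly timestamp."""
    return f"usage:{client_id}:{query_hash(query)}:{hour_bucket(epoch_seconds)}"


# Two traces of the same client and query within the same hour share a key.
k1 = usage_key(7, "query homeBrands{brands{id}}", 1644511102)
k2 = usage_key(7, "query homeBrands{brands{id}}", 1644511107)
print(k1 == k2)
```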

So if the key already exists in the Redis store, we increment the operation and error counters according to the message received.

The UI looks like this when the stats button is clicked:

In terms of scale, if the gateway receives 5k requests, this does not result in 5k requests to the schema registry; the number depends on how the usage reporting plugin is configured to group the information from those requests.

So far we have found some bugs, and we know we need to dedicate time to fixing them and enhancing this feature, or else to making architectural changes, such as using Kafka, so that no usage message is lost and the main thread of the schema registry does not have to process all the information.

The load on our Apollo gateway in production is currently not very high: 2.5k req/min.
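
The counting step can be sketched as follows. A plain dictionary stands in for Redis here so the example runs on its own; the class name and the counter names "operations" and "errors" are assumptions. With a real Redis store the same idea would typically map to an increment command such as HINCRBY on a hash stored under the key.

```python
from collections import defaultdict


class UsageCounters:
    """In-memory stand-in for the Redis store, for illustration only."""

    def __init__(self):
        # A missing key starts with both counters at zero.
        self._store = defaultdict(lambda: {"operations": 0, "errors": 0})

    def record(self, key: str, has_errors: bool) -> dict:
        """Count one operation under the key, and one error if it failed."""
        entry = self._store[key]
        entry["operations"] += 1
        if has_errors:
            entry["errors"] += 1
        return dict(entry)


counters = UsageCounters()
counters.record("client-1:abc:2022-02-10T16", has_errors=False)
latest = counters.record("client-1:abc:2022-02-10T16", has_errors=True)
print(latest)
```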

3 - Breaking change control

Since we know which data the clients use, we can check for breaking changes when a new version of a subgraph schema is pushed: a breaking change is accepted only if no client is using the affected data; otherwise the request is rejected.

Finally, if you got to the end of this, first of all, thank you; we know there is a lot to digest. We are very interested in your opinion on putting all of this together with what is in v4, and on how to move forward. If following this issue becomes too difficult, we also propose having a meeting to get aligned.
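
A simplified sketch of that check, covering only removed fields (other kinds of breaking change, such as type or nullability changes, are left out). The function names and the dictionary shapes are hypothetical; the usage data is assumed to look like referencedFieldsByType in the usage report.

```python
def removed_fields(old: dict, new: dict) -> dict:
    """Fields present in the old schema but missing from the new one, per type."""
    removed = {}
    for type_name, fields in old.items():
        gone = set(fields) - set(new.get(type_name, ()))
        if gone:
            removed[type_name] = gone
    return removed


def blocking_usages(old: dict, new: dict, used: dict) -> dict:
    """Removed fields that at least one client still uses, per type."""
    blocked = {}
    for type_name, gone in removed_fields(old, new).items():
        in_use = gone & set(used.get(type_name, ()))
        if in_use:
            blocked[type_name] = sorted(in_use)
    return blocked


def can_push(old: dict, new: dict, used: dict) -> bool:
    """Accept the push only if no removed field is still in use."""
    return not blocking_usages(old, new, used)


old_schema = {"Brand": ["brandId", "id", "logo", "slug", "title"]}
used_by_clients = {"Brand": ["id", "brandId", "title", "logo", "__typename"]}

drop_slug = {"Brand": ["brandId", "id", "logo", "title"]}
drop_logo = {"Brand": ["brandId", "id", "slug", "title"]}
print(can_push(old_schema, drop_slug, used_by_clients))
print(blocking_usages(old_schema, drop_logo, used_by_clients))
```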

Thank you very much
