tot-ra / graphql-schema-registry

GraphQL schema registry
MIT License

Adding extra features / schema definition breakdown & sync usage feature #146

Open SirJalias opened 2 years ago

SirJalias commented 2 years ago

Hello @tot-ra ,

I am opening this issue to share the features that the teams at @ManoManoTech have been building over the last few months, following issues #123 and #124. When we started, the repository was at v3; it is now at v4 and has diverged a lot from what we have. We would like to know whether it is worth focusing on merging these features, so I will explain what we have done.

1 - Schema Breakdown

image

To decide whether a breaking change can be allowed, we need to store the fields of every query/mutation/subscription. When a new schema is pushed, it is therefore broken down into parts.

For this we created tables whose names start with type_def_*. Their relationships are explained below.

1 service can contain n operations, defined in type_def_operations with an operation_id.

1 operation stored in type_def_operations can have n parameters, stored in type_def_operation_parameters.

A parameter can be an input field or the response of the operation. This is represented by the is_output field: 0 means it is an input, and 1 means it is the response type.

The table type_def_types stores the name and the type (SCALAR, ENUM, DIRECTIVE, or OBJECT) of every type in all the schemas, and their definitions are stored in type_def_fields.
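The relationships above can be sketched as TypeScript row shapes. This is only an illustration: apart from is_output, the column names of type_def_operation_parameters (operation_id, type_id) are assumptions, and describeOperation is a hypothetical helper, not code from our fork.

```typescript
// Hypothetical row shapes mirroring the type_def_* tables described above.
type OperationType = "QUERY" | "MUTATION" | "SUBSCRIPTION";
type TypeKind = "SCALAR" | "ENUM" | "DIRECTIVE" | "OBJECT";

interface Operation { id: number; name: string; type: OperationType; service_id: number; }
// Assumed columns: only is_output is named in the description above.
interface OperationParameter { operation_id: number; type_id: number; is_output: 0 | 1; }
interface TypeDef { id: number; name: string; type: TypeKind; }

// Look up a type name by id, falling back to "unknown" when it is missing.
function typeNameOf(types: TypeDef[], typeId: number): string {
  for (const t of types) {
    if (t.id === typeId) return t.name;
  }
  return "unknown";
}

// Split an operation's parameters into inputs (is_output = 0) and response (is_output = 1).
function describeOperation(op: Operation, params: OperationParameter[], types: TypeDef[]) {
  const own = params.filter((p) => p.operation_id === op.id);
  return {
    operation: op.name,
    inputs: own.filter((p) => p.is_output === 0).map((p) => typeNameOf(types, p.type_id)),
    output: own.filter((p) => p.is_output === 1).map((p) => typeNameOf(types, p.type_id)),
  };
}

// brand(brandId: Int!, market: String!, platform: String!): Brand!
const knownTypes: TypeDef[] = [
  { id: 2, name: "Int", type: "SCALAR" },
  { id: 3, name: "String", type: "SCALAR" },
  { id: 10, name: "Brand", type: "OBJECT" },
];
const brandOperation: Operation = { id: 3, name: "brand", type: "QUERY", service_id: 4 };
const brandParameters: OperationParameter[] = [
  { operation_id: 3, type_id: 2, is_output: 0 },
  { operation_id: 3, type_id: 3, is_output: 0 },
  { operation_id: 3, type_id: 3, is_output: 0 },
  { operation_id: 3, type_id: 10, is_output: 1 },
];
const brandContract = describeOperation(brandOperation, brandParameters, knownTypes);
```

Here brandContract lists Int, String, String as inputs and Brand as the response type, which is the kind of "contract" the UI can then display.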

So, let’s work through an example, using the data of a request to the push endpoint with a schema for the brands service:

{
  "name": "brands",
  "version": "latest",
  "type_defs": "\n  schema {\n  query: Query\n}\n\ndirective @extends on INTERFACE | OBJECT\n\ndirective @external on FIELD_DEFINITION | OBJECT\n\ndirective @key(fields: String!) on INTERFACE | OBJECT\n\ndirective @provides(fields: String!) on FIELD_DEFINITION\n\ndirective @requires(fields: String!) on FIELD_DEFINITION\n\ntype Brand @key(fields: \"id\") @key(fields: \"brandId\") @key(fields: \"id\") @key(fields: \"brandId\") {\n  brandId: Int!\n  description: String\n  id: ID!\n  logo: String\n  market: String!\n  platform: String!\n  slug: String!\n  title: String!\n}\n\n\n\n\ntype Query {\n  _entities(representations: [_Any!]!): [_Entity]!\n  _service: _Service!\n  brand(brandId: Int!, market: String!, platform: String!): Brand!\n  brands(brandIds: [Int!]!, market: String!, platform: String!): [Brand!]!\n}\n\nscalar _Any\n\nunion _Entity = Brand\n\ntype _Service {\n  \"\"\"The sdl representing the federated service capabilities. Includes federation directives, removes federation types, and includes rest of full schema after schema directives have been applied\"\"\"\n  sdl: String\n}\n\n",
  "url": "http://127.0.0.1:4003/api/graphql/brands"
}

So the values stored in the different tables are:

Services:

| id | name | is_active | updated_time | added_time | url |
| --- | --- | --- | --- | --- | --- |
| 4 | brands | 1 | NULL | 2022-09-01 15:00:20 | http://127.0.0.1:4003/api/graphql/brands |

type_def_operations

| id | name | description | type | service_id |
| --- | --- | --- | --- | --- |
| 1 | _entities | NULL | QUERY | 4 |
| 2 | _service | NULL | QUERY | 4 |
| 3 | brand | NULL | QUERY | 4 |
| 4 | brands | NULL | QUERY | 4 |

type_def_types

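| id | name | description | type |
| --- | --- | --- | --- |
| 1 | _Any | NULL | SCALAR |
| 2 | Int | NULL | SCALAR |
| 3 | String | NULL | SCALAR |
| 4 | ID | NULL | SCALAR |
| 5 | extends | NULL | DIRECTIVE |
| 6 | external | NULL | DIRECTIVE |
| 7 | key | NULL | DIRECTIVE |
| 8 | provides | NULL | DIRECTIVE |
| 9 | requires | NULL | DIRECTIVE |
| 10 | Brand | NULL | OBJECT |
| 11 | _Service | NULL | OBJECT |
| 12 | _Entity | NULL | OBJECT |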

type_def_fields

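| id | name | description | is_nullable | is_array | is_array_nullable | is_deprecated | parent_type_id | children_type_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | fields | NULL | 0 | 0 | 1 | 0 | 7 | 3 |
| 2 | fields | NULL | 0 | 0 | 1 | 0 | 8 | 3 |
| 3 | fields | NULL | 0 | 0 | 1 | 0 | 9 | 3 |
| 4 | brandId | NULL | 0 | 0 | 1 | 0 | 10 | 2 |
| 5 | description | NULL | 1 | 0 | 1 | 0 | 10 | 3 |
| 6 | id | NULL | 0 | 0 | 1 | 0 | 10 | 4 |
| 7 | logo | NULL | 1 | 0 | 1 | 0 | 10 | 3 |
| 8 | market | NULL | 0 | 0 | 1 | 0 | 10 | 3 |
| 9 | platform | NULL | 0 | 0 | 1 | 0 | 10 | 3 |
| 10 | slug | NULL | 0 | 0 | 1 | 0 | 10 | 3 |
| 11 | title | NULL | 0 | 0 | 1 | 0 | 10 | 3 |
| 12 | sdl | The sdl representing the federated service capabilities. Includes federation directives, removes federation types, and includes rest of full schema after schema directives have been applied | 1 | 0 | 1 | 0 | 11 | 3 |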
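To show how the flag columns of type_def_fields map back to SDL notation, here is a small sketch. It assumes that is_nullable describes the element type and is_array_nullable describes the list wrapper, which matches the sample rows (brandId has is_nullable = 0 and is declared as Int! in the pushed schema). The FieldRow shape and renderFieldType are hypothetical names used only for this illustration.

```typescript
// Hypothetical row shape: only the type_def_fields flag columns used here.
interface FieldRow {
  name: string;
  is_nullable: 0 | 1;
  is_array: 0 | 1;
  is_array_nullable: 0 | 1;
}

// Rebuild the SDL type notation of a field from its flags.
// Assumption: is_nullable applies to the element, is_array_nullable to the list itself.
function renderFieldType(field: FieldRow, typeName: string): string {
  const element = field.is_nullable === 1 ? typeName : `${typeName}!`;
  if (field.is_array === 0) {
    return element;
  }
  return field.is_array_nullable === 1 ? `[${element}]` : `[${element}]!`;
}

// Rows 4 and 5 of the sample data (children_type_id 2 = Int, 3 = String).
const brandIdField: FieldRow = { name: "brandId", is_nullable: 0, is_array: 0, is_array_nullable: 1 };
const descriptionField: FieldRow = { name: "description", is_nullable: 1, is_array: 0, is_array_nullable: 1 };

console.log(`${brandIdField.name}: ${renderFieldType(brandIdField, "Int")}`);
console.log(`${descriptionField.name}: ${renderFieldType(descriptionField, "String")}`);
```

With these two rows the sketch yields brandId: Int! and description: String, the same declarations found in the Brand type of the pushed schema.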

This is reflected in the UI like this:

image

If someone wants to know the "contract" of the brands query, clicking on it shows its definition:

image

To give some numbers, our organization has around 30 queries and 60 objects provided by 14 subgraphs, and these numbers are going to increase in the coming months.

2 - Client awareness

The objective of this feature is to show in the UI who (client and version) is using a query or an object in the supergraph.

The architecture is summarized in this diagram:

image

Apollo Gateway receives all the requests the clients perform. Apollo provides a usage reporting plugin that collects the requests made within a period of time; when that period elapses, or the collected data grows larger than the configured value, the report is sent to the schema registry's /api/ingress/traces endpoint.

As you can see, no custom gateway is needed, because this plugs into the tools already available from Apollo.

When the usage report reaches the schema registry and is decoded, the payload looks similar to this:
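The "flush when the period elapses or the data gets too large" behaviour described above can be modelled with a small sketch. This is a simplified illustration and not the Apollo plugin's code: TraceBatcher, maxBatchSize and flushIntervalMs are hypothetical names, size is approximated by string length, and the current time is passed in explicitly to keep the example deterministic.

```typescript
// Simplified model of size-or-interval batching. NOT the Apollo plugin's implementation.
interface BatcherOptions {
  maxBatchSize: number;    // flush once this many characters are buffered (hypothetical)
  flushIntervalMs: number; // flush once this much time has passed since the last flush
}

class TraceBatcher {
  private buffer: string[] = [];
  private bufferedSize = 0;
  private lastFlushAt: number;
  private readonly options: BatcherOptions;
  private readonly send: (batch: string[]) => void;

  constructor(options: BatcherOptions, send: (batch: string[]) => void, startedAt: number) {
    this.options = options;
    this.send = send;
    this.lastFlushAt = startedAt;
  }

  // Buffer one serialized trace and flush if either threshold is reached.
  add(trace: string, now: number): void {
    this.buffer.push(trace);
    this.bufferedSize += trace.length;
    const tooLarge = this.bufferedSize >= this.options.maxBatchSize;
    const tooOld = now - this.lastFlushAt >= this.options.flushIntervalMs;
    if (tooLarge || tooOld) {
      this.flush(now);
    }
  }

  // Send whatever is buffered as ONE request and reset the counters.
  flush(now: number): void {
    if (this.buffer.length > 0) {
      this.send(this.buffer);
    }
    this.buffer = [];
    this.bufferedSize = 0;
    this.lastFlushAt = now;
  }
}

const sentBatches: string[][] = [];
const batcher = new TraceBatcher(
  { maxBatchSize: 10, flushIntervalMs: 1000 },
  (batch) => { sentBatches.push(batch); },
  0
);
batcher.add("aaaa", 0);
batcher.add("bbbb", 100);
batcher.add("cccc", 200); // 12 characters buffered: size threshold reached, one flush
batcher.add("d", 1500);   // 1300 ms since the last flush: interval elapsed, one flush
```

Four traces end up as two requests here, which is why the number of calls to the schema registry depends on the plugin configuration rather than on the number of gateway requests.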

{
  "header": {
    "hostname": "host-name",
    "agentVersion": "apollo-server-core@3.6.3",
    "runtimeVersion": "node v14.18.3",
    "uname": "darwin, Darwin, 21.2.0, x64)",
    "executableSchemaId": "1",
    "graphRef": "current"
  },
  "endTime": {
    "seconds": "1644511107",
    "nanos": 397000000
  },
  "tracesPerQuery": {
    "# homeBrands\nfragment HomeBrands on Brand{__typename brandId id logo title}query homeBrands($platform:Platform!){homepageB2cBrands(platform:$platform){__typename...HomeBrands}}": {
      "trace": [
        {
          "endTime": {
            "seconds": "1644511102",
            "nanos": 737000000
          },
          "startTime": {
            "seconds": "1644511102",
            "nanos": 690000000
          },
          "details": {
            "variablesJson": {
              "platform": ""
            }
          },
          "clientName": "test-gateway-client",
          "clientVersion": "0.0.1",
          "http": {
            "method": "GET"
          },
          "durationNs": "46907164",
          "root": {
            "error": [
              {
                "message": "request to http://127.0.0.1:4002/api/graphql failed, reason: connect ECONNREFUSED 127.0.0.1:4002",
                "json": "{\"message\":\"request to http://127.0.0.1:4002/api/graphql failed, reason: connect ECONNREFUSED 127.0.0.1:4002\",\"type\":\"system\",\"errno\":\"ECONNREFUSED\",\"code\":\"ECONNREFUSED\"}"
              }
            ]
          },
          "fullQueryCacheHit": false,
          "registeredOperation": false,
          "forbiddenOperation": false,
          "queryPlan": {
            "sequence": {
              "nodes": [
                {
                  "fetch": {
                    "serviceName": "graphql",
                    "sentTimeOffset": "13365436",
                    "sentTime": {
                      "seconds": "1644511102",
                      "nanos": 703000000
                    }
                  }
                },
                {
                  "flatten": {
                    "responsePath": [
                      {
                        "fieldName": "brands"
                      },
                      {
                        "fieldName": "@"
                      }
                    ],
                    "node": {}
                  }
                }
              ]
            }
          },
          "fieldExecutionWeight": 1
        }
      ],
      "referencedFieldsByType": {
        "Brand": {
          "fieldNames": [
            "id",
            "brandId",
            "title",
            "logo",
            "__typename"
          ],
          "isInterface": false
        },
        "Query": {
          "fieldNames": [
            "brands"
          ],
          "isInterface": false
        }
      }
    }
  }
}

As you can see, this contains all the information we need for client tracking ("clientName": "test-gateway-client", "clientVersion": "0.0.1"), as well as the query performed.

When the request is received, the schema registry performs these actions:
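As an illustration, the client name, client version and query can be pulled out of the decoded report roughly like this. The interfaces type only the fields that are read, extractClientUsage is a hypothetical helper, and only root-level errors are checked, as in the sample payload.

```typescript
// Minimal typing of the decoded report: only the fields read below.
interface Trace {
  clientName?: string;
  clientVersion?: string;
  root?: { error?: { message: string }[] };
}
interface UsageReport {
  tracesPerQuery: { [querySignature: string]: { trace: Trace[] } | undefined };
}
interface ClientUsage {
  query: string;
  clientName: string;
  clientVersion: string;
  hasError: boolean;
}

// Flatten the report into one record per trace.
function extractClientUsage(report: UsageReport): ClientUsage[] {
  const usages: ClientUsage[] = [];
  for (const query of Object.keys(report.tracesPerQuery)) {
    const entry = report.tracesPerQuery[query];
    if (!entry) continue;
    for (const trace of entry.trace) {
      const errors = trace.root?.error ?? [];
      usages.push({
        query,
        clientName: trace.clientName ?? "unknown",
        clientVersion: trace.clientVersion ?? "unknown",
        hasError: errors.length > 0, // root-level errors only
      });
    }
  }
  return usages;
}

// Trimmed-down version of the payload shown above.
const sampleReport: UsageReport = {
  tracesPerQuery: {
    "# homeBrands\nquery homeBrands{brands{id}}": {
      trace: [
        {
          clientName: "test-gateway-client",
          clientVersion: "0.0.1",
          root: { error: [{ message: "connect ECONNREFUSED 127.0.0.1:4002" }] },
        },
      ],
    },
  },
};
const extractedUsages = extractClientUsage(sampleReport);
```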

  1. Check if the client and the version exist in the database
  2. Calculate a hash of the query performed
  3. Calculate the Redis key, which is grouped by:
    1. client id
    2. hash of the query
    3. timestamp ( by hour )

If the key exists in the Redis store, we increment its operation and error counts according to the message received.

The UI looks like this when the stats button is clicked:
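The steps above can be sketched as follows. Everything here is illustrative: an in-memory object stands in for Redis, djb2 stands in for whatever hash function is actually used, the key format is an assumption, and the client id is taken as already resolved by step 1.

```typescript
// In-memory stand-in for the Redis store; a real setup would use Redis increments.
interface Counters { operations: number; errors: number; }
const usageStore: { [key: string]: Counters | undefined } = {};

// Stand-in hash (djb2); which hash function is really used is not specified above.
function hashQuery(query: string): string {
  let hash = 5381;
  for (let i = 0; i < query.length; i++) {
    hash = ((hash * 33) ^ query.charCodeAt(i)) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash.toString(16);
}

// Key grouped by client id, query hash and hour. The exact format is an assumption.
function usageKey(clientId: number, query: string, epochSeconds: number): string {
  const hourBucket = Math.floor(epochSeconds / 3600);
  return `usage:${clientId}:${hashQuery(query)}:${hourBucket}`;
}

// Increment the operation counter, and the error counter when the trace had an error.
function recordUsage(clientId: number, query: string, epochSeconds: number, hasError: boolean): Counters {
  const key = usageKey(clientId, query, epochSeconds);
  const counters = usageStore[key] ?? { operations: 0, errors: 0 };
  counters.operations += 1;
  if (hasError) {
    counters.errors += 1;
  }
  usageStore[key] = counters;
  return counters;
}

const homeBrandsQuery = "query homeBrands{brands{id}}";
recordUsage(1, homeBrandsQuery, 1644511102, true);
const sameHourCounters = recordUsage(1, homeBrandsQuery, 1644511107, false);
const nextHourCounters = recordUsage(1, homeBrandsQuery, 1644511102 + 3600, false);
```

Both calls within the same hour land on the same key, so sameHourCounters holds 2 operations and 1 error, while the call one hour later starts a new counter.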

image

In terms of scale, if the gateway receives 5k requests, it does not make 5k requests to the schema registry; how the information from those requests is grouped depends on the configuration of the usage reporting plugin.

Right now we have found some bugs, and we know we need to dedicate time to fixing them and enhancing this feature, or else to making some architectural changes, such as using Kafka so that no usage message is lost and the main thread of the schema registry does not have to process all the information.

As for the load on our Apollo gateway in production right now, it is not too high: 2.5k req/min.

3 - Breaking change control

Since we know which data is used by the clients, we can control pushes of a new version of a subgraph schema: if the new version contains a breaking change, the push is allowed only if no clients are using the affected data; otherwise the request is rejected.

Finally, if you got to the end of this, I know there is a lot to digest. First of all, thank you. We are very interested in your opinion on putting all of this together with what is in v4, and in deciding how to move forward. If following this issue becomes too difficult, we also propose a meeting to align.
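A minimal sketch of that decision, assuming the breaking change is a removed field and that usage is available as (type, field, client) records. FieldRef, FieldUsage and checkBreakingChanges are hypothetical names, and other kinds of breaking changes (type or nullability changes) are not modelled.

```typescript
// Hypothetical shapes: a field removed by the new schema, and a usage record from client tracking.
interface FieldRef { typeName: string; fieldName: string; }
interface FieldUsage extends FieldRef { clientName: string; clientVersion: string; }
interface PushDecision { accepted: boolean; blockingUsages: FieldUsage[]; }

// Accept the push only if no removed field is still used by any client.
function checkBreakingChanges(removed: FieldRef[], usages: FieldUsage[]): PushDecision {
  const blockingUsages = usages.filter((usage) =>
    removed.some(
      (field) => field.typeName === usage.typeName && field.fieldName === usage.fieldName
    )
  );
  return { accepted: blockingUsages.length === 0, blockingUsages };
}

const trackedUsages: FieldUsage[] = [
  { typeName: "Brand", fieldName: "logo", clientName: "test-gateway-client", clientVersion: "0.0.1" },
];

// Removing Brand.logo is rejected because a client still uses it.
const removeLogo = checkBreakingChanges([{ typeName: "Brand", fieldName: "logo" }], trackedUsages);
// Removing Brand.slug is accepted because no tracked client uses it.
const removeSlug = checkBreakingChanges([{ typeName: "Brand", fieldName: "slug" }], trackedUsages);
```

Returning the blocking usages along with the decision makes it possible to tell the team pushing the schema which clients and versions they would break.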

Thank you very much

tot-ra commented 2 years ago

Hey. Thanks for the very informative post!

✅ I think the UI part is good and something we do want to reach, to get detailed, per-property information.

❓ The Schemas column in the UI is a bit confusing. As I understand it, it's a link to a service that defines it? Or is it a link to Query.brands? If it's a service, I'd name the column accordingly.

Screenshot 2022-09-12 at 14 11 06

❓ What happens if you have a brands service, but an images service, for example, extends type Brand? Will you need to show multiple services linked to the type in the Schemas column of the UI? Or is it going to be under the specific Brand.url property?

❓ I see you have type_def_operations tied to the service table, but I don't think that's how it should be: one query can hit multiple federated services. But I guess we can discuss the final DB structure in the PR itself, not here.

called usage reporting plugin, this plugin will take all the requests within a period of time and when this time comes or the data is larger than the configured value, it is sent to the schema registry to the /api/ingress/traces endpoint

Regarding architecture, I understand that this plugin seems useful, and in small projects it's simple to integrate. But putting it into the gateway seems like a risky move to me, because:

⚠️ If it makes sync requests for every operation, it can overload the schema-registry/DB, turning it into more of a real-time service, which we don't particularly want. This can cause the service to fail to respond to /schema/compose and /schema/latest requests, which is very bad.

⚠️ If it does aggregation or sampling (throwing away some queries) before it makes requests to the schema-registry, then we won't get sufficiently detailed information about usage.

That's why we're using an async query processor - https://github.com/pipedrive/graphql-schema-registry/tree/master/src/worker/analyzeQueries - where you can control the load / processing speed yourself. It doesn't have performance data (speed of queries), of course, but I think it's a better architecture, though somewhat more complex for smaller projects, as it needs an event bus (Kafka) that is responsible for storing queries. It can extract the client name/version here https://github.com/pipedrive/graphql-schema-registry/blob/master/src/worker/analyzeQueries/index.ts#L77 if it's passed down from the gateway's headers.

Screenshot 2022-09-12 at 14 12 35

Having said that, I think we can accept your sync solution (the plugin) into v4 only if the UI can, in the end, show and work with both sync and async data sources. So for the end user, it should be a simple choice:

So I'd ask you to check and change the async worker to add/update data in your DB tables too (see examples for the setup), so that you can see usage in your views. (Or we can collaborate on the same PR.)

P.S. I wonder how you are going to show in the UI types that migrate from one service to another, and tie that to the usage too.