kamilkisiela / graphql-hive

GraphQL Hive is a schema registry and observability platform
https://the-guild.dev/graphql/hive
MIT License

hive as a persisted operations/documents store #659

Closed n1ru4l closed 3 days ago

n1ru4l commented 1 year ago

Having GraphQL Hive act as a persisted operations store would be great.

Critical user flow

Nice to have (stretch goals; follow up tasks)


Allow users to decide whether to do breaking change detection based on app deployments or usage data.

Delete persisted operation deployment flow

  1. Drop the persisted operation deployment via UI or CLI
  2. Schedule Async Task for actually deleting the persisted operation documents from S3
  3. (optional) Re-check for conditional breaking changes

We must figure out how to incorporate the persisted operation schema coordinates into the hive check/breaking change detection flow (usage data). As long as a persisted operation deployment is active, its fields count as being in use, even if there is no data within the retention period. Once a persisted operation deployment has been deleted or marked as retired/inactive, removal of that deployment's schema coordinates is no longer blocked by it.
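In pseudo-code, the blocking rule described above might look like the following sketch (all names are illustrative, not Hive's actual check logic):

```typescript
// Sketch: removing a schema coordinate is blocked while it appears in
// recent usage data OR in any active persisted operation deployment.
// All names here are illustrative, not part of the Hive codebase.
type RemovalCheckInput = {
  coordinate: string;
  /** coordinates observed within the usage retention window */
  usedCoordinates: Set<string>;
  /** coordinates referenced by any still-active deployment */
  activeDeploymentCoordinates: Set<string>;
};

function isRemovalBlocked(input: RemovalCheckInput): boolean {
  return (
    input.usedCoordinates.has(input.coordinate) ||
    input.activeDeploymentCoordinates.has(input.coordinate)
  );
}
```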

Based on the usage data, we can notify users when a client version seems unused (e.g. old mobile client).

Documentation

Details

Some ideas on how to store stuff...

S3 Key Structure

Here we write the GraphQL documents for as long as the deployment is active. We need to ensure they are removed from S3 once the deployment becomes inactive, so a transactional background job seems inevitable.

persisted/{orgId}/{project}/{target}/{client}/{clientVersion}/{operationHash}
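A small helper for assembling this key could look like the following sketch (the segment names simply mirror the pattern above and are not an actual Hive API):

```typescript
// Sketch of building the S3 key for a persisted document from the
// pattern persisted/{orgId}/{project}/{target}/{client}/{clientVersion}/{operationHash}.
// All names here are illustrative.
type PersistedDocumentKeyParts = {
  orgId: string;
  project: string;
  target: string;
  client: string;
  clientVersion: string;
  operationHash: string;
};

function buildPersistedDocumentKey(p: PersistedDocumentKeyParts): string {
  return [
    "persisted",
    p.orgId,
    p.project,
    p.target,
    p.client,
    p.clientVersion,
    p.operationHash,
  ].join("/");
}
```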

SQL

CREATE TABLE "persisted_document_deployments" (
  "id" uuid PRIMARY KEY,
  "target_id" uuid NOT NULL,
  "client_name" text NOT NULL,
  "client_version" text NOT NULL,
  "is_active" boolean -- if it is active you should not be able to add new operations to it
);

CREATE TABLE "persisted_documents" (
  "id" uuid,
  "persisted_document_deployment_id" uuid REFERENCES "persisted_document_deployments"("id"),
  "hash" text NOT NULL,
  "operation_document" text,
  "document_s3_location" text NOT NULL, -- we should store a reference (in case we at some point have to change the key structure/pattern)
  "schema_coordinates" text[], -- see notes
  "created_at" TIMESTAMPTZ NOT NULL DEFAULT NOW() -- this column is most likely unnecessary
);

-- Nothing below is strictly required for the initial version - but it could help with breaking change detection...

CREATE INDEX "persisted_documents_pagination" on "persisted_documents" USING GIN ("schema_coordinates");

-- get list of all operations that are related to a set of schema coordinates
SELECT
  "persisted_documents"."hash"
FROM
  "persisted_documents"
  INNER JOIN
    "persisted_document_deployments"
      ON "persisted_document_deployments"."id" = "persisted_documents"."persisted_document_deployment_id"
WHERE
  "persisted_document_deployments"."is_active" = TRUE
  AND "persisted_documents"."schema_coordinates" && '{A.foo,B.ff}';

When a deployment has been created and "frozen", we could generate a schema coordinate ---> hash mapping for quick lookups of which operations a schema coordinate impacts. 🤔 Alternatively, we could execute the SQL live for each active deployment.
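The precomputed mapping could be as simple as inverting the per-document coordinate lists once, at freeze time (a sketch; the names are illustrative):

```typescript
// Sketch: invert document -> schema coordinates into
// schema coordinate -> document hashes, computed once when a
// deployment is frozen. Names are illustrative.
type PersistedDocument = { hash: string; schemaCoordinates: string[] };

function buildCoordinateIndex(
  docs: PersistedDocument[],
): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const doc of docs) {
    for (const coordinate of doc.schemaCoordinates) {
      let hashes = index.get(coordinate);
      if (!hashes) {
        hashes = new Set<string>();
        index.set(coordinate, hashes);
      }
      hashes.add(doc.hash);
    }
  }
  return index;
}
```

A breaking-change check then only needs a map lookup per affected coordinate instead of a live query per deployment.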

Unsure whether we should store all the schema coordinates used within a document alongside the document. 🤔

PROs:


Links:

kamilkisiela commented 1 year ago

It could also show a complexity score next to each document.

kamilkisiela commented 1 year ago

Could also reject documents with complexity higher than X
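Such a gate could run at upload time. As a stand-in for a real complexity calculator (which would score parsed fields, depth, etc.), here is a naive nesting-depth heuristic over the raw document text (illustrative only, not Hive's scoring):

```typescript
// Naive stand-in for a complexity score: the maximum selection-set
// nesting depth, counted from braces in the raw document text.
// A real implementation would parse the document; illustrative only.
function naiveComplexity(document: string): number {
  let depth = 0;
  let maxDepth = 0;
  for (const ch of document) {
    if (ch === "{") maxDepth = Math.max(maxDepth, ++depth);
    else if (ch === "}") depth--;
  }
  return maxDepth;
}

function rejectIfTooComplex(document: string, maxAllowed: number): void {
  const score = naiveComplexity(document);
  if (score > maxAllowed) {
    throw new Error(`Document complexity ${score} exceeds limit ${maxAllowed}`);
  }
}
```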

n1ru4l commented 1 year ago

S3 could be used as a schema registry

kamilkisiela commented 1 year ago

Yes and Hive should control it all

n1ru4l commented 1 year ago

There's some analytics stuff we can do here as well.

e.g. display, over time, how many bytes were saved on client <-> server requests by using persisted operations
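That metric could, for example, be estimated as (document size minus hash size) times request count (a rough sketch with illustrative names; it ignores compression and transport overhead):

```typescript
// Rough sketch: estimate request-payload bytes saved by sending a
// document hash instead of the full document. Illustrative only;
// ignores compression, headers, and other transport overhead.
function estimateBytesSaved(
  documentSizeBytes: number,
  hashSizeBytes: number,
  requestCount: number,
): number {
  return Math.max(0, documentSizeBytes - hashSizeBytes) * requestCount;
}
```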

kamilkisiela commented 1 year ago

Plus, some of the data processing (related to the usage reporting pipeline) could be done ahead of time, and the structure of the usage report could be quite different - much, much smaller in size (and more performant on the user side, since no processing of documents is involved).

n1ru4l commented 3 days ago

https://the-guild.dev/graphql/hive/product-updates/2024-07-30-persisted-documents-app-deployments-preview