API Server

Current Pain Points

Performance

Latency, rate limits, and other performance characteristics are managed opaquely by MongoDB. There's no straightforward way to extend the API server with a caching layer, a rate-limiting layer, or similar middleware.
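As a rough illustration of the kind of layers we cannot add today, here is a minimal sketch of a fixed-window rate limiter and a TTL cache that could sit in front of resolvers on a server we control. All names (`RateLimiter`, `TtlCache`, `Clock`) are hypothetical and the in-memory storage is an assumption; a real deployment would more likely use a shared store.

```typescript
// Hypothetical sketch; none of these names exist in the current codebase.
type Clock = () => number; // current time in milliseconds

// Fixed-window rate limiter: at most `limit` calls per `windowMs` for each key.
class RateLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly now: Clock;
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(limit: number, windowMs: number, now: Clock = () => Date.now()) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;
  }

  allow(key: string): boolean {
    const t = this.now();
    let current = this.windows.get(key);
    // Start a fresh window if none exists or the old one has elapsed.
    if (current === undefined || t - current.start >= this.windowMs) {
      current = { start: t, count: 0 };
      this.windows.set(key, current);
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}

// TTL cache: reuses a loaded value until it is `ttlMs` old, then reloads it.
// V may itself be a Promise, in which case the pending request is shared.
class TtlCache<V> {
  private readonly ttlMs: number;
  private readonly now: Clock;
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(ttlMs: number, now: Clock = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string, load: () => V): V {
    const t = this.now();
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > t) return hit.value;
    const value = load();
    this.entries.set(key, { value, expiresAt: t + this.ttlMs });
    return value;
  }
}
```

The clock is injected so both classes can be tested without waiting on real time.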

Lack of Local Development Tooling

Having local development tools would enable us to run true end-to-end tests without resorting to mocking API calls.

Tests could run faster by utilizing in-memory databases or local Docker instances.
It would simplify the onboarding process for new developers and external contributors by quickly setting up new environments.
Development cycles would be expedited through easier database resetting and seeding.

This has been a requested feature since 2020, with no updates yet:

Local Development Tooling

No API Versioning

API versioning is crucial for maintaining backward compatibility and ensuring that existing users aren't impacted by changes.

No Built-In Pagination

As the database expands, pagination becomes essential. Currently, the workaround is to create custom resolvers, but this leads to compromises and complicates the integration of auto-generated GraphQL with custom resolvers.

This feature has been requested since 2020, with no updates yet:

Pagination in GraphQL
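For reference, cursor-based pagination is small once we own the resolver. The sketch below works over an in-memory array purely for illustration; `paginateIncidents`, `Page`, and the cursor-on-`incident_id` choice are assumptions, not an existing API. Against MongoDB the same idea would likely become a filter on `incident_id` greater than the cursor, combined with a sort and a limit.

```typescript
// Hypothetical sketch of cursor pagination keyed on incident_id.
interface Incident {
  incident_id: number;
  title: string;
}

interface Page<T> {
  items: T[];
  // Pass this back as `after` to fetch the next page; null means no more pages.
  nextCursor: number | null;
}

function paginateIncidents(
  all: Incident[],
  limit: number,
  after: number | null = null,
): Page<Incident> {
  // A stable sort order is what makes the cursor meaningful.
  const sorted = all.slice().sort((a, b) => a.incident_id - b.incident_id);
  const start =
    after === null ? 0 : sorted.findIndex((i) => i.incident_id > after);
  const rest = start === -1 ? [] : sorted.slice(start);
  const items = rest.slice(0, limit);
  const last = items[items.length - 1];
  const hasMore = rest.length > items.length;
  return {
    items,
    nextCursor: hasMore && last !== undefined ? last.incident_id : null,
  };
}
```

A cursor is used rather than an offset so that pages stay consistent when new incidents are inserted between requests.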

No Dynamic GraphQL

Atlas mandates defining a schema for every collection exposed via GraphQL, which leads to several issues:

We're forced to serialize values into the value_json field.
Allowing users to create their taxonomies would require redeploying the entire Atlas App Services project.
Ideally, the taxa collection should be the single source of truth for taxonomy definitions, which the GraphQL schema should then build upon.

This would enable us to construct queries like:

{
    incidents {
        incident_id
        title
        classifications {
            CSET {
                Severity
                Harm_Distribution_Basis
            }
        }
    }
}

Instead of:

{
    classifications {
        namespace
        short_name
        value_json
    }
}
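To make the idea of building the schema from the taxa collection concrete, here is a minimal sketch that turns one taxonomy definition into a GraphQL type definition string. The `Taxonomy` and `TaxonomyField` shapes, the `FieldKind` values, and the `...Classification` naming are all assumptions for illustration; the real taxa documents may be structured differently.

```typescript
// Hypothetical shape of a taxonomy definition; the real taxa documents may differ.
type FieldKind = "string" | "int" | "bool" | "string_list";

interface TaxonomyField {
  short_name: string;
  kind: FieldKind;
}

interface Taxonomy {
  namespace: string;
  fields: TaxonomyField[];
}

const GRAPHQL_TYPES: Record<FieldKind, string> = {
  string: "String",
  int: "Int",
  bool: "Boolean",
  string_list: "[String]",
};

// GraphQL names must match /[_A-Za-z][_0-9A-Za-z]*/, so other characters are
// replaced with underscores and a leading digit gets an underscore prefix.
function toGraphQLName(raw: string): string {
  const cleaned = raw.trim().replace(/[^_0-9A-Za-z]/g, "_");
  return /^[0-9]/.test(cleaned) ? "_" + cleaned : cleaned;
}

// Emit one object type per taxonomy, with one field per taxonomy field.
function taxonomyToSDL(taxonomy: Taxonomy): string {
  const typeName = toGraphQLName(taxonomy.namespace) + "Classification";
  const fieldLines = taxonomy.fields.map(
    (f) => "  " + toGraphQLName(f.short_name) + ": " + GRAPHQL_TYPES[f.kind],
  );
  return ["type " + typeName + " {", ...fieldLines, "}"].join("\n");
}
```

Regenerating these definitions whenever the taxa collection changes would keep that collection as the single source of truth, without a redeploy for each new taxonomy.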

Trigger Nightmares

Triggers often crash when multiple documents are edited simultaneously, with no automated recovery mechanism.

No updates since 2021:

Add Retry Policy to Atlas Trigger
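With our own server, a retry policy for trigger-like jobs could be a few lines. The following is a minimal sketch of retry with capped exponential backoff; `backoffDelays` and `withRetry` are hypothetical helpers, and the delay values are placeholders rather than tuned settings.

```typescript
// Hypothetical helpers; not part of Atlas or the current codebase.

// Delays of baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs, one per retry.
function backoffDelays(retries: number, baseMs: number, maxMs: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < retries; i++) {
    delays.push(Math.min(maxMs, baseMs * Math.pow(2, i)));
  }
  return delays;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });

// Runs `task` once, then retries after each delay until it succeeds or the
// delays run out, in which case the last error is rethrown.
async function withRetry<T>(
  task: () => Promise<T>,
  delays: number[],
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= delays.length; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      const delay = delays[attempt];
      if (delay === undefined) break; // no retries left
      await sleep(delay);
    }
  }
  throw lastError;
}
```

Usage might look like `withRetry(() => handleChange(doc), backoffDelays(4, 100, 5000))`, where `handleChange` stands in for whatever the trigger does today.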

Reusing Types in Custom Resolvers

When adding custom resolvers to the auto-generated GraphQL schema, it's sometimes necessary to return various types of existing objects (Incidents, Reports, Classifications, etc.). However, there's no way to reuse existing definitions, forcing us to manually replicate structures and deviating from the single source of truth principle.

This feature has been planned since 2020 but is still unavailable:

Ability to Reuse Types in Custom Resolver Schemas

Portability

Currently, the backend code can only be deployed to Atlas App Services. Moving away from this would allow us to rely solely on the MongoDB database, which is widely available and supports on-premises instances.

Architecture Overview

Current Architecture

The primary website incidentdatabase.ai consumes the auto-generated GraphQL API from Atlas.

The Atlas App Services API reads from and writes to the MongoDB database. External consumers access the GraphQL API exposed via a Netlify function that wraps the auto-generated API and forwards requests to it.

Transition Architecture

To avoid rewriting the API all at once, we'll progressively transition to our own API. We'll do this by using a similar technique of wrapping the auto-generated GraphQL API and then implementing each query ourselves until we can completely phase out Atlas GraphQL.
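The progressive transition can be pictured as a router that answers a query itself once we have implemented it and forwards everything else to the auto-generated API. This is a minimal sketch under that assumption; `TransitionRouter`, `Resolver`, and `Forwarder` are hypothetical names, and in practice the dispatch would live inside the GraphQL server rather than a standalone class.

```typescript
// Hypothetical sketch of per-query migration away from the auto-generated API.
type Args = Record<string, unknown>;
type Resolver = (args: Args) => Promise<unknown>;
type Forwarder = (field: string, args: Args) => Promise<unknown>;

class TransitionRouter {
  private readonly local = new Map<string, Resolver>();
  private readonly forward: Forwarder;

  // `forward` sends a query to the existing auto-generated GraphQL API.
  constructor(forward: Forwarder) {
    this.forward = forward;
  }

  // Register our own implementation of a top-level query field.
  migrate(field: string, resolver: Resolver): void {
    this.local.set(field, resolver);
  }

  // Fields we already serve ourselves; useful for tracking progress.
  migratedFields(): string[] {
    return Array.from(this.local.keys()).sort();
  }

  // Prefer our implementation; fall back to the wrapped API otherwise.
  resolve(field: string, args: Args = {}): Promise<unknown> {
    const own = this.local.get(field);
    return own !== undefined ? own(args) : this.forward(field, args);
  }
}
```

Once no query reaches the forwarder any more, the wrapped API can be removed without clients noticing a change.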

Final Architecture

Thanks for putting this all down in one place. Conceptually, the transition and final architecture are as we discussed. I think next steps include (but are not limited to):

Scoping out which tools and tech we want to work with (we know it likely includes Apollo, but also some other tools, e.g. https://github.com/thenativeweb/get-graphql-from-jsonschema#readme).
Which queries are candidates for moving over first?
Which queries/functionality are going to be the most problematic? (Definitely the ones involving user data, for other reasons, including auth.)

Re: dynamic GraphQL and taxonomies: we want these schema-ful queries for taxonomies, especially so that users can query the values of classifications via GraphQL, which I think is not possible with the value_json serialization we have.

However, we may also still want to dynamically request any or all classifications without knowing the taxonomy schema, field types, etc., and let the client dynamically dispatch on them. This is especially useful when exploring the data or building generic UIs, e.g. retrieving all classifications annotating incident X. This is one of the good things about our current implementation, except that the client-side handling is too bespoke and our serialization of the fields is unwieldy. We need to do better than value_json.

So we should be able to have something to the effect of:

incidents {
    incident_id
    title
    # Static schema so that known classification fields can be queried *very* easily.
    classifications {
        CSET {
            Severity
            Harm_Distribution_Basis
        }
    }
    # Generic way to query classifications by their fields/values/types.
    classifications_dyn (query: { namespace_in: ["CSETv0", "CSETv1"] } ) {
        namespace
        # Perhaps we define custom enums and types for querying taxonomies, structure reflection?
        attributes (query: { value_type: STRING } ) {
            short_name
            # The evergreen problem: how do we want to serialize a dynamic result better for the client?
            # What about being able to query this field better here?
            value_str
        }
    }
}
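One possible shape for the typed attributes in the query above, as a sketch only: decode each stored value_json once on the server and expose it through typed fields, keeping the raw JSON as a fallback. The field names other than short_name, value_type, value_str, and value_json (that is, value_num, value_bool, value_list) and the helper names are hypothetical.

```typescript
// Hypothetical typed view over a stored attribute; only short_name, value_type,
// value_str and value_json appear in the discussion above.
type ValueType = "STRING" | "NUMBER" | "BOOL" | "LIST" | "OTHER";

interface DynAttribute {
  short_name: string;
  value_type: ValueType;
  value_str: string | null;
  value_num: number | null;
  value_bool: boolean | null;
  value_list: string[] | null;
  value_json: string; // raw value kept so nothing is lost
}

function decodeAttribute(short_name: string, value_json: string): DynAttribute {
  let parsed: unknown;
  try {
    parsed = JSON.parse(value_json);
  } catch {
    parsed = undefined; // malformed JSON falls through to OTHER
  }
  const attr: DynAttribute = {
    short_name,
    value_type: "OTHER",
    value_str: null,
    value_num: null,
    value_bool: null,
    value_list: null,
    value_json,
  };
  if (typeof parsed === "string") {
    attr.value_type = "STRING";
    attr.value_str = parsed;
  } else if (typeof parsed === "number") {
    attr.value_type = "NUMBER";
    attr.value_num = parsed;
  } else if (typeof parsed === "boolean") {
    attr.value_type = "BOOL";
    attr.value_bool = parsed;
  } else if (Array.isArray(parsed) && parsed.every((x: unknown) => typeof x === "string")) {
    attr.value_type = "LIST";
    attr.value_list = parsed as string[];
  }
  return attr;
}

// Mirrors the `attributes(query: { value_type: STRING })` filter in the query above.
function filterByType(attrs: DynAttribute[], type: ValueType): DynAttribute[] {
  return attrs.filter((a) => a.value_type === type);
}
```

Flat nullable fields are used here because GraphQL has no union of scalar types; whether that is the serialization we want is exactly the open question.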

In addition, if we have a static schema for each taxonomy, will updating the taxonomy fields/definitions in any way cause clients to break? Will the taxonomies require versioning? This is also something to think about with regard to allowing dynamic querying of the data.

NOTE: Preparing taxonomies for being represented in our new API world is not a blocker. :-)

responsible-ai-collaborative / aiid

API Server design document #2672