movio / bramble

A federated GraphQL API gateway
https://movio.github.io/bramble/
MIT License
497 stars 55 forks

Feature Request: Run as Lambda #231

Open Quantumplation opened 8 months ago

Quantumplation commented 8 months ago

Hello!

Very very excited about this project, I've been wanting something like this for a long time.

I would love to be able to run `bramble serve` as an AWS Lambda behind an API Gateway; I can see three paths to achieving that (from least desirable to most):

The first I can achieve without anything from the project, but it is fairly brittle in my experience. Any plans (or appetite) to do the latter two, or to accept a pull request for either?

Quantumplation commented 8 months ago

So, I forked the library and played around with running it on my own as a lambda; here are a few things that are making it difficult out of the box:

Would you accept pull requests to change the behavior of any or all of these?

asger-noer commented 6 months ago

Hi,

Again, sorry for the late reply.

The nature of a serverless function is to have no state, and Bramble holds a lot of state about the services it federates. From a purely functional perspective I see the benefits of running low-traffic graph resolvers in a serverless environment, but for Bramble it would introduce some major architectural changes. At the moment I can't see how these can be addressed without severely impacting Bramble's performance. I've outlined a concern with the serverless approach below.

Concerns

Quantumplation commented 6 months ago

We've been running bramble in a lambda for a few months now, on a fork that allows us a bit more access to the internals, and it seems to work great. Lambdas aren't actually stateless, except insomuch as they can be killed under low traffic. In practice, a lambda can live for up to 15 minutes before a new instance is spun up, and AWS handles pre-warming that instance for you if there's existing traffic to the service.

So long as Bramble doesn't rely on persisting something to disk, or very expensive-to-populate caches, I don't see it being fundamentally incompatible.

We've set it up so that it fetches the schema on startup and every couple of minutes, though I've been meaning to add the option to load the constituent / assembled schema from disk or S3 on startup instead of querying it from the downstream APIs; this would be much faster on startup than querying each service, and the schema rarely changes.
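The startup path described above could look something like the following: prefer a pre-assembled snapshot on disk (standing in for S3) for a fast cold start, fall back to fetching from the services, then refresh on a ticker. `fetchSchema`-style hooks here are hypothetical stand-ins, not bramble's actual internals.

```go
package main

import (
	"os"
	"sync"
	"time"
)

// schemaStore holds the current assembled schema behind a lock.
type schemaStore struct {
	mu     sync.RWMutex
	schema string
}

func (s *schemaStore) Set(sdl string) { s.mu.Lock(); s.schema = sdl; s.mu.Unlock() }
func (s *schemaStore) Get() string    { s.mu.RLock(); defer s.mu.RUnlock(); return s.schema }

// loadInitial prefers the snapshot on disk; if it's missing, it falls
// back to fetching from the downstream services (slower cold start).
func loadInitial(store *schemaStore, path string, fetch func() (string, error)) error {
	if b, err := os.ReadFile(path); err == nil {
		store.Set(string(b))
		return nil
	}
	sdl, err := fetch()
	if err != nil {
		return err
	}
	store.Set(sdl)
	return nil
}

// refreshEvery re-fetches the schema on an interval, keeping the
// warm Lambda instance's view of the graph current.
func refreshEvery(store *schemaStore, d time.Duration, fetch func() (string, error), stop <-chan struct{}) {
	t := time.NewTicker(d)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			if sdl, err := fetch(); err == nil {
				store.Set(sdl)
			}
		case <-stop:
			return
		}
	}
}

func main() {}
```

The refresh loop deliberately swallows fetch errors and keeps serving the last good schema, which matches the "rarely changes" observation above.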

I think the ideal setup would be:

asger-noer commented 6 months ago

Do you know how the cold-start problem would translate to cloud providers other than AWS? I think we should be cloud-provider agnostic, and relying on the specific internals of AWS would not be the right way to go. That said, I think your suggestions go a long way towards making sure that doesn't happen.

Load an assembled schema on startup from S3 (or similar), for fast startup

I do like the option to load a pre-assembled schema, preferably by moving the schema assembler into its own package and creating a separate tool for schema assembly.

Re-fetch the schema if we receive some schema related error

Having Bramble re-fetch the schema from a service when it encounters a schema-related error would be a great addition, though we'd have to be careful to protect the downstream services with circuit breakers, back-off, and so on.

Provide an endpoint that our build pipeline could hit when deploying new versions of the service to force the refresh

If the above is implemented, I think this would be somewhat redundant, since we'd issue a service schema re-fetch when encountering a mapping error in the first place. This re-fetch would happen before we even tried to query the downstream service.

I think this should be split into multiple issues, tackling the pre-assembly and schema re-fetch issues individually, since each has its own caveats and problems. What do you think of that?

Quantumplation commented 5 months ago

If the above is implemented, I think this would be somewhat redundant, since we'd issue a service schema re-fetch when encountering a mapping error in the first place. This re-fetch would happen before we even tried to query the downstream service.

Well, we'd attempt the query once, receive an error from one of the downstream services that the query didn't match its schema, and then re-fetch. So the idea behind the rebuild URL was to be a bit pre-emptive/optimistic for speed, but at scale a request is likely to trigger the re-fetch before the pipeline hook gets there anyway, so yeah, it's probably not needed.
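That "attempt once, re-fetch on schema error, retry" flow can be sketched as below. `isSchemaError` and the `execute`/`refresh` hooks are hypothetical stand-ins for whatever bramble would actually expose; detecting a schema mismatch by error text is only for illustration.

```go
package main

import (
	"errors"
	"strings"
)

// errSchemaMismatch simulates a downstream service rejecting a query
// that no longer matches its schema.
var errSchemaMismatch = errors.New("field not defined in schema")

// isSchemaError decides whether an error looks schema-related; a real
// implementation would inspect structured GraphQL error codes instead.
func isSchemaError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "not defined in schema")
}

// executeWithRefresh runs the query once; if a downstream service
// rejects it for schema reasons, it refreshes the schema and retries
// exactly once.
func executeWithRefresh(execute func() (string, error), refresh func() error) (string, error) {
	res, err := execute()
	if !isSchemaError(err) {
		return res, err
	}
	if rerr := refresh(); rerr != nil {
		return "", rerr
	}
	return execute()
}

func main() {}
```

Capping the retry at one keeps a genuinely broken service from looping: if the query still fails after the refresh, the error is surfaced to the caller.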

I think this should be split into multiple issues, tackling the pre-assembly and schema re-fetch issues individually, since each has its own caveats and problems. What do you think of that?

Yeah, absolutely; I'd suggest a total of 4 issues, which I'm happy to create and describe if you agree: