movio / bramble

A federated GraphQL API gateway
https://movio.github.io/bramble/
MIT License
497 stars 55 forks

Feature Request: Run as Lambda #231

Open Quantumplation opened 8 months ago

Quantumplation commented 8 months ago

Hello!

Very very excited about this project, I've been wanting something like this for a long time.

I would love to be able to run `bramble serve` as an AWS Lambda behind an API Gateway; I can see three paths to achieving that (from least desirable to most):

The first I can achieve without anything from the project, but it is fairly brittle in my experience. Any plans (or appetite) to do the latter two, or to accept a pull request for either?

Quantumplation commented 8 months ago

So, I forked the library and played around with running it on my own as a lambda; here are a few things that are making it difficult out of the box:

Would you accept pull requests to change the behavior of any or all of these?

asger-noer commented 6 months ago

Hi,

Again, sorry for the late reply.

The nature of a serverless function is to have no state, and Bramble holds a lot of state about the services it federates. From a purely functional perspective I see the benefits of running low-traffic graph resolvers in a serverless environment, but for Bramble it would introduce some major architectural changes. At the moment I can't see how these can be addressed without severely impacting Bramble's performance. I've outlined a concern with the serverless approach below.

Concerns

Quantumplation commented 6 months ago

We've been running bramble in a lambda for a few months now, on a fork that allows us a bit more access to the internals, and it seems to work great. Lambdas aren't actually stateless, except insomuch as they can be killed under low traffic. In practice, a lambda can live for up to 15 minutes before a new instance is spun up, and AWS handles pre-warming that instance for you if there's existing traffic to the service.

So long as Bramble doesn't rely on persisting something to disk, or very expensive-to-populate caches, I don't see it being fundamentally incompatible.

We've set it up so that it fetches the schema on startup and every couple of minutes, though I've been meaning to add the option to load the constituent / assembled schema from disk or S3 on startup instead of querying it from the downstream APIs; this would be much faster on startup than querying each service, and the schema rarely changes.
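The startup path described above could look something like the following: prefer a pre-assembled snapshot on disk (standing in for S3) for a fast cold start, fall back to fetching from the services, then refresh on a ticker. `fetchSchema`-style hooks here are hypothetical stand-ins, not bramble's actual internals.

```go
package main

import (
	"os"
	"sync"
	"time"
)

// schemaStore holds the current assembled schema behind a lock.
type schemaStore struct {
	mu     sync.RWMutex
	schema string
}

func (s *schemaStore) Set(sdl string) { s.mu.Lock(); s.schema = sdl; s.mu.Unlock() }
func (s *schemaStore) Get() string    { s.mu.RLock(); defer s.mu.RUnlock(); return s.schema }

// loadInitial prefers the snapshot on disk; if it's missing, it falls
// back to fetching from the downstream services (slower cold start).
func loadInitial(store *schemaStore, path string, fetch func() (string, error)) error {
	if b, err := os.ReadFile(path); err == nil {
		store.Set(string(b))
		return nil
	}
	sdl, err := fetch()
	if err != nil {
		return err
	}
	store.Set(sdl)
	return nil
}

// refreshEvery re-fetches the schema on an interval, keeping the
// warm Lambda instance's view of the graph current.
func refreshEvery(store *schemaStore, d time.Duration, fetch func() (string, error), stop <-chan struct{}) {
	t := time.NewTicker(d)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			if sdl, err := fetch(); err == nil {
				store.Set(sdl)
			}
		case <-stop:
			return
		}
	}
}

func main() {}
```

The refresh loop deliberately swallows fetch errors and keeps serving the last good schema, which matches the "rarely changes" observation above.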

I think the ideal setup would be:

asger-noer commented 6 months ago

Do you know how the cold-start problem would translate to cloud providers other than AWS? I think we should be cloud-provider agnostic, and relying on the specific internals of AWS would not be the right way to go. That said, I think your suggestions go a long way towards making sure that doesn't happen.

Load an assembled schema on startup from S3 (or similar), for fast startup

I do like the option to load a pre-assembled schema, preferably by moving the schema assembler into its own package and creating a separate tool for schema assembly.

Re-fetch the schema if we receive some schema related error

Having Bramble re-fetch the schema from a service when it encounters a schema-related error would be a great addition, though we'd have to be careful to protect the downstream services with circuit breakers, back-off, and so on.

Provide an endpoint that our build pipeline could hit when deploying new versions of the service to force the refresh

If the above is implemented, I think this would be somewhat redundant, since we'd issue a service schema re-fetch when encountering a mapping error in the first place. This re-fetch would happen before we even tried to query the downstream service.

I think this should be split into multiple issues, tackling the pre-assembly and schema re-fetch issues individually, since each has its own caveats and problems. What do you think of that?

Quantumplation commented 5 months ago

If the above is implemented, I think this would be somewhat redundant, since we'd issue a service schema re-fetch when encountering a mapping error in the first place. This re-fetch would happen before we even tried to query the downstream service.

Well, we'd attempt the query once, receive an error from one of the downstream services that the query didn't match its schema, and then re-fetch. So the idea behind the rebuild URL was to be a bit pre-emptive/optimistic for speed, but at scale a request is likely to trigger the re-fetch before the pipeline hook gets there anyway, so yeah, it's probably not needed.
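That "attempt once, re-fetch on schema error, retry" flow can be sketched as below. `isSchemaError` and the `execute`/`refresh` hooks are hypothetical stand-ins for whatever bramble would actually expose; detecting a schema mismatch by error text is only for illustration.

```go
package main

import (
	"errors"
	"strings"
)

// errSchemaMismatch simulates a downstream service rejecting a query
// that no longer matches its schema.
var errSchemaMismatch = errors.New("field not defined in schema")

// isSchemaError decides whether an error looks schema-related; a real
// implementation would inspect structured GraphQL error codes instead.
func isSchemaError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "not defined in schema")
}

// executeWithRefresh runs the query once; if a downstream service
// rejects it for schema reasons, it refreshes the schema and retries
// exactly once.
func executeWithRefresh(execute func() (string, error), refresh func() error) (string, error) {
	res, err := execute()
	if !isSchemaError(err) {
		return res, err
	}
	if rerr := refresh(); rerr != nil {
		return "", rerr
	}
	return execute()
}

func main() {}
```

Capping the retry at one keeps a genuinely broken service from looping: if the query still fails after the refresh, the error is surfaced to the caller.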

I think this should be split into multiple issues, tackling the pre-assembly and schema re-fetch issues individually, since each has its own caveats and problems. What do you think of that?

Yeah, absolutely; I'd suggest a total of 4 issues, which I'm happy to create and describe if you agree: