Prisma Horizontal Scaling

mcmar commented 6 years ago

Describe the bug When I try to scale Prisma horizontally by adding a second server, the first server logs:

Obtaining exclusive agent lock...
Obtaining exclusive agent lock... Successful.

then the second server logs just this and crashes:

Obtaining exclusive agent lock...

The reason appears to be this line of code: https://github.com/prismagraphql/prisma/blob/d5c97fe8f1c1ee223ec1392ebdf16f2545b2f763/server/servers/deploy/src/main/scala/com/prisma/deploy/migration/migrator/DeploymentSchedulerActor.scala#L52

It seems that Prisma is explicitly ensuring that there's only ever 1 cluster/server (prisma terminology changes) that can run against a DB.

To Reproduce Steps to reproduce the behavior:

Initialize Prisma server from Prisma AWS Fargate template (https://github.com/prismagraphql/prisma-templates/tree/master/aws)
Go to ECS cluster (name will be what you used during CF stack creation)
Click on Prisma service
Click Update in top-right corner
Increase Number of tasks from 1 to 2
Click Next, Next, Next, and Done

Expected behavior 2 Prisma instances would run against 1 DB

Screenshots None

Versions (please complete the following information):

OS: Irrelevant, it's in AWS
prisma CLI: prisma/1.11.1 (darwin-x64) node-v8.11.1
Prisma Server: 1.11.0 (per Cloudformation template)

Additional context Already reported in the slack channel. Was told to create a bug here. @divyenduz

terijyu commented 6 years ago

+1, encountering this problem as well

mavilein commented 6 years ago

Hey @mcmar ,

thanks for bringing this up. This is an area where we are lacking documentation. The prisma server can be started either with the management API enabled or not. If the management API is enabled it will try to acquire the agent lock on startup. This is to ensure that there is only one Prisma server at a time writing into the management tables. Your second server also has the management API enabled, hence you are seeing this log message.

The management API can simply be enabled or disabled in the Prisma server config, e.g.:

port: 60000
managementApiSecret: my-secret
rabbitUri: amqp://my-rabbitmq-server
enableManagementApi: true|false
databases:
  default:
    connector: mysql
    ...

In Prisma Cloud we are running Prisma like this for horizontal scalability:

Run exactly one Prisma server with the Management API enabled. Additionally, run multiple Prisma servers with the Management API disabled. Internally we call those server types primary and secondary.
Your load balancer in front of the servers must be set up like this:
- All requests to /management must be routed to the primary server.
- All other requests may be routed to any of the servers (primary + secondary).
In addition to this you will also need a RabbitMQ server, which we use for PubSub. We need it to publish change events about the data to all servers so that they can notify connected subscriptions over WebSocket (see the config entry rabbitUri above). If you don't want to run a RabbitMQ server on your own, I recommend CloudAMQP as a hosting provider.
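
A minimal sketch of the primary/secondary split described above, assuming the two servers share everything except `enableManagementApi`. The secret, RabbitMQ host, and database values are placeholders, and the `host`/`port`/`user`/`password` keys mirror the database block posted further down in this thread:

```yaml
# Primary: run exactly one instance; it acquires the agent lock
# and should receive all /management traffic.
port: 4466
managementApiSecret: my-secret
rabbitUri: amqp://my-rabbitmq-server
enableManagementApi: true
databases:
  default:
    connector: mysql
    host: my-db-host        # placeholder
    port: 3306
    user: my-user           # placeholder
    password: my-password   # placeholder
    migrations: true
---
# Secondary: run as many instances as needed; same config,
# but with the management API switched off.
port: 4466
managementApiSecret: my-secret
rabbitUri: amqp://my-rabbitmq-server   # must be the same RabbitMQ as the primary
enableManagementApi: false
databases:
  default:
    connector: mysql
    host: my-db-host        # placeholder
    port: 3306
    user: my-user           # placeholder
    password: my-password   # placeholder
    migrations: true
```

Keeping the two files identical apart from one flag makes it easier to spot drift between primary and secondaries.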

Does that help?

emmenko commented 6 years ago

in addition to this you will also need a RabbitMQ server, which we use for PubSub

Uh, is that a requirement or is it only necessary in case you use subscriptions? @mavilein Could you clarify that please?

mavilein commented 6 years ago

@emmenko : Right now it is required. We need RabbitMQ for these reasons:

1. as PubSub to power subscriptions over WebSockets
2. to store ServerSideSubscriptions/Webhooks in a queue. This allows us to eventually deliver Webhooks even if the server goes down due to a crash or reboot.
3. to propagate information about schema changes to all servers. We need this as each server holds a cache of the schema of each service. Not having a cache would mean adding significant latency to query execution.

I guess you are fine with point 1. Point 2 is a tradeoff you need to decide for yourself in your use case. Point 3 is currently a blocker.

If we can find a solution for point 3 we could repackage the Prisma server without the RabbitMQ dependency. We could enable this through a separate Docker image or configuration flag.

emmenko commented 6 years ago

I see. So if I want multiple replicas I also need to have a rabbitmq cluster on the side.

Do you plan to make the pubsub system configurable or is rabbitmq the only option? For example, I’m running my services on GCP and it would be easier to use google pubsub.

Thanks anyway for the explanation! 🙏

mcmar commented 6 years ago

@mavilein Is Prisma using AMQP 0.9.1 or 1.0? 1.0 will work with Apache ActiveMQ and Amazon MQ, which would make my life much easier. 0.9.1 would require me to spin up my own RabbitMQ service with its own ELB.

mavilein commented 6 years ago

@emmenko : RabbitMQ is currently the only option, but we have encapsulated our pubsub code into a neat interface. We could provide additional implementations for e.g. google pubsub. I just added a Feature request for this.

@mcmar : We are using the RabbitMQ Java client, so I think this is 0.9.1 then.

@mcmar @emmenko : Would you be happier if we supported Apache Kafka? We are considering adding it for another feature anyway.

emmenko commented 6 years ago

Thanks. For now I'm trying the stable/rabbitmq Helm chart. I think it should be fine. In our specific case, we run our services on K8s on Google Cloud, so for us an integration with Google PubSub would be perfect so that we don't have to manage that on our own 😉

emmenko commented 6 years ago

@mavilein I'm trying to deploy prisma with 1 primary and 2 secondary. The primary has the managementApi enabled, the secondaries do not.

However, after starting, the primary and one of the secondaries keep crashing with the error

Fatal error during deployment worker initialization: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://single-server/user/$b#-844349262]] after [300000 ms]. Sender[null] sent message of type "com.prisma.deploy.migration.migrator.DeploymentProtocol$Initialize$".

I noticed that all 3 prisma containers are trying to "obtain the agent lock". From what you wrote before I thought that only the primary is supposed to get the lock, or should all of them do it? In that case, any idea why I am still getting errors? 🤔

I'm using prisma:1.13.4.

emmenko commented 6 years ago

Now the primary and one of the secondaries are running but the 2nd secondary keeps crashing (no errors in the logs, only 👇)

Obtaining exclusive agent lock...
Initializing workers...
Successfully started 1 workers.
Server running on :4466
Version is up to date.

Am I doing something wrong? 🤔

emmenko commented 6 years ago

Tried also with 1 primary and 1 secondary. After a couple of min, the secondary crashed with the same error.

emmenko commented 6 years ago

I have a feeling that the management API is still enabled in both. I checked and I'm passing enableManagementApi: true to the primary and enableManagementApi: false to the secondary.

emmenko commented 6 years ago

In case it helps, here are the logs for the pods (for the timings)

$ kubectl get pods -w | grep prisma
prisma-primary-c6f64d69d-ckkbn          2/2       Running   0          3m
prisma-secondary-75b857b766-xx8jz       2/2       Running   0          3m
prisma-secondary-75b857b766-xx8jz   1/2       Error     0         6m
prisma-secondary-75b857b766-xx8jz   1/2       Running   1         6m
prisma-secondary-75b857b766-xx8jz   2/2       Running   1         8m

And here the logs for the primary

Obtaining exclusive agent lock...
Obtaining exclusive agent lock... Successful.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
Deployment worker initialization complete.
Initializing workers...
Successfully started 1 workers.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
Server running on :4466
Version is up to date.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.

and here for the secondary

Obtaining exclusive agent lock...
Initializing workers...
Successfully started 1 workers.
Server running on :4466
Version is up to date.
Fatal error during deployment worker initialization: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://single-server/user/$b#-1603503801]] after [300000 ms]. Sender[null] sent message of type "com.prisma.deploy.migration.migrator.DeploymentProtocol$Initialize$".

emmenko commented 6 years ago

I noticed that when I access http://localhost:4466/management on both containers, I get the GraphQL Playground. Is this supposed to work even if enableManagementApi is set to false?

mavilein commented 6 years ago

@emmenko : Oh my bad. I forgot to say that you need to use prisma-prod image. Only this one contains the necessary ifs to honor the enableManagementApi setting. We should really improve this experience. Thanks for continuing to dig 👍

emmenko commented 6 years ago

I forgot to say that you need to use prisma-prod image

Ooooh, thanks! 😇

I'll try that right away.

Btw, I'm happy to contribute to the documentation feedback with the experience I had so far. Let me know in case you need that 😉

emmenko commented 6 years ago

Works! 🙌

$ kubectl get pods -w | grep prisma
prisma-rabbitmq-0         1/1       Running   0          4h
prisma-rabbitmq-1         1/1       Running   0          4h
prisma-rabbitmq-2         1/1       Running   0          4h
prisma-primary-56d5699664-sn2sj         2/2       Running   0          35m
prisma-secondary-cbb5c87b8-c8qdn        2/2       Running   0          27m
prisma-secondary-cbb5c87b8-z8zgh        2/2       Running   0          23m

mavilein commented 6 years ago

@emmenko : Nice. 🎉 I have opened an issue to unify our 2 Docker images as they do not seem necessary to me and just cause confusion. Happy to come back to you to get your feedback on the docs when we have a first version ready! 🙏

mcmar commented 6 years ago

@mavilein I'm attempting to implement the pattern you described in which you route /management to prisma-primary and all other routes * to prisma-primary and prisma-secondary. I'm unable to implement that pattern in AWS using Application Load Balancers because ECS services can only register themselves in one target group. I'm working off of the fargate.yml template in the prisma-templates repo. How does prisma host their own servers in AWS? Do you use ECS? Do you have 2 separate prisma services? Do you use Application Load Balancers as opposed to Classic Load Balancers? I can't find a way to implement it using ECS with ALBs.

Here's the issue in ECS: https://github.com/aws/amazon-ecs-agent/issues/1351#issuecomment-412377706

mcmar commented 6 years ago

Hey @emmenko if you have a fairly generic kubernetes template for prisma with support for horizontal scaling, would you mind posting it or submitting a PR against https://github.com/prismagraphql/prisma-templates ? The current docs only show the single-server setup. I'm currently working on adding a second Cloudformation template for horizontal scaling. It'd be good to get something in there for Kubernetes too. I thought it'd be cool if we could all contribute back what we're learning and grow the OSS community around Prisma.

emmenko commented 6 years ago

@mcmar hey, sure thing! I’ll start working on that in the next days 👌

develomark commented 6 years ago

Handy thread. Thank you for the information. I'd like to cast my vote for Google PubSub also as we currently have a cloud function consuming Prisma subscriptions over HTTP and passing them into GCP PubSub. This would reduce the latency.

lethot commented 6 years ago

@emmenko @mavilein Hi, I'm in the same situation and a bit confused. I'm stuck at the prisma-prod image step: when I try to run the container with this image, I get an error saying I'm missing a SQL_INTERNAL_PASSWORD env var. I'm using Cloud SQL Postgres and I can't figure out what I missed or what this var is.

emmenko commented 6 years ago

Where are you running the containers? Kubernetes?

lethot commented 6 years ago

@emmenko kubernetes engine yes

emmenko commented 6 years ago

How do you pass the PRISMA_CONFIG?

lethot commented 6 years ago

here is my config

- name: PRISMA_CONFIG
  value: |
    port: 4466
    rabbitUri: amqp://...
    managementApiEnabled: false
    databases:
      default:
        connector: postgres
        host: 127.0.0.1
        port: 5432
        user: "$(PG_USERNAME)"
        password: "$(PG_PASSWORD)"
        migrations: true
        connectionLimit: 4

lethot commented 6 years ago

With the 'prismagraphql/prisma:1.14' image the container is OK, but I have the exclusive agent lock problem and the container restarts every 5 min. With the 'prismagraphql/prisma-prod' image the container doesn't even start and throws the missing SQL_INTERNAL_PASSWORD var error.

By the way, thanks for your help.

emmenko commented 6 years ago

Hmm the config looks good. I'm using those images and for me things work

images:
  prisma:
    repository: prismagraphql/prisma-prod
    tag: "1.14"   # quoted so YAML keeps it a string rather than a float
    pullPolicy: IfNotPresent
  cloudsql:
    repository: gcr.io/cloudsql-docker/gce-proxy
    tag: "1.11"
    pullPolicy: IfNotPresent

emmenko commented 6 years ago

Btw: I have one deployment for the "normal" prisma replicas which are connected to the LB, plus a deployment for the "management" prisma (1 replica only) that is not served by the LB (it's only used by port-forwarding).

Hopefully I manage to share my chart in the next weeks, in case it helps others ;)
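
The two-deployment layout described above could look roughly like this on Kubernetes. This is an untested sketch, not emmenko's actual chart: the resource names, labels, RabbitMQ host, and database values are hypothetical, and the Cloud SQL proxy sidecar is left out.

```yaml
# "Management" Prisma: a single replica with the management API enabled.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prisma-primary          # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels: {app: prisma, role: primary}
  template:
    metadata:
      labels: {app: prisma, role: primary}
    spec:
      containers:
        - name: prisma
          image: prismagraphql/prisma-prod:1.14
          ports:
            - containerPort: 4466
          env:
            - name: PRISMA_CONFIG
              value: |
                port: 4466
                rabbitUri: amqp://my-rabbitmq-server
                enableManagementApi: true
                databases:
                  default:
                    connector: postgres
                    host: my-db-host        # placeholder
                    port: 5432
                    user: my-user           # placeholder
                    password: my-password   # placeholder
                    migrations: true
---
# "Normal" Prisma replicas: scaled out, management API disabled.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prisma-secondary        # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels: {app: prisma, role: secondary}
  template:
    metadata:
      labels: {app: prisma, role: secondary}
    spec:
      containers:
        - name: prisma
          image: prismagraphql/prisma-prod:1.14
          ports:
            - containerPort: 4466
          env:
            - name: PRISMA_CONFIG
              value: |
                port: 4466
                rabbitUri: amqp://my-rabbitmq-server
                enableManagementApi: false
                databases:
                  default:
                    connector: postgres
                    host: my-db-host        # placeholder
                    port: 5432
                    user: my-user           # placeholder
                    password: my-password   # placeholder
                    migrations: true
```

A Service selecting only `role: secondary` would then sit behind the load balancer, while the primary is reached via port-forwarding as described above.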

lethot commented 6 years ago

😱 Thanks, you just made me realize that I forgot the tag on the prisma-prod image!!! Yes, I plan to do the same for management and replicas.

xcv58 commented 6 years ago

Is there any plan to update documentation about Horizontal Scaling, rabbitUri, etc.?

mcmar commented 6 years ago

Hi @emmenko, would you be able to post your kubernetes templates that you're using with prisma-prod and horizontal scaling? It looks like the current kubernetes instructions for prisma don't include rabbitMQ or horizontal scaling 😢 https://www.prisma.io/docs/tutorials/cluster-deployment/kubernetes-aiqu8ahgha

mcmar commented 6 years ago

@mavilein @divyenduz Can anyone on the Prisma team please provide a working template with horizontal scaling? Could be Cloudformation or K8s or anything. I've been trying for months to get this working, but I've got nothing. I documented the issue I'm having with ECS. I'd appreciate any help.

UPDATE: Used @dpetrick's repo and it worked. I'll see what I can do to make another chart by combining @dpetrick's solution with the helm-prisma repo. Perhaps a prisma-prod helm chart? I really like that @dpetrick's version includes the ingress config to use both the primary and secondary servers via the same port. Using port-forwarding will work for local deploys, but means that you can't use prisma-cloud.
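
For reference, path-based routing of this kind can be expressed with a Kubernetes Ingress. The sketch below is not taken from @dpetrick's repo. The Service names `prisma-primary` and `prisma-all` are hypothetical and would need to exist, with `prisma-all` selecting both primary and secondary pods, and an ingress controller must be installed in the cluster.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prisma                        # hypothetical name
spec:
  rules:
    - http:
        paths:
          # Management traffic goes to the primary only.
          - path: /management
            pathType: Prefix
            backend:
              service:
                name: prisma-primary  # hypothetical Service for the primary pod
                port:
                  number: 4466
          # Everything else may hit any server.
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prisma-all      # hypothetical Service for primary + secondaries
                port:
                  number: 4466
```

With `Prefix` path types the longest matching path takes precedence, so `/management` requests are not swallowed by the catch-all `/` rule.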

dpetrick commented 6 years ago

I whipped together a quick guide based on experiments I did myself with Kubernetes. Give it a try and tell me if it works for you. https://github.com/dpetrick/prisma-k8s-example.

Edit: I should mention that the example is derived from an actual working setup.

develomark commented 6 years ago

Hi @emmenko, I wonder how are you progressing with the GCP kubernetes templates?

emmenko commented 6 years ago

Hey guys, I ended up writing an article on how we did it in my team because it's a bit difficult to come up with something generic that fits all use cases.

https://techblog.commercetools.com/prisma-horizontal-scaling-a-practical-guide-3a05833d4fc3

Have a read and hopefully it's somehow helpful to a lot of people 😊

jhalborg commented 6 years ago

Scaling Prisma horizontally right now seems to require quite a bit of DevOps knowledge. For DevOps novices such as myself, is there any hope that this will be made easier in Prisma sometime soonish? I notice that this issue is tagged with docs and not with feature 😳

mavilein commented 6 years ago

@jhalborg : As a first step we will improve the docs and then follow up with some changes to make it significantly easier to run Prisma in production. We will make RabbitMQ optional if you don't use subscriptions for example. Thanks for letting us know! 🙏

jhalborg commented 6 years ago

@mavilein - Thanks! But what about simply scaling Prisma servers horizontally with no subscriptions - is that possible to do somewhat easily?

As far as I understand, if I e.g. deploy to Heroku and scale up more dynos, that won't work seeing as it needs a master/slave setup for propagating changes from the /management endpoint, correct?

emmenko commented 6 years ago

As far as I understand, if I e.g. deploy to Heroku and scale up more dynos

Well it's not really a ~~master/slave~~ (primary/secondaries). You can deploy and scale up the servers that are configured without the /management endpoint. Then have separately a single server with the /management API enabled.

With the RabbitMQ dependency going away, things are hopefully going to be a bit easier to set up and manage. Looking forward to that!

jhalborg commented 6 years ago

To be honest, I'm still very confused. Perhaps I'll just have to wait on new docs and see if that helps.

As it stands now, Prisma seems to be the weakest link in our setup. The API scales horizontally automagically, but that doesn't help much when it needs to hit a single Prisma server to access the DB.

mcmar commented 6 years ago

@jhalborg For Heroku, you would do 3 things:
1) Use a RabbitMQ Heroku add-on
2) Change the prisma docker image to prisma-prod
3) Add the rabbitUri: amqp://... and enableManagementApi: false props to your PRISMA_CONFIG env var.

jhalborg commented 6 years ago

Thanks @mcmar - I might be slow, but I'm still unsure, I haven't worked with RMQ before. If I set those two variables in my config, where would the management API (primary server) then be hosted? And will it "just work" if I setup that addon and refer to it in the config?

I've searched, but can't find any guides on the topic except for the one @emmenko wrote - which I'm sure is pretty awesome, but once again requires learning Kubernetes and RabbitMQ

emmenko commented 6 years ago

I'm not familiar enough with Heroku to be able to help you out further. Maybe someone who is can jump in? Have you also tried asking for help in the Slack channels?

mcmar commented 6 years ago

@jhalborg I didn't provide instructions for the primary server. If you want to also host that in Heroku, then you would clone your Heroku environment and go through these steps:
1) Copy your CLOUDAMQP_URL (or similar name) env var from above. DO NOT create a new RabbitMQ add-on. They need to point to the same server.
2) Change the prisma docker image to prisma-prod (same as above)
3) Add the rabbitUri: amqp://... and enableManagementApi: true props to your PRISMA_CONFIG env var.
4) Only spin up one server for primary.

mcmar commented 6 years ago

You'll end up with 2 urls. One for primary and one for secondary servers. Use the primary URL for deployments and prisma-cloud. Use the secondary URL for your graphql-yoga or apollo-server server.
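
As an illustration, the `prisma.yml` used by `prisma deploy` would then point at the primary URL. The host name, service/stage path, and datamodel file name are made up for this sketch:

```yaml
# prisma.yml: deployments go through the primary, which serves /management
endpoint: https://my-prisma-primary.example.com/my-service/dev   # hypothetical primary URL
datamodel: datamodel.graphql                                     # hypothetical file name
```

The application server (graphql-yoga or apollo-server) would be configured with the secondary URL instead, using the same service/stage path.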

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 10 days if no further activity occurs. Thank you for your contributions.

xcv58 commented 5 years ago

I tried to reproduce horizontal scaling on localhost, but it seems no messages are sent to RabbitMQ. Here is the docker-compose.yml file: https://github.com/xcv58/Prisma-Horizontal-Example/blob/master/docker-compose.yml

Could you please point out what's wrong with the configuration? Thanks!

Fixed in https://github.com/xcv58/Prisma-Horizontal-Example/commit/19de8480ab7aa30c2fb98d3d46b6614414e3abf0:

I should use enableManagementApi instead of ~~managementApiEnabled~~
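
For anyone trying the same thing locally, a docker-compose layout with the corrected key might look like the sketch below. It is not a copy of the linked file and has not been tested against this Prisma version. The service names, credentials, and host port 4467 are assumptions, and `depends_on` only orders container startup without waiting for RabbitMQ or Postgres to be ready.

```yaml
version: "3"
services:
  rabbitmq:
    image: rabbitmq:3
    environment:
      RABBITMQ_DEFAULT_USER: prisma   # assumed credentials
      RABBITMQ_DEFAULT_PASS: prisma
  postgres:
    image: postgres:10
    environment:
      POSTGRES_USER: prisma           # assumed credentials
      POSTGRES_PASSWORD: prisma
  prisma-primary:
    image: prismagraphql/prisma-prod:1.14
    depends_on: [rabbitmq, postgres]
    ports:
      - "4466:4466"
    environment:
      PRISMA_CONFIG: |
        port: 4466
        rabbitUri: amqp://prisma:prisma@rabbitmq:5672
        enableManagementApi: true
        databases:
          default:
            connector: postgres
            host: postgres
            port: 5432
            user: prisma
            password: prisma
            migrations: true
  prisma-secondary:
    image: prismagraphql/prisma-prod:1.14
    depends_on: [rabbitmq, postgres]
    ports:
      - "4467:4466"                   # assumed host port for the secondary
    environment:
      PRISMA_CONFIG: |
        port: 4466
        rabbitUri: amqp://prisma:prisma@rabbitmq:5672
        enableManagementApi: false
        databases:
          default:
            connector: postgres
            host: postgres
            port: 5432
            user: prisma
            password: prisma
            migrations: true
```

Both Prisma services point at the same RabbitMQ and the same database. Only `enableManagementApi` differs.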

Siyfion commented 5 years ago

Do people/@emmenko not still see issues when doing Prisma version updates? Most K8s / ECS setups will provision the new “management enabled” instance in parallel before turning off the old one. Won’t that cause a lock error?
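
The thread does not answer this question. On Kubernetes, one option that may avoid the overlap is the `Recreate` deployment strategy, which terminates the old pod before creating the new one, at the cost of a short management API downtime during updates. Whether the agent lock is released quickly enough for the new pod is an assumption that this sketch does not verify. The names and labels are hypothetical, and the `PRISMA_CONFIG` env var is omitted for brevity.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prisma-primary        # hypothetical name
spec:
  replicas: 1
  strategy:
    type: Recreate            # stop the old primary before starting the new one
  selector:
    matchLabels: {app: prisma, role: primary}
  template:
    metadata:
      labels: {app: prisma, role: primary}
    spec:
      containers:
        - name: prisma
          image: prismagraphql/prisma-prod:1.14
          ports:
            - containerPort: 4466
```

The secondaries can keep the default rolling update, since they do not take the lock when the management API is disabled.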

prisma / prisma1

Prisma Horizontal Scaling #2850