google / grr

GRR Rapid Response: remote live forensics for incident response
https://grr-doc.readthedocs.io/
Apache License 2.0

Guide on how to do large scale deployment #639

Open OmarDarwish opened 5 years ago

OmarDarwish commented 5 years ago

Currently, the documentation is sparse on the following:

Would it be possible to elaborate or have a guide addressing this?

mbushkov commented 5 years ago

@OmarDarwish, I'll prioritise writing this piece of documentation (can't provide an ETA, though). Do you have a specific platform in mind (e.g. AWS or Google Cloud)?

OmarDarwish commented 5 years ago

We are a Google shop so that would be immensely helpful. In the meantime, I've been digging through source to understand how clustering and discovery works. Could you point me in the right direction?

OmarDarwish commented 5 years ago

Questions:

  1. How do Frontends know about each other?
  2. How can clients load balance between front ends?
  3. How does the AdminUI fit in?

After digging through Fleetspeak source... my guesses are:

  1. If Frontends share a db, this is good enough
  2. There needs to be a PROXY-aware load balancer (LB) in front of the Frontends. Clients talk to this LB
  3. ???

grrrrrrrrr commented 5 years ago

1./3. All components find each other via the db. If you point them all at the same database server, there is nothing else to do.

For 2., GRR does not have any built-in load balancing. The idea here is that you run multiple frontend servers and put some off-the-shelf load balancer in front.

Note that Fleetspeak is not yet enabled in GRR, so while you can use it in theory, the default GRR installation currently uses its own comms protocol.

OmarDarwish commented 5 years ago

Thanks! How can I confirm that each component is healthy and able to talk to its peers?

mbushkov commented 5 years ago

@OmarDarwish if the component can write to the common database, you can consider it healthy. Normally components simply fail to start if there are issues with the datastore configuration or with the datastore itself. When setting up a component you can run it by hand (https://grr-doc.readthedocs.io/en/latest/installing-grr-server/troubleshooting.html?highlight=logs#i-cannot-start-any-some-of-the-grr-services-using-the-init-d-scripts) and check that it operates correctly. To reiterate what @grrrrrrrrr has said: components communicate with each other exclusively through the datastore. If all components can access the datastore, you have a correct setup.
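
Since "all components can access the datastore" is the real health criterion, a quick connectivity probe against the shared database is a handy first check when bringing up a new component host. A minimal sketch (not part of GRR itself), assuming a MySQL datastore; every hostname, credential, and database name below is a placeholder:

```python
# Minimal connectivity probe for the shared GRR datastore (MySQL assumed).
# All values below are placeholders; use whatever your server config points at.
import pymysql  # third-party driver: pip install pymysql


def can_reach_datastore(host="db.internal", port=3306, user="grr",
                        password="changeme", database="grr"):
    """Returns True if this host can open a connection to the shared database."""
    try:
        conn = pymysql.connect(host=host, port=port, user=user,
                               password=password, database=database,
                               connect_timeout=5)
        conn.close()
        return True
    except pymysql.MySQLError:
        return False


if __name__ == "__main__":
    print("datastore reachable:", can_reach_datastore())
```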

Answering your previous questions:

How do Frontends know about each other?

They don't need to know each other. Frontends run in parallel and write to the same datastore.

How can clients load balance between front ends?

Unfortunately we don't have a doc dedicated to this. However, it should be similar to running any other distributed HTTP server behind Nginx or Apache and load balancing your deployment. (see, for example: https://docs.nginx.com/nginx/admin-guide/load-balancer/http-load-balancer/)
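
For illustration, a minimal Nginx sketch along those lines (placed in the http context, e.g. a conf.d file), assuming the frontends listen on port 8080; the hostnames, external port, and omitted TLS setup are placeholders to adapt to your deployment:

```nginx
# Sketch only: hostnames and ports are placeholders.
upstream grr_frontends {
    server frontend1.internal:8080;
    server frontend2.internal:8080;
    server frontend3.internal:8080;
}

server {
    # Clients are pointed at this address (e.g. via Client.server_urls
    # in the client config).
    listen 80;
    server_name grr-frontend.example.com;

    location / {
        proxy_pass http://grr_frontends;
    }
}
```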

How does the AdminUI fit in?

AdminUI provides all its functionality (starting flows, hunts, etc) by writing/reading to/from the datastore. We have a guide on how to run it behind Nginx/Apache (https://grr-doc.readthedocs.io/en/latest/maintaining-and-tuning/user-management/running-behind-apache.html).

One more thing: the GRR server deb package comes with init.d configuration files that start all GRR components on the same machine at system startup. If you want a machine dedicated to a single type of component only (e.g. workers), you can use the existing init.d setup as a base and just remove the parts that initialize the components you don't need (see https://github.com/google/grr/blob/master/debian/grr-server.service#L12).

OmarDarwish commented 5 years ago

Is there a way to check them remotely after they're up? I'm using TCP port checking, but I'm wondering if there's a more comprehensive check I'm missing.

I'd like to throw the workers into an autoscaling group and I'm looking for a way to heartbeat them. Same question for the Frontends, and maybe the UIs.

mbushkov commented 5 years ago

If you set the Monitoring.http_port option in your server's config file, then a GRR component running with this config file will start a stats server on that port (with a socket listening on all interfaces).

Then you can access your component's stats with: http://<host>:<monitoring port>/varz

You can check this URL to heartbeat the component and also to collect additional stats (like CPU usage) about the running process.
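
As a sketch of how that endpoint could feed an external heartbeat check (hostnames and the port below are placeholders; use whatever you set Monitoring.http_port to):

```python
# Poll each component's /varz stats endpoint as a heartbeat.
# Hostnames and the port are placeholders for your own deployment.
import urllib.request

MONITORING_PORT = 44451  # whatever Monitoring.http_port is set to
HOSTS = ["frontend1.internal", "worker1.internal", "adminui1.internal"]


def is_healthy(host, port=MONITORING_PORT, timeout=5):
    """Returns True if the component's stats server answers /varz with HTTP 200."""
    try:
        with urllib.request.urlopen(
                f"http://{host}:{port}/varz", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


if __name__ == "__main__":
    for host in HOSTS:
        print(host, "OK" if is_healthy(host) else "DOWN")
```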

atkinsj commented 5 years ago

I scale workers within the same host, so Monitoring.http_port attempts to bind multiple times and fails. Is there a Monitoring.http_port_max like there is for frontends to randomise this?

grrrrrrrrr commented 5 years ago

Hm, there currently isn't. It's very simple to add this, though; I'll send a PR later today.

In the meantime, if you set the port to 0 you can start multiple processes on the same machine - just with monitoring disabled.

grrrrrrrrr commented 5 years ago

The Monitoring.http_port_max config option is implemented in https://github.com/google/grr/commit/039341bdb4990c3bb8c59c79b10a3066af13f0e4
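
For reference, the two options would then look something like this in the server YAML config (the file path and port numbers are examples only):

```yaml
# e.g. in /etc/grr/server.local.yaml -- values below are examples only.
Monitoring.http_port: 44451
# With the max set, additional processes on the same host pick a free port
# in the 44451-44460 range instead of failing to bind.
Monitoring.http_port_max: 44460
```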

atkinsj commented 5 years ago

Thanks! I mostly wanted this so I could keep using the same configuration for all components while still running multiple workers on one host. Now I can also use AWS ALB health checks to hit http://frontend:monitoring/varz and track health, so thank you!