ipfs-search / ipfs-search-deployment

Ansible playbooks for the deployment of ipfs-search.com

Specification/architecture for frontend cluster #36

Open dokterbob opened 1 year ago

dokterbob commented 1 year ago

Thus far, we're serving all requests from a single nginx frontend node, running on a different provider.

As we now have experience with Hetzner Cloud and linking it to the bare metal, we are ready to set up an actual frontend cluster and/or load balancing.

This issue serves as a place to clarify the specifications, after which the frontend can be deployed.

Considerations / observations

Questions

  1. Balancer: Hetzner or some FOSS solution, like HAProxy, or just DNS round robin?
  2. SSL termination: on frontend nodes or balancer?
  3. API requests: serve from frontend nodes instead of index nodes? (Requires more RAM)
  4. OpenSearch: run coordinating-only node on frontend nodes, or query from index nodes? (Requires more RAM and CPU)
  5. Caching: use Redis as shared cache for nginx or cache on local nodes, or a combination? Perhaps local NVMe makes a distributed disk-based cache a good solution, e.g. consistent distribution of queries based on hash of URL.
  6. Any further questions?
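Regarding question 5, the consistent-distribution idea can be done in stock nginx: the `hash … consistent` upstream directive (ketama) routes each URL to the same node, so each frontend's local NVMe cache holds a distinct slice of the keyspace. A minimal sketch, assuming hypothetical internal hostnames and ports:

```nginx
# Sketch: consistent hashing of requests across local NVMe caches.
# Hostnames, ports and cache sizes below are placeholders.
upstream search_cache {
    hash $request_uri consistent;     # ketama: same URL -> same node
    server frontend1.internal:8080;
    server frontend2.internal:8080;
    server frontend3.internal:8080;
}

proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=api_cache:64m
                 max_size=50g inactive=24h use_temp_path=off;
```

Adding or removing a node with consistent hashing only remaps a small fraction of URLs, so most of the distributed cache stays warm.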
spookyrecharge commented 1 year ago
  1. With one HAProxy for ingress and three frontend nodes behind the balancer, HAProxy is superior here (but the availability of the whole load-balancing set will depend on a single HAProxy node).

  2. Better to do that at the endpoint, in my opinion.

  3. Not sure what the utilization is on your frontend and index nodes.

  4. We can try both variants and do some tests.

  5. Disk-based is just alright since it's NVMe. But of course, it depends on your service SLA ;)

  6. I understand that you want to load balance the frontend nodes. But which way should we pick in order to have high availability?

dokterbob commented 1 year ago

Thanks for the feedback!

I'll ask some further questions below:

> 1. With one HAProxy for ingress and three frontend nodes behind the balancer, HAProxy is superior here (but the availability of the whole load-balancing set will depend on a single HAProxy node).

I would prefer not to have a SPOF at the load balancer. So if HAProxy, we should run it in an HA setup. Note that Hetzner's balancer is redundant by default.
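An HA HAProxy setup usually means an active/passive pair sharing a floating IP via keepalived (VRRP). A minimal sketch, with the interface, router ID and VIP as placeholders:

```
# keepalived.conf sketch for an active/passive HAProxy pair.
# Interface name, virtual_router_id and VIP are assumptions.
vrrp_instance LB_VIP {
    state MASTER              # BACKUP on the second HAProxy node
    interface eth0
    virtual_router_id 51
    priority 101              # lower (e.g. 100) on the backup node
    virtual_ipaddress {
        10.0.0.100            # floating IP that DNS points at
    }
}
```

If the master dies, the backup claims the VIP within a few seconds, so the balancer itself is no longer a single point of failure.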

> 2. Better to do that at the endpoint, in my opinion.

Do you mean at the load balancer or at the frontend server? If you meant terminating SSL on the frontend, why do you think this is better?

> 3. Not sure what the utilization is on your frontend and index nodes.

Our current frontend is a low-performance VM and has very little utilisation. However, within about a week we'll be integrated into the official IPFS GUI, so we should see some actual traffic soon.

The index nodes have a highly variable load, typically between 30% and 70% (system load divided by the number of CPUs).

> 4. We can try both variants and do some tests.

Thinking about it now, perhaps that's premature optimisation. But let's keep this as an option for the future.

> 5. Disk-based is just alright since it's NVMe. But of course, it depends on your service SLA ;)

Reasonable usability is our objective. If our users max out the NVMe, we might want to call CloudFlare. ;)

> 6. I understand that you want to load balance the frontend nodes. But which way should we pick in order to have high availability?

This is the main question indeed. If we start out with several (e.g. 3) frontend servers configured exactly as our current frontend, I see several options:

a. Round robin DNS (i.e. poor man's load balancing).
b. Hetzner load balancer.
c. HAProxy (or similar) in an HA configuration (e.g. 2 or 3 nodes).

Perhaps we could start with a. and move to b. as it becomes necessary?
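Option a. is just publishing one A record per frontend node; resolvers rotate through them. A sketch with placeholder IPs (the low TTL is so a dead node can be pulled from rotation quickly):

```
; DNS round robin sketch. Hostname, TTL and IPs are placeholders.
api.ipfs-search.com.  300  IN  A  192.0.2.10
api.ipfs-search.com.  300  IN  A  192.0.2.11
api.ipfs-search.com.  300  IN  A  192.0.2.12
```

The known caveat: DNS round robin does no health checking, so clients keep hitting a failed node until its record is removed and the TTL expires.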

spookyrecharge commented 1 year ago
  1. I have no experience with Hetzner's balancer, but we can try it :)
  2. Terminate SSL on the frontend nodes, because you may want to scale out the processing power for terminating SSL sessions in the future.
  3. Still hard to say for me. We could start with serving the API on the frontend nodes?
  4. +
  5. CloudFlare is pretty good, indeed
  6. Yes, sounds like a plan.
dokterbob commented 1 year ago

Thanks again for your thoughts on this.

There's just one thing I'm not fully decided about yet: the SSL termination.

Considerations around SSL termination

Frontend termination of SSL

Pros

  1. Better distribution of SSL termination load.
  2. Improved privacy (unless SSL is used from the LB to the frontend hosts).

Cons

  1. Increased load on frontend servers.
  2. Increased attack surface (e.g. TLS DoS, more locations holding the private certificate, increased complexity).
  3. Requires a purchased SSL certificate, dirty workarounds to make ACME certs work, or DNS verification to make domains work.
  4. No caching of SSL sessions when new connections go to a different host.
  5. The load balancer can't use request info to distribute load.

Hetzner LB termination

Pros

  1. Dirt easy to configure.
  2. Reduced complexity.
  3. SSL offloading to specialised hardware/software.
  4. Reduced attack surface and security offloading to Hetzner.

Cons

  1. Let's Encrypt certs only work when the domain is hosted on Hetzner's DNS. However, Hetzner doesn't support ANY records, which are required by our frontend CDN.

Tentative conclusion

Having read the above, I think it perhaps makes sense to simply buy an old-school certificate and continue working with that. I am currently investigating prices. Any further thoughts or feedback are welcome, though.

dokterbob commented 1 year ago

After some thinking and research, it does seem that ACME allows for multiple certificates for the same domain. We can use certbot-dns-cloudflare for DNS-based challenges and can create wildcard certificates.
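With the certbot-dns-cloudflare plugin installed, a wildcard certificate via the DNS-01 challenge comes down to one command. A sketch; the credentials path is an assumption:

```shell
# Sketch: wildcard cert via Cloudflare DNS-01 challenge.
# Assumes certbot-dns-cloudflare is installed and an API token
# lives in /etc/letsencrypt/cloudflare.ini (mode 0600).
certbot certonly \
  --dns-cloudflare \
  --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini \
  -d 'ipfs-search.com' -d '*.ipfs-search.com'
```

Because the challenge is DNS-based, each frontend node can obtain its own copy of the certificate independently, without any HTTP reachability tricks behind the balancer.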

We can then configure Hetzner's load balancer and, possibly later, a CDN for TCP-based, least-connections balancing with PROXY protocol.
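With TCP balancing plus PROXY protocol, nginx on the frontend nodes has to be told to expect the PROXY header, or it will reject connections. A sketch, with the LB's source range as a placeholder:

```nginx
# Sketch: accept PROXY protocol from the load balancer so nginx
# logs/limits use the real client IP. The CIDR is a placeholder
# for the balancer's private network.
server {
    listen 443 ssl proxy_protocol;
    set_real_ip_from 10.0.0.0/16;
    real_ip_header   proxy_protocol;
    # ssl_certificate / locations as in the current frontend config
}
```

Note that only the balancer may be allowed to speak PROXY protocol to this port; direct client connections would fail the header parse.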

As for Let's Encrypt/ACME with CloudFlare using Ansible, I would suggest we keep using certbot, just set it up with the Cloudflare plugin.
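In the playbooks, that could be a couple of tasks per frontend node. A sketch assuming Debian/Ubuntu hosts; the package names and variable are assumptions:

```yaml
# Sketch: install certbot with the Cloudflare DNS plugin and deploy
# credentials. Package names assume Debian/Ubuntu repositories;
# cloudflare_api_token is a hypothetical inventory variable.
- name: Install certbot with the Cloudflare DNS plugin
  apt:
    name:
      - certbot
      - python3-certbot-dns-cloudflare
    state: present

- name: Deploy Cloudflare API credentials for certbot
  copy:
    dest: /etc/letsencrypt/cloudflare.ini
    content: "dns_cloudflare_api_token = {{ cloudflare_api_token }}\n"
    owner: root
    mode: "0600"
```

Renewal then works unattended via the standard certbot timer, since the DNS challenge needs no open HTTP port on the nodes.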

dokterbob commented 1 year ago

I would suggest we start with a cluster of 3 frontend nodes.