Investigate Nomad Server redundancy

mpass99 commented 1 year ago

Currently, Poseidon sends all its requests to a single Nomad server. When this server node fails, the whole Nomad cluster becomes unusable for Poseidon. In the context of this issue, we should identify, explore, and evaluate solutions that might fix this problem.

One solution might be to configure DNS failover (or DNS load balancing). Alternatively, we could implement custom handling of multiple servers in Poseidon.

mpass99 commented 1 year ago

In my research, I identified the following solutions:

Consul: Nomad's documentation describes only Consul's automatic clustering as a method for Nomad server fault tolerance [1] [2]. We are reluctant to adopt this solution, as it would require maintaining one more system.
Anycast: DNS failover (or DNS load balancing) is typically not performed on the client side but transparently, using Anycast [3]. By configuring Anycast routing in OpenStack, we might achieve the failover functionality [4]. This solution seems complicated and is unfamiliar to us.
Nginx: An Nginx proxy in front of our Nomad servers can provide the failover (or load balancing) functionality. As we consider a simple Nginx proxy to be more fault-tolerant than a Nomad server, this would improve the situation.
DNS: When more than one IP address is returned per domain name, it is up to the HTTP library to switch to another IP if one IP (i.e., one Nomad server) is not reachable. This way, we move the failover responsibility to the HTTP library [5] [6] [7].
Poseidon: We could implement the failover functionality in Poseidon itself. However, this requires maintaining additional functionality that is already provided by other systems. Also, as "HA is difficult", we might introduce an error-prone feature.

My favorite is the DNS solution, but as the DNS configuration is not managed by us in our OpenStack cluster, we might also consider the Nginx solution. What do you think?

MrSerth commented 1 year ago

but as the DNS configuration is not managed by us in our OpenStack cluster

Actually, with the current Ansible + Terraform setup, the DNS entries are completely manageable by us. The one exception is names ending in .compute.internal, since these are managed automatically. However, we could also decide to use one of our Terraform-managed DNS names, which allow full customization.

Therefore, the failover handled by the HTTP libraries sounds smart (since it could be a solution with very little maintenance effort once configured). Do you know whether our (Nomad) library supports it?

mpass99 commented 1 year ago

However, we could also decide to use one of our Terraform-managed DNS names, which allow full customization.

Great! Thanks for clarifying

Do you know whether our (Nomad) library supports it?

The Nomad library uses the standard Go net/http library. I haven't found any documentation on this, nor have I tried it in practice, but looking at the source code, I found the code responsible for handling multiple resolved addresses.

Therefore, I would continue by implementing the DNS solution. See codeocean-terraform!132.

MrSerth commented 1 year ago

The DNS changes were implemented and deployed. In our tests, Poseidon tried different IP addresses when one of the Nomad servers became unreachable. Hence, we achieved our goal of improving the Nomad server redundancy and can close this issue.

The only aspect we hadn't thought of was the deployment of a modified certificate, which is required for the new DNS name. While the certificate creation got updated in our PR, new certificates are only created when none exists yet or when the existing one expires in less than a month. Neither condition applied, so no new certificates were issued. I fixed that (with a temporary local change) and ensured that new certificates were created.

openHPI / poseidon

Investigate Nomad Server redundancy #400