openHPI / poseidon

Scalable task execution orchestrator for CodeOcean
MIT License

Investigate Nomad Server redundancy #400

Closed mpass99 closed 1 year ago

mpass99 commented 1 year ago

Currently, Poseidon sends all its requests to a single Nomad server. When this server node fails, the whole Nomad cluster becomes unusable for Poseidon. In the context of this issue, we should identify, explore, and evaluate solutions to this problem.

A solution might be the configuration of DNS Failover (or DNS Load Balancing). Alternatively, we could also implement a custom handling of multiple servers in Poseidon.
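To make the second option more concrete, here is a minimal sketch of what custom handling of multiple servers could look like. It is not Poseidon code: `getFromAny` and `demoHTTPFailover` are hypothetical names, the servers are local test servers, and the Nomad endpoint `/v1/status/leader` is only used as an example path. Poseidon talks to Nomad through the Nomad Go library, so a real implementation would look different.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// getFromAny (hypothetical helper) sends a GET request for path to each base
// URL in order and returns the first response that arrives without a
// transport error. The caller has to close the response body.
func getFromAny(client *http.Client, baseURLs []string, path string) (*http.Response, error) {
	lastErr := errors.New("no servers configured")
	for _, base := range baseURLs {
		resp, err := client.Get(base + path)
		if err == nil {
			return resp, nil
		}
		lastErr = err // remember the failure and try the next server
	}
	return nil, lastErr
}

// demoHTTPFailover starts two local test servers, shuts one of them down and
// returns the URL of the server that answered plus the URL of the live one.
func demoHTTPFailover() (servedBy, live string, err error) {
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	})
	liveSrv := httptest.NewServer(handler)
	defer liveSrv.Close()
	deadSrv := httptest.NewServer(handler)
	deadURL := deadSrv.URL
	deadSrv.Close() // simulate a failed server node

	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := getFromAny(client, []string{deadURL, liveSrv.URL}, "/v1/status/leader")
	if err != nil {
		return "", "", err
	}
	defer resp.Body.Close()
	return "http://" + resp.Request.URL.Host, liveSrv.URL, nil
}

func main() {
	servedBy, live, err := demoHTTPFailover()
	fmt.Println(servedBy == live, err)
}
```

The downside of this approach is that the server list and the retry logic would have to be maintained in Poseidon itself, whereas a DNS-based solution keeps that knowledge in the infrastructure.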

mpass99 commented 1 year ago

In my research, I could identify the following solutions.

My favorite is the DNS solution, but as the DNS configuration is not managed by us in our OpenStack cluster, we might also consider the Nginx solution. What do you think?

MrSerth commented 1 year ago

> but as the DNS configuration is not managed by us in our OpenStack cluster

Actually, with the current Ansible + Terraform setup, the DNS entries are completely manageable by us. The one exception is names ending in .compute.internal, since these are managed automatically. However, we could also decide to use one of our Terraform-managed DNS names, which allow full customization.

Therefore, the failover handled by the HTTP libraries sounds smart (since it could be a solution with very little maintenance effort once configured). Do you know whether our (Nomad) library supports it?

mpass99 commented 1 year ago

> However, we could also decide to use one of our Terraform-managed DNS names, which allow full customization.

Great! Thanks for clarifying.

> Do you know whether our (Nomad) library supports it?

The Nomad library uses the standard Go net/http library. I haven't found any documentation on this, nor have I tried it in practice, but looking at the source code, I found the code responsible for handling multiple resolved addresses.

Therefore, I would continue by implementing the DNS solution. See codeocean-terraform!132.
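For illustration, the fallback behavior can be sketched as follows. This is a simplified stand-in and not the actual standard library code: Go's `net.Dialer` does this internally when a host name resolves to several addresses, and it additionally splits the timeout across addresses and races IPv4 against IPv6. The names `dialFirstReachable` and `demoDialFallback` are hypothetical, and the "servers" are local listeners.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// dialFirstReachable (hypothetical helper) tries the given addresses in order
// and returns the first connection that can be established. If every attempt
// fails, it returns the error of the first attempt.
func dialFirstReachable(ctx context.Context, addrs []string, perAttempt time.Duration) (net.Conn, error) {
	var firstErr error
	dialer := net.Dialer{Timeout: perAttempt}
	for _, addr := range addrs {
		conn, err := dialer.DialContext(ctx, "tcp", addr)
		if err == nil {
			return conn, nil
		}
		if firstErr == nil {
			firstErr = err
		}
	}
	if firstErr == nil {
		firstErr = errors.New("no addresses to dial")
	}
	return nil, firstErr
}

// demoDialFallback simulates one unreachable and one reachable server and
// returns the address we ended up connected to plus the reachable address.
func demoDialFallback() (connected, live string, err error) {
	liveLn, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", "", err
	}
	defer liveLn.Close()

	// Reserve a second port, then close the listener so connections are refused.
	deadLn, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", "", err
	}
	dead := deadLn.Addr().String()
	deadLn.Close()

	addrs := []string{dead, liveLn.Addr().String()}
	conn, err := dialFirstReachable(context.Background(), addrs, time.Second)
	if err != nil {
		return "", "", err
	}
	defer conn.Close()
	return conn.RemoteAddr().String(), liveLn.Addr().String(), nil
}

func main() {
	connected, live, err := demoDialFallback()
	fmt.Println(connected == live, err)
}
```

With a DNS name that resolves to all Nomad servers, the equivalent of `addrs` comes from the resolver, so no server list has to be configured in the client.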

MrSerth commented 1 year ago

The DNS changes were implemented and deployed. In our tests, Poseidon tried different IP addresses when one of the Nomad servers became unreachable. Hence, we achieved our goal of improving the Nomad server redundancy and can close this issue.

The only aspect we hadn't thought of was the deployment with a modified certificate that is required for the new DNS name. While the certificate creation got updated in our PR, new certificates are only created when none has been created before or when the existing one expires in less than a month. Neither condition was true, so no new certificates were issued. I fixed that (with a temporary local change) and ensured that new certificates were created.
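As an illustration of why no certificate was reissued, the renewal rule described above can be sketched like this. The actual logic lives in our deployment tooling, not in Go, so everything here is an assumption for demonstration purposes: `needsNewCertificate` and `sampleCert` are hypothetical names, the host names are made up, and the final host name check is one possible extra condition that would also catch a changed DNS name.

```go
package main

import (
	"crypto/x509"
	"fmt"
	"time"
)

// renewalWindow approximates the "less than a month" rule as 30 days.
const renewalWindow = 30 * 24 * time.Hour

// needsNewCertificate (hypothetical) reports whether a certificate should be
// issued. A nil cert means that no certificate has been created before.
func needsNewCertificate(cert *x509.Certificate, dnsName string) bool {
	if cert == nil {
		return true // rule 1: never created before
	}
	if time.Until(cert.NotAfter) < renewalWindow {
		return true // rule 2: expires in less than a month
	}
	// Assumed extra rule: reissue when the certificate does not cover the
	// DNS name that clients are going to use.
	return cert.VerifyHostname(dnsName) != nil
}

// sampleCert (hypothetical) builds an in-memory certificate for the demo that
// is valid for daysValid more days and lists dnsNames as subject alt names.
func sampleCert(daysValid int, dnsNames ...string) *x509.Certificate {
	return &x509.Certificate{
		NotAfter: time.Now().Add(time.Duration(daysValid) * 24 * time.Hour),
		DNSNames: dnsNames,
	}
}

func main() {
	cert := sampleCert(90, "nomad-1.example.org")
	// Same name as before: the two original rules see no reason to reissue.
	fmt.Println(needsNewCertificate(cert, "nomad-1.example.org"))
	// New shared DNS name: only the extra host name check triggers a reissue.
	fmt.Println(needsNewCertificate(cert, "nomad.example.org"))
}
```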