openHPI / poseidon

Scalable task execution orchestrator for CodeOcean
MIT License

Network access becomes unavailable #490

Closed: MrSerth closed this issue 5 months ago

MrSerth commented 11 months ago

An execution environment providing network access seems to work correctly at first. However, after some time (and some unknown events), the environment loses its network access. The allocation itself is still running on Nomad, but unfortunately without any possibility of reaching the internet.

Within a Bash container, you can test network access with:

```sh
curl api.ipify.org
```

So far, we don't know when the error occurs. However, resynchronizing the environment from CodeOcean fixes the issue.

(Screenshot from 2023-11-01, 18:52)

mpass99 commented 6 months ago

For me, the issue occurs immediately, even right after synchronizing the environment.
We can narrow the error down to `cni/secure-bridge`, as the issue is resolved by replacing `cni/secure-bridge` with `bridge` in the network mode definition.
To be more precise, internet access also works with the `cni/secure-bridge` network mode when replacing the specified routes with just a wildcard: `{ "dst": "0.0.0.0/0" }` (see the excerpt below). I will continue investigating tomorrow why the routes configuration breaks internet access.
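
For orientation, the wildcard variant in the context of the conflist's IPAM section could look roughly like the following excerpt; the `host-local` IPAM type and the surrounding structure are assumptions about our secure bridge configuration, only the wildcard route itself is taken from the test above:

```json
"ipam": {
  "type": "host-local",
  "routes": [
    { "dst": "0.0.0.0/0" }
  ]
}
```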

mpass99 commented 6 months ago

It seems the reason is that the nameservers are not reachable for DNS resolution.
When Nomad starts a container, it configures three nameservers from our local network (10.224.x.x).
Since these address ranges are not defined in the `cni/secure-bridge.conflist`, the containers cannot reach the nameservers and therefore cannot resolve domain names.

We might either statically add these addresses to the route configuration or parse the output of `resolvectl` to determine the DNS servers dynamically (a sketch of the latter follows).
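
A minimal sketch of the dynamic option, assuming systemd-resolved is in use and `resolvectl dns` lists plain IPv4 nameservers (this is an illustration, not part of our current setup):

```sh
# List the DNS servers known to systemd-resolved and extract the IPv4 addresses.
resolvectl dns | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | sort -u
```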

mpass99 commented 6 months ago

We discussed this issue and found it most surprising that, in some cases, the containers can resolve domain names even though the nameservers should not be routable.
However, we agreed that the underlying issue is not that the containers cannot reach our nameservers, but that they are not using 8.8.8.8, the nameserver configured via the Docker daemon. We assumed that this might be caused by the introduction of the DNS option in the CNI secure bridge release.

Details

Changing this option, we see that nothing changes. However, when digging deeper into the container configuration, we see that the container option `ResolvConfPath` differs from the default (value: `/opt/nomad/data/alloc/xyz/default-task/resolv.conf`). With this hint pointing at Nomad, we read [the documentation](https://developer.hashicorp.com/nomad/docs/job-specification/network) more carefully again and notice:

> [dns](https://developer.hashicorp.com/nomad/docs/job-specification/network#dns) ([DNSConfig](https://developer.hashicorp.com/nomad/docs/job-specification/network#dns-parameters): nil) - Sets the DNS configuration for the allocations. By default all task drivers will inherit DNS configuration from the client host. DNS configuration is only supported on Linux clients at this time. Note that if you are using a `mode="cni/*"`, these values will override any DNS configuration the CNI plugins return.

So, because we are using a `cni/*` mode, the Nomad DNS configuration always overrides other configurations.

Therefore, we had to set the DNS servers via the Nomad allocation configuration (which we define in Poseidon). With these changes, we are now able to access the internet from network-enabled runners again.
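
For illustration, a minimal sketch of how this can be expressed with the Nomad Go API (`github.com/hashicorp/nomad/api`); the function and group names are made up and do not mirror Poseidon's actual job definition code:

```go
package main

import (
	"fmt"

	nomadApi "github.com/hashicorp/nomad/api"
)

// withDNS attaches a cni/secure-bridge network with an explicit nameserver
// to the given task group. Because a "cni/*" mode is used, this Nomad-level
// DNS configuration overrides anything the CNI plugins (or the Docker
// daemon) would otherwise provide.
func withDNS(group *nomadApi.TaskGroup, servers []string) {
	group.Networks = []*nomadApi.NetworkResource{{
		Mode: "cni/secure-bridge",
		DNS:  &nomadApi.DNSConfig{Servers: servers},
	}}
}

func main() {
	group := nomadApi.NewTaskGroup("default-group", 1)
	withDNS(group, []string{"8.8.8.8"})
	fmt.Println(group.Networks[0].DNS.Servers)
}
```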

MrSerth commented 6 months ago

Currently, we are in the process of enabling full IPv6 connectivity (both between our internal hosts and from containers to the internet). As part of this setup, we might also need to configure our secure bridge to work with IPv6 (while excluding internal resources, probably the delegated /64 prefix).

MrSerth commented 6 months ago

Our latest changes work well and ensure we always have the desired DNS settings 💪 Unfortunately, however, they do not completely prevent allocations from losing their network. I was just able to reproduce the issue:

  1. Create a network-enabled execution environment
  2. Execute a network command, e.g., `curl api.ipify.org`
  3. Restart the Docker service on the respective Nomad host: `sudo systemctl restart docker`
  4. Try running the same network command again; it will fail.

This discovery is most likely linked to https://github.com/hashicorp/nomad/issues/19962, which already describes the issue. I would assume (without any confirmation yet) that this is what happened to us, too: when there is a new Docker release, we install it, which usually requires a Docker service restart. As a consequence, the network loss could occur.

I haven't fully checked the linked issue for a reasonable workaround to this problem, but I am afraid that the issue has not been completely solved yet.

mpass99 commented 6 months ago

Good finding 💪 I'm glad to know about this issue after all the times we wondered whether we were seeing the same problem 😄

The reasoning described in the issue seems plausible: when using CNI, Nomad handles the network (interfaces) instead of Docker. When we restart Docker, Nomad recreates the containers, but not the CNI network interfaces (on the host). The containers are then no longer able to establish network access.

As Nomad currently prioritizes this issue, it might be solved in the future. In the meantime, we could add a check to our Nomad Agent Ansible playbook (sketched below):

  1. Check if containers are running
  2. For each running container, check if the NetworkMode is `none`
  3. For one running, network-enabled container, check if `curl api.ipify.org` succeeds
  4. If not successful, restart the Nomad service
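
A simplified sketch of such a check (it omits the NetworkMode inspection; the host group, module choices, and the `api.ipify.org` probe are assumptions, not an excerpt from the actual playbook):

```yaml
- name: Restart Nomad if network-enabled containers lost internet access
  hosts: nomad_agents
  become: true
  tasks:
    - name: List running containers
      ansible.builtin.command: docker ps --quiet
      register: running_containers
      changed_when: false

    - name: Probe internet access from the first running container
      ansible.builtin.command: >
        docker exec {{ running_containers.stdout_lines | first }}
        curl --silent --max-time 10 api.ipify.org
      register: network_probe
      changed_when: false
      failed_when: false
      when: running_containers.stdout_lines | length > 0

    - name: Restart the Nomad service when the probe failed
      ansible.builtin.service:
        name: nomad
        state: restarted
      when:
        - network_probe is not skipped
        - network_probe.rc != 0
```
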
MrSerth commented 6 months ago

> Good finding 💪 I'm glad to know about this issue after all the times we wondered whether we were seeing the same problem 😄

We just want to improve the service, so any change for better reliability is warmly welcomed :+1:

> As Nomad currently prioritizes this issue, it might be solved in the future. In the meantime, we could add a check to our Nomad Agent Ansible playbook: [...]

Yes, I would also continue with an intermediate solution of our own. In chats with my colleagues today, we discovered another potential solution: systemd. Proposed was `PartOf=`, but maybe another option such as `BindsTo=` or `Requires=` would work too. Here is a comparison with several tests and a table that might be useful (so that we don't need to repeat that).

The idea would be to link Docker and Nomad, since this would automatically resolve the issue (at least the cases caused by Docker restarts). We could give it a try and observe the behavior. For overriding a systemd unit, one can just add a drop-in config (manually by executing `sudo systemctl edit foo.service`, or by placing the new settings in `/etc/systemd/system/foo.service.d/override.conf`).
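
For illustration, such a drop-in could look like the following, assuming the Nomad agent runs as `nomad.service` (the unit name is an assumption):

```ini
# /etc/systemd/system/nomad.service.d/override.conf
[Unit]
# Stop and restart Nomad whenever docker.service is stopped or restarted
# (e.g., during a Docker upgrade), so allocations get their network set up again.
PartOf=docker.service
```

After adding the drop-in, a `sudo systemctl daemon-reload` applies the change.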

mpass99 commented 5 months ago

Thank you for this other solution! It is less complicated and more reliable. Let's go with `PartOf=`, as it restarts Nomad when Docker restarts (like `Requires=`), but does not start Nomad when Docker starts (unlike `Requires=`).

MrSerth commented 5 months ago

Awesome, sounds great! I've merged (and deployed) the corresponding PR, and thus will close this issue for now.